---
library_name: vllm
pipeline_tag: automatic-speech-recognition
license: apache-2.0
base_model:
- mistralai/Voxtral-Mini-4B-Realtime-2602
inference: false
tags:
- resilientchallenge2026
- voxtral
- vllm
- audio-to-text
---

# RCIA Round 1 - Voxtral Mini Realtime vLLM ctx512

This repository is the Round 1 audio-to-text submission package for Voxtral Mini 4B Realtime. The model weights are the unmodified BF16 checkpoint. The submission focuses on vLLM inference configuration: four parameters tuned away from their defaults give a measured **21% reduction in P50 latency** relative to the vLLM default configuration on our 500-sample benchmark, with a 100% completion rate.

Additional background on the Round 1 exploration process is included in `ROUND1_METHODOLOGY.md`.

## Optimization Target

This Round 1 package is intentionally optimized for **P50 latency** (median user experience) while preserving transcription quality and full-run stability. Tail latencies also improve versus the baseline, but less than P50:

- P50: about **-21%**
- P90: about **-5%**
- P95: about **-1%** (near-neutral in strict reruns that mimic the organizers' setup)

In other words, the main gain is faster typical requests, with WER preserved and no regression in completion rate on the 500-sample validation.

## What Changed vs Default vLLM Configuration

| Parameter | vLLM default | This submission | Why |
| --- | --- | --- | --- |
| `max_model_len` | 131072 | **512** | Voxtral audio inputs rarely exceed 300 tokens. Allocating 131k tokens wastes VRAM on KV cache blocks that are never used, adding latency at startup and request time. |
| `gpu_memory_utilization` | 0.90 | **0.65** | The 0.60 and 0.40 benchmark profiles were measured on an RTX 4090 24GB. The submission keeps the same absolute VRAM target, about 10 GB, which maps to 0.65 on a 16GB L4. |
| `enforce_eager` | false | **true** | CUDA graph capture fails or is unstable with the audio multimodal path in vLLM 0.20.0. Eager mode avoids the issue with no measurable throughput penalty at `max_num_seqs=1`. |
| `max_num_seqs` | 256 | **1** | Transcription is a one-request-at-a-time workload. Reducing the batch limit eliminates memory reserved for concurrent sequences that never arrive. |

## Validation

Local validation was run on 500 audio samples with the same client harness used throughout the strategy search.

| Package | Samples | P50 latency | P90 latency | P95 latency | WER | Peak VRAM |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| `strategy-4044-vllm-ctx512` | 500/500 | 2635.99 ms | 6258.58 ms | 7725.58 ms | 0.0179907 | 14.52 GB |
| `strategy-4056-vllm-ctx512-gpu040` | 500/500 | 2680.05 ms | 6334.8 ms | 7862.63 ms | 0.0179907 | 9.83 GB |

The benchmark table above records the two measured profiles on an RTX 4090 24GB (`gpu_memory_utilization=0.6` and `0.4`). Rather than reusing the fraction directly on a 16GB L4, the packaged submission keeps the validated absolute VRAM budget, about 9.8-10 GB from the low-VRAM profile, which corresponds to `gpu_memory_utilization=0.65` on the contest GPU.

## Submission Rationale

For Round 1, we focused on the vLLM inference configuration because we did not have sufficient compute available to run serious weight-compression work until the final week, when we gained access to an RTX 4090.

Given the short validation window, we prioritized a reproducible and robust submission path: keep the BF16 checkpoint format, reduce the effective runtime footprint through vLLM configuration, and validate latency, memory use, and WER across the available benchmark set.

After revisiting the absolute VRAM budget on the 16GB contest L4, we kept the validated low-VRAM footprint from the 24GB RTX 4090, about 10 GB, which led to `gpu_memory_utilization=0.65` instead of reusing the earlier `0.40` fraction.
Weight-compression experiments (structured pruning, FP8) were evaluated but not retained because they either failed to start reliably or degraded WER to ~21% or higher.

## Why sitecustomize.py is Required

vLLM 0.20.0 does not fully support the `mistral-common` tokenizer backend that Voxtral uses. Three separate crashes occur on a stock vLLM 0.20.0 installation:

**1. `AttributeError: CachedMistralCommonBackend has no attribute is_fast`**

vLLM checks `tokenizer.is_fast` during server startup. The `MistralCommonBackend` class in `transformers` does not expose this attribute. Fix: set `is_fast = False` on both `PreTrainedTokenizerBase` and `MistralCommonBackend` before vLLM's tokenizer code runs.

**2. `TypeError: from_pretrained() got unexpected keyword argument 'subfolder'`**

vLLM passes `subfolder`, `tokenizer`, and `_from_auto` kwargs to `MistralCommonBackend.from_pretrained`, which does not accept them. Fix: wrap `from_pretrained` to silently drop those three kwargs.

**3. `AttributeError: VoxtralRealtimeProcessor has no attribute _get_num_multimodal_tokens`**

When the model directory contains only `model.safetensors` (HF format, no `consolidated.safetensors`), vLLM uses the HF loader path and calls `processor._get_num_multimodal_tokens()` to pre-allocate the audio token buffer. `VoxtralRealtimeProcessor` does not implement this method. Fix: add a stub that returns a conservative budget of 480 tokens (`max_model_len=512` minus a small text headroom).

Python's `sitecustomize.py` is a standard hook for startup customization: the `site` module imports it automatically during interpreter startup, before any user code runs, so the patches are in place before vLLM loads its modules. The file is copied into the Docker image at `/usr/lib/python3.12/sitecustomize.py` by the `Dockerfile` included in this repository.

Both files (`sitecustomize.py` and `Dockerfile`) are included here so the deployment is fully self-contained.

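
The three fixes described above follow a common monkey-patching pattern, which can be sketched as below. This is an illustrative sketch, not the `sitecustomize.py` shipped in this repository: the helper names (`ensure_is_fast`, `drop_unsupported_kwargs`, `add_token_budget_stub`) and the stand-in classes `FakeTokenizer` and `FakeProcessor` are hypothetical, chosen so the example runs without `transformers` or vLLM installed. In a real patch file the targets would be `PreTrainedTokenizerBase`, `MistralCommonBackend`, and `VoxtralRealtimeProcessor`; their import paths and the exact return shape vLLM expects from the stub are not stated here and are left as assumptions.

```python
import functools

# Kwargs that, per the text above, vLLM passes but the tokenizer backend rejects.
UNSUPPORTED_KWARGS = ("subfolder", "tokenizer", "_from_auto")
# Conservative audio-token budget from the text: max_model_len=512 minus text headroom.
AUDIO_TOKEN_BUDGET = 480


def ensure_is_fast(cls):
    """Patch 1: expose `is_fast = False` if the class lacks the attribute."""
    if not hasattr(cls, "is_fast"):
        cls.is_fast = False


def drop_unsupported_kwargs(cls, names=UNSUPPORTED_KWARGS):
    """Patch 2: wrap the `from_pretrained` classmethod so it discards `names`."""
    original = cls.from_pretrained.__func__  # plain function behind the classmethod

    @functools.wraps(original)
    def wrapper(klass, *args, **kwargs):
        for name in names:
            kwargs.pop(name, None)  # silently ignore kwargs the backend rejects
        return original(klass, *args, **kwargs)

    cls.from_pretrained = classmethod(wrapper)


def add_token_budget_stub(cls, budget=AUDIO_TOKEN_BUDGET):
    """Patch 3: add a `_get_num_multimodal_tokens` stub if it is missing."""
    if hasattr(cls, "_get_num_multimodal_tokens"):
        return

    def _get_num_multimodal_tokens(self, *args, **kwargs):
        # Assumption: a plain integer is returned here for illustration only.
        return budget

    cls._get_num_multimodal_tokens = _get_num_multimodal_tokens


# Hypothetical stand-ins so the sketch runs without transformers or vLLM.
class FakeTokenizer:
    @classmethod
    def from_pretrained(cls, path, revision=None):
        return (cls.__name__, path, revision)


class FakeProcessor:
    pass


ensure_is_fast(FakeTokenizer)
drop_unsupported_kwargs(FakeTokenizer)
add_token_budget_stub(FakeProcessor)

# Without patch 2 this call would raise TypeError on the extra kwargs.
loaded = FakeTokenizer.from_pretrained(
    "model-dir", subfolder="", tokenizer=None, _from_auto=True
)
print(loaded, FakeTokenizer.is_fast, FakeProcessor()._get_num_multimodal_tokens())
```

Guarding patches 1 and 3 with `hasattr` keeps the sketch from overriding a real implementation if a later library release adds the missing attribute or method.
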
## Deployment

**Option A: build the patched Docker image** from the repository root, then run it:

```bash
# From the cloned repository root
docker build -t vllm-voxtral .
docker run --gpus all --rm \
  --shm-size 4g \
  -v "$(pwd)":/model:ro \
  -w /model \
  -p 8000:8000 \
  vllm-voxtral \
  --config vllm_config.yaml
```

Challenge-side canonical launch from the HF repository root:

```bash
vllm serve --config vllm_config.yaml
```

Why `max_model_len: 512` is explicit in `vllm_config.yaml`: with vLLM 0.20.0, the model default context (`131072`) can over-allocate KV-cache and fail initialization on 24 GB GPUs. `512` matches the validated audio workload and keeps startup reliable.

Equivalent direct-CLI form (without config file):

```bash
VLLM_DISABLE_COMPILE_CACHE=1 \
vllm serve Franck-J/rcia-voxtral-round1 \
  --compilation-config '{"cudagraph_mode":"PIECEWISE"}' \
  --max-model-len 512 \
  --gpu-memory-utilization 0.65 \
  --max-num-seqs 1 \
  --enforce-eager
```

**Option B: apply the patch directly** to the Python installation used by vLLM (adjust the path to match your Python version):

```bash
cp sitecustomize.py /usr/lib/python3.12/sitecustomize.py
vllm serve --config vllm_config.yaml
```

## Inference

Once the server is healthy, the expected endpoint is:

```text
http://localhost:8000/v1
```

## vLLM Configuration

```yaml
model: .
dtype: bfloat16
enforce_eager: true
max_num_seqs: 1
gpu_memory_utilization: 0.65
max_model_len: 512
compilation_config:
  cudagraph_mode: PIECEWISE
```

## Compression And Optimization Notes

The following Round 1 candidates were explored:

- FP8 KV-cache runtime configuration: server did not become healthy.
- Direct Hugging Face runtime wrapper: functional, but slower than the vLLM baseline.
- Structured pruning runtime: produced a smaller checkpoint, but WER degraded to about 21.22%.
- TensorRT wrapper track: an export contract probe was prepared, but it was not validated as a submission runtime.
- vLLM context and memory sweep: `ctx512` with `gpu_memory_utilization: 0.60` was the best measured package on RTX 4090 24GB; the published submission uses `0.65` to preserve a comparable absolute VRAM budget on L4 16GB.

Because the weight-compressed candidates were not competitive, this package prioritizes technical validation and transcription quality for Round 1.

## License

The base model is released under Apache-2.0. This submission keeps the same license.