---
library_name: vllm
pipeline_tag: automatic-speech-recognition
license: apache-2.0
base_model:
- mistralai/Voxtral-Mini-4B-Realtime-2602
inference: false
tags:
- resilientchallenge2026
- voxtral
- vllm
- audio-to-text
---

# RCIA Round 1 - Voxtral Mini Realtime vLLM ctx512

This repository is the Round 1 audio-to-text submission package for Voxtral Mini 4B Realtime.

The model weights are the unmodified BF16 checkpoint. The submission focuses on
vLLM inference configuration: four parameters tuned away from their defaults
reduce P50 latency by a measured **21%** relative to the vLLM default
configuration on our 500-sample benchmark, with a 100% completion rate.

Additional background on the Round 1 exploration process is included in
`ROUND1_METHODOLOGY.md`.

## Optimization Target

This Round 1 package is intentionally optimized for **P50 latency** (median
user experience) while keeping transcription quality and full-run stability.
Tail latencies also improve versus the baseline, but less than P50:

- P50: about **-21%**
- P90: about **-5%**
- P95: about **-1%** (near-neutral in strict reruns that mimic the organizers' setup)

In other words, the main gain is faster typical requests, with WER preserved
and no regressions in completion rate on the 500-sample validation.

## What Changed vs Default vLLM Configuration

| Parameter | vLLM default | This submission | Why |
| --- | --- | --- | --- |
| `max_model_len` | 131072 | **512** | Voxtral audio inputs rarely exceed 300 tokens. Allocating 131k tokens wastes VRAM on KV cache blocks that are never used, adding latency at startup and request time. |
| `gpu_memory_utilization` | 0.90 | **0.65** | The 0.60 and 0.40 benchmark profiles were measured on an RTX 4090 24GB. The submission keeps the same absolute VRAM target, about 10 GB, which maps to 0.65 on a 16GB L4. |
| `enforce_eager` | false | **true** | CUDA graph capture fails or is unstable with the audio multimodal path in vLLM 0.20.0. Eager mode avoids the issue with no measurable throughput penalty at `max_num_seqs=1`. |
| `max_num_seqs` | 256 | **1** | Transcription is a one-request-at-a-time workload. Reducing the batch limit eliminates memory reserved for concurrent sequences that never arrive. |

## Validation

Local validation was run on 500 audio samples with the same client harness used throughout the strategy search.

| Package | Samples | P50 latency | P90 latency | P95 latency | WER | Peak VRAM |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| `strategy-4044-vllm-ctx512` | 500/500 | 2635.99 ms | 6258.58 ms | 7725.58 ms | 0.0179907 | 14.52 GB |
| `strategy-4056-vllm-ctx512-gpu040` | 500/500 | 2680.05 ms | 6334.8 ms | 7862.63 ms | 0.0179907 | 9.83 GB |

The benchmark table above records the two measured profiles on an RTX 4090 24GB (`gpu_memory_utilization=0.6` and `0.4`). Rather than reusing the fraction directly on a 16GB L4, the packaged submission keeps the validated absolute VRAM budget, about 9.8-10 GB from the low-VRAM profile, which corresponds to `gpu_memory_utilization=0.65` on the contest GPU.
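
As a rough illustration of what the low-VRAM profile costs, the two table rows can be compared directly. The sketch below only reuses the numbers from the table above; the helper name `relative_change_pct` is ours and not part of the benchmark harness.

```python
# Values copied from the validation table (both runs on an RTX 4090 24GB).
HIGH_VRAM = {"p50_ms": 2635.99, "p90_ms": 6258.58, "p95_ms": 7725.58, "peak_vram_gb": 14.52}
LOW_VRAM = {"p50_ms": 2680.05, "p90_ms": 6334.8, "p95_ms": 7862.63, "peak_vram_gb": 9.83}


def relative_change_pct(new, old):
    """Signed change of `new` relative to `old`, in percent."""
    return (new - old) / old * 100.0


# Positive values mean the low-VRAM profile is higher (slower or larger).
for metric in HIGH_VRAM:
    change = relative_change_pct(LOW_VRAM[metric], HIGH_VRAM[metric])
    print(f"{metric}: {change:+.2f}%")
```

On these numbers the low-VRAM profile is less than 2% slower at P50, P90, and P95 while using roughly a third less peak VRAM, which is why its footprint was carried over to the contest GPU.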

## Submission Rationale

For Round 1, we focused on the vLLM inference configuration because we did not have sufficient compute available to run serious weight-compression work until the final week, when we gained access to an RTX 4090. Given the short validation window, we prioritized a reproducible and robust submission path: keep the BF16 checkpoint format, reduce the effective runtime footprint through vLLM configuration, and validate latency, memory use, and WER across the available benchmark set. After revisiting the absolute VRAM budget on the 16GB contest L4, we kept the validated low-VRAM footprint from the 24GB RTX 4090, about 10 GB, which led to `gpu_memory_utilization=0.65` instead of reusing the earlier `0.40` fraction.
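
The fraction-to-budget reasoning can be checked with a few lines of arithmetic. This is a minimal sketch that assumes nominal card capacities (24 GB and 16 GB); the memory actually reported by the driver is slightly lower, so the results are approximate.

```python
def budget_gb(fraction, total_gb):
    """Absolute VRAM budget implied by a gpu_memory_utilization fraction."""
    return fraction * total_gb


def fraction_for_budget(target_gb, total_gb):
    """Fraction needed to reach an absolute VRAM budget on a given card."""
    return target_gb / total_gb


# Low-VRAM profile as measured on the 24 GB card.
print(round(budget_gb(0.40, 24), 2))            # budget validated on the RTX 4090
# Reusing the same fraction on a 16 GB card would shrink the budget.
print(round(budget_gb(0.40, 16), 2))
# Fraction that restores a ~10 GB budget on 16 GB, and the packaged value.
print(round(fraction_for_budget(10.0, 16), 3))
print(round(budget_gb(0.65, 16), 2))
```

Reusing `0.40` unchanged would leave only about 6.4 GB on a 16 GB card, well below the validated footprint, whereas `0.65` gives about 10.4 GB.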

Compression experiments (structured pruning, FP8) were evaluated but not retained: each candidate either failed to start reliably or degraded WER to about 21% or higher.

## Why sitecustomize.py is Required

vLLM 0.20.0 does not fully support the `mistral-common` tokenizer backend that
Voxtral uses. Three separate crashes occur on a stock vLLM 0.20.0 installation:

**1. `AttributeError: CachedMistralCommonBackend has no attribute is_fast`**
vLLM checks `tokenizer.is_fast` during server startup. The
`MistralCommonBackend` class in `transformers` does not expose this attribute.
Fix: set `is_fast = False` on both `PreTrainedTokenizerBase` and
`MistralCommonBackend` before vLLM's tokenizer code runs.
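
A minimal sketch of this kind of attribute patch is shown below. It is not the shipped `sitecustomize.py`; the helper name and the stand-in classes are hypothetical so that the example runs without `transformers` installed.

```python
def force_slow_tokenizer_flag(*classes):
    """Set a class-level `is_fast = False` on each given class."""
    for cls in classes:
        cls.is_fast = False


# Hypothetical stand-ins for PreTrainedTokenizerBase and MistralCommonBackend.
class StandInTokenizerBase:
    pass


class StandInMistralBackend(StandInTokenizerBase):
    pass


force_slow_tokenizer_flag(StandInTokenizerBase, StandInMistralBackend)
print(StandInMistralBackend().is_fast)  # False
```

In a real `sitecustomize.py` the same call would receive the actual tokenizer classes, ideally imported inside a `try`/`except ImportError` block so that interpreters without `transformers` still start normally.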

**2. `TypeError: from_pretrained() got unexpected keyword argument 'subfolder'`**
vLLM passes `subfolder`, `tokenizer`, and `_from_auto` kwargs to
`MistralCommonBackend.from_pretrained`, which does not accept them.
Fix: wrap `from_pretrained` to silently drop those three kwargs.
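
The wrapping technique can be sketched as follows, again on a hypothetical stand-in class rather than the real `MistralCommonBackend`. Re-wrapping as a `classmethod` keeps subclasses working, since the wrapper forwards whichever class it was called on.

```python
import functools

# Keyword arguments named in the crash description above.
DROPPED_KWARGS = ("subfolder", "tokenizer", "_from_auto")


def drop_unsupported_kwargs(cls, names=DROPPED_KWARGS):
    """Replace cls.from_pretrained with a wrapper that discards `names`."""
    original = cls.from_pretrained.__func__  # plain function behind the classmethod

    @functools.wraps(original)
    def patched(klass, *args, **kwargs):
        for name in names:
            kwargs.pop(name, None)  # silently ignore the unsupported kwargs
        return original(klass, *args, **kwargs)

    cls.from_pretrained = classmethod(patched)
    return cls


# Hypothetical stand-in with a strict from_pretrained signature.
class StandInBackend:
    @classmethod
    def from_pretrained(cls, path, revision=None):
        return (cls.__name__, path, revision)


drop_unsupported_kwargs(StandInBackend)
print(StandInBackend.from_pretrained(
    "model-dir", subfolder="", tokenizer=None, _from_auto=True, revision="main"
))
```

Arguments that the original method does accept, such as `revision` in this stand-in, pass through unchanged.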

**3. `AttributeError: VoxtralRealtimeProcessor has no attribute _get_num_multimodal_tokens`**
When the model directory contains only `model.safetensors` (HF format, no
`consolidated.safetensors`), vLLM uses the HF loader path and calls
`processor._get_num_multimodal_tokens()` to pre-allocate the audio token
buffer. `VoxtralRealtimeProcessor` does not implement this method.
Fix: add a stub that returns a conservative budget of 480 tokens
(`max_model_len=512` minus 32 tokens of text headroom).
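
The stub and its budget arithmetic might look like the sketch below. The text above does not specify the real method's signature or return type, so the stub here accepts any arguments and returns a plain integer as an assumption; the helper name and stand-in class are hypothetical.

```python
MAX_MODEL_LEN = 512
TEXT_HEADROOM = 32  # tokens left for the text side of the prompt
AUDIO_TOKEN_BUDGET = MAX_MODEL_LEN - TEXT_HEADROOM  # 480


def add_token_budget_stub(cls, budget=AUDIO_TOKEN_BUDGET):
    """Attach a fixed-budget _get_num_multimodal_tokens if cls lacks one."""
    if hasattr(cls, "_get_num_multimodal_tokens"):
        return cls  # a real implementation exists; leave it untouched

    def _get_num_multimodal_tokens(self, *args, **kwargs):
        # Assumption: a constant upper bound is enough for pre-allocation.
        return budget

    cls._get_num_multimodal_tokens = _get_num_multimodal_tokens
    return cls


# Hypothetical stand-in for VoxtralRealtimeProcessor.
class StandInProcessor:
    pass


add_token_budget_stub(StandInProcessor)
print(StandInProcessor()._get_num_multimodal_tokens())  # 480
```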

Python's `sitecustomize.py` is the standard hook for this kind of runtime
patch: the `site` module imports it automatically at interpreter startup, before
any user code runs, so the patches are in place before vLLM loads its modules.
The `Dockerfile` included in this repository copies the file into the Docker
image at `/usr/lib/python3.12/sitecustomize.py`.

Both files (`sitecustomize.py` and `Dockerfile`) are included here so the
deployment is fully self-contained.

## Deployment

Build the patched Docker image from the repository root, then run it:

```bash
# From the cloned repository root
docker build -t vllm-voxtral .
docker run --gpus all --rm \
    --shm-size 4g \
    -v "$(pwd)":/model:ro \
    -w /model \
    -p 8000:8000 \
    vllm-voxtral \
    --config vllm_config.yaml
```

Challenge-side canonical launch from the HF repository root:

```bash
vllm serve --config vllm_config.yaml
```

Why `max_model_len: 512` is explicit in `vllm_config.yaml`:
with vLLM 0.20.0, the model default context (`131072`) can over-allocate
KV-cache and fail initialization on 24 GB GPUs. `512` matches the validated
audio workload and keeps startup reliable.

Equivalent direct-CLI form (without config file):

```bash
VLLM_DISABLE_COMPILE_CACHE=1 \
vllm serve Franck-J/rcia-voxtral-round1 \
  --dtype bfloat16 \
  --compilation-config '{"cudagraph_mode":"PIECEWISE"}' \
  --max-model-len 512 \
  --gpu-memory-utilization 0.65 \
  --max-num-seqs 1 \
  --enforce-eager
```

Alternatively, without Docker, apply the patch directly to the Python
installation used by vLLM (adjust the path to match your Python version):

```bash
cp sitecustomize.py /usr/lib/python3.12/sitecustomize.py
vllm serve --config vllm_config.yaml
```
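
To find the right target path for a given interpreter, and to confirm afterwards that the patch file was picked up, a short check along these lines may help. It is a sketch that assumes the file is placed in the standard-library directory, as in the `cp` command above; the function names are ours.

```python
import sys
import sysconfig
from pathlib import Path


def stdlib_sitecustomize_target():
    """Path of sitecustomize.py inside this interpreter's stdlib directory."""
    return Path(sysconfig.get_paths()["stdlib"]) / "sitecustomize.py"


def loaded_sitecustomize():
    """File of the sitecustomize module loaded at startup, or None."""
    module = sys.modules.get("sitecustomize")
    return getattr(module, "__file__", None) if module else None


print("copy target:     ", stdlib_sitecustomize_target())
print("currently loaded:", loaded_sitecustomize())
```

Run it with the same Python executable that launches vLLM. If `currently loaded` prints `None` after copying, the interpreter did not import the file, for example because it was started with `-S`.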

## Inference

Once the server is healthy, the OpenAI-compatible API is available at:

```text
http://localhost:8000/v1
```
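
Because startup can take a while, a client may want to wait for the server before sending audio. The sketch below polls vLLM's `/health` route using only the standard library; the function names, timeout, and polling interval are illustrative choices, not part of the submission.

```python
import time
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def health_url(api_base):
    """Derive the server-level /health URL from the /v1 API base URL."""
    parts = urlsplit(api_base)
    return f"{parts.scheme}://{parts.netloc}/health"


def wait_until_healthy(api_base, timeout_s=300.0, interval_s=2.0):
    """Poll /health until it answers 200 or the timeout expires."""
    url = health_url(api_base)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet
        time.sleep(interval_s)
    return False
```

For example, `wait_until_healthy("http://localhost:8000/v1")` returns `True` as soon as the server responds, or `False` after five minutes without a healthy response.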

## vLLM Configuration

```yaml
model: .
dtype: bfloat16
enforce_eager: true
max_num_seqs: 1
gpu_memory_utilization: 0.65
max_model_len: 512
compilation_config:
  cudagraph_mode: PIECEWISE
```
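
For readers who prefer flags over a config file, the mapping from these keys to a `vllm serve` command line can be sketched as below. It assumes the usual convention that underscores in config keys become dashes in flag names, which holds for the keys shown here; the `CONFIG` dictionary simply mirrors the YAML above and `to_cli_args` is a hypothetical helper.

```python
import json
import shlex

# Mirror of the YAML configuration above.
CONFIG = {
    "model": ".",
    "dtype": "bfloat16",
    "enforce_eager": True,
    "max_num_seqs": 1,
    "gpu_memory_utilization": 0.65,
    "max_model_len": 512,
    "compilation_config": {"cudagraph_mode": "PIECEWISE"},
}


def to_cli_args(config):
    """Translate config keys into a `vllm serve` argument list."""
    args = ["vllm", "serve", str(config["model"])]  # model is positional
    for key, value in config.items():
        if key == "model":
            continue
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            if value:
                args.append(flag)  # boolean options are bare flags
        elif isinstance(value, dict):
            args += [flag, json.dumps(value, separators=(",", ":"))]
        else:
            args += [flag, str(value)]
    return args


print(shlex.join(to_cli_args(CONFIG)))
```

`shlex.join` quotes the JSON value so the printed command can be pasted into a shell as-is.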

## Compression And Optimization Notes

The following Round 1 candidates were explored:

- FP8 KV-cache runtime configuration: server did not become healthy.
- Direct Hugging Face runtime wrapper: functional, but slower than the vLLM baseline.
- Structured pruning runtime: produced a smaller checkpoint, but WER degraded to about 21.22%.
- TensorRT wrapper track: export contract probe was prepared, but not a validated submission runtime.
- vLLM context and memory sweep: `ctx512` with `gpu_memory_utilization: 0.60` was the best measured package on RTX 4090 24GB; the published submission uses `0.65` to preserve a comparable absolute VRAM budget on L4 16GB.

Because the weight-compressed candidates were not competitive, this package prioritizes technical validation and transcription quality for Round 1.

## License

The base model is released under Apache-2.0. This submission keeps the same license.