---
license: apache-2.0
base_model: mistralai/Voxtral-Mini-4B-Realtime-2602
tags:
- audio
- automatic-speech-recognition
- lora
- peft
- voxtral
- voxtral-realtime
- affect-tagging
- expressive-tags
- half-duplex
- elevenlabs-tags
- raft
- rejection-sampling
- rlhf
library_name: peft
pipeline_tag: automatic-speech-recognition
---

# Evoxtral-Realtime RL (Recipe I + RAFT, production default)

LoRA adapter on top of [`mistralai/Voxtral-Mini-4B-Realtime-2602`](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) that emits ElevenLabs-style expressive tags (`[whispers]`, `[sighs]`, `[laughs]`, `[pause]`, etc.) from audio.

**This is the production default** for the half-duplex AI-therapist Mode B hybrid pipeline. It is the RAFT-polished version of [`evoxtral-realtime-sft`](https://huggingface.co/YongkangZOU/evoxtral-realtime-sft).

## What changed vs SFT

This adapter starts from the [SFT checkpoint](https://huggingface.co/YongkangZOU/evoxtral-realtime-sft) and runs **Stage 2 RAFT** ([Reward rAnked FineTuning, Dong et al. 2023](https://arxiv.org/abs/2304.06767)):

1. **Generate:** sample N=4 completions per training input from the SFT model at temperature=0.7 (3232 total samples).
2. **Score:** rule-based reward `0.4 × wer_accuracy + 0.4 × tag_f1 + 0.2 × (1 − hallucination_rate)`.
3. **Curate:** keep the highest-reward completion per sample, then drop the bottom 10%. ~727 curated samples remain.
4. **SFT-on-curated:** 1 epoch (46 steps) at lr=5e-5 from the SFT checkpoint.

Effect vs SFT alone: **lower hallucination rate** (61% → 57% on raw output, 61% → 53% with the `top_k=2` filter), slightly fewer tags emitted on average, Tag F1 / Recall ≈ flat. The RAFT gain is marginal here because the rule-based reward lacks an absolute anti-overemit term: it ranks by the *rate* of wrong tags, not their total count, so over-emitting fallback patterns survive curation. See the project's [`prior_work.md` Phase 4](https://github.com/Tame-Your-Monkey/evoxtral-realtime/blob/main/.claude/docs/prior_work.md) for the full diagnosis.
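
The Score and Curate steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's `rl_modal.py`: the `Completion` dataclass and the `reward` and `curate` functions are hypothetical names, the per-completion metrics are assumed to be precomputed, and the way the 10% cut is rounded is a guess.

```python
from dataclasses import dataclass


@dataclass
class Completion:
    """One sampled completion with precomputed metrics (hypothetical container)."""
    sample_id: int             # index of the training input it was sampled for
    wer_accuracy: float        # 1 - WER, assumed to lie in [0, 1]
    tag_f1: float              # tag-level F1 against the reference tags
    hallucination_rate: float  # fraction of predicted tags absent from the reference


def reward(c: Completion) -> float:
    """Rule-based reward with the 0.4 / 0.4 / 0.2 weights quoted in the card."""
    return 0.4 * c.wer_accuracy + 0.4 * c.tag_f1 + 0.2 * (1.0 - c.hallucination_rate)


def curate(completions: list[Completion], drop_fraction: float = 0.10) -> list[Completion]:
    """Keep the best completion per sample, then drop the lowest-reward fraction."""
    best: dict[int, Completion] = {}
    for c in completions:
        current = best.get(c.sample_id)
        if current is None or reward(c) > reward(current):
            best[c.sample_id] = c
    # Rank the per-sample winners and keep the top (1 - drop_fraction) share.
    ranked = sorted(best.values(), key=reward, reverse=True)
    n_keep = int(len(ranked) * (1.0 - drop_fraction))  # rounding rule is an assumption
    return ranked[:n_keep]
```

Because `hallucination_rate` is a ratio, a completion that emits many tags with the same share of wrong ones scores the same as a short one, which is the over-emission blind spot described above.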

## Architecture: Moshi-style backchannel

This adapter is **tag-only**: it does NOT produce ASR text. Pair it with the frozen base model for ASR and merge the outputs at inference:

```
audio ─┬─ base Voxtral-Mini-4B-Realtime-2602 ─→ ASR text (clean WER ~10%)
       └─ this adapter (LoRA + RAFT)         ─→ tag stream → top_k=2 filter

merged: "[whispers] [pause] Listen, I know you're in a meeting"
```

The dual-channel pattern is inspired by Moshi's parallel-stream design, adapted to Voxtral Realtime's element-wise audio-text fusion architecture.

Reference Mode B implementation: [`serve_modal.py`](https://github.com/Tame-Your-Monkey/evoxtral-realtime/blob/main/training/scripts/serve_modal.py) (Modal-deployed FastAPI, two model instances on a single A100-40GB, parallel forward via `asyncio.gather`, top-K filter, JSON merged output).

## Performance (50-sample test set, greedy)

| Metric | Base | SFT only | **RL (this) raw** | **RL + top_k=2 filter (production)** |
|---|---|---|---|---|
| Tag F1 | 22% | 28% | 28% | **29%** ⭐ |
| Tag Recall | 22% | 51% | 50% | 42% |
| Tag Precision | 100% | 34% | 37% | **47%** |
| Tag Hallucination | 0% | 61% | 57% | **53%** |
| WER (text from base) | 10% | n/a | n/a | 10% (unchanged) |

Production config = **this adapter + base for ASR + `top_k=2` inference filter**, i.e. the rightmost column of the table above.

## Quick start

```python
import torch
from transformers import VoxtralRealtimeForConditionalGeneration, AutoProcessor
from peft import PeftModel

processor = AutoProcessor.from_pretrained("mistralai/Voxtral-Mini-4B-Realtime-2602")
base = VoxtralRealtimeForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Mini-4B-Realtime-2602",
    dtype=torch.bfloat16,
    device_map="auto",
)
tag_model = PeftModel.from_pretrained(base, "YongkangZOU/evoxtral-realtime-rl")
tag_model.eval()

# `tag_model` produces the tag stream. PEFT injects the LoRA layers into `base` in place,
# so for clean ASR text either load a second base instance (as serve_modal.py does) or
# generate inside `with tag_model.disable_adapter():`. See serve_modal.py for the full hybrid.
```

For end-to-end use (POST audio file → JSON with `text`, `tags_filtered`, `merged`), the project repo ships a Modal-deployed FastAPI server with parallel forward + top-K filter built in.

## Training details

**Stage 1 inheritance:** see the [SFT card](https://huggingface.co/YongkangZOU/evoxtral-realtime-sft) for the v1-style packed schema, tags-only target, LoRA r=16/α=64 attention-only, and frozen audio path.

**Stage 2 RAFT additions:**

- **Method:** RAFT (rejection sampling + plain SFT). No critic, no KL clipping, no learned reward model.
- **Generation:** N=4 × 808 train samples = 3232 completions, temperature=0.7, top_p=0.9, max_new_tokens=64. ~33 min on A100-40GB.
- **Reward function:** `0.4 × (1 − WER) + 0.4 × tag_f1 + 0.2 × (1 − hall_rate)` (rule-based; for this backchannel adapter the WER term is a constant 0 because predictions contain no text content, so the reward effectively scores tag quality only).
- **Curated set:** 727 samples after the bottom-10% reward filter.
- **SFT-on-curated:** 1 epoch (46 steps), lr=5e-5, cosine schedule, warmup=20, **gradient_checkpointing=False** (PeftModel.from_pretrained + checkpointing crashes on the in-place audio add; see the project cheat-sheet).
- **Trainable:** 16.2 M of 4.5 B (0.36%). Slightly higher than SFT due to PeftModel.from_pretrained loading.
- **Hardware:** Modal A100-40GB, bf16, ~3 min runtime.

## RAFT pitfalls discovered along the way

The RAFT pipeline (`rl_modal.py` in the project repo) needed five fixes vs the original Stage 2 design before it ran clean. They are documented here for future RAFT-on-Voxtral-Realtime users:

1. **Audio pre-pad missing.** Generation must pre-pad raw audio to `AUDIO_MAX_SAMPLES=240_480` to match the train/eval audio path.
2. **Mel mod-8 padding missing.** The encoder reshape requires `T_mel % 8 == 0`.
3. **`max_new_tokens=512` excessive** for backchannel. Tag-only outputs are ~5-10 tokens; reduced to 64.
4. **`num_delay_tokens` scalar tensor breaks `num_return_sequences > 1`** in HF generate's `_expand_inputs_for_generation`.
   Drop the key before calling generate.
5. **`PeftModel.from_pretrained` + `gradient_checkpointing=True` crashes** on the in-place audio add at `modeling_voxtral_realtime.py:1078`. PeftModel.from_pretrained doesn't auto-freeze base params (unlike `get_peft_model`), and the checkpointing hook combined with frozen embeddings makes `inputs_embeds` a leaf-with-grad. Disable gradient_checkpointing for RAFT.

See the [hard-won facts cheat-sheet](https://github.com/Tame-Your-Monkey/evoxtral-realtime/blob/main/.claude/CLAUDE.md#hard-won-facts-about-voxtral-realtime-training) for the full set of Voxtral Realtime training pitfalls.

## Limitations

- **Default-emit fallback persists.** On uncertain audio, the model still emits `[calm] [pause] [clears throat]` as a default set. RAFT trims this slightly but doesn't eliminate it. This is a data-side limitation: the TTS-synthesized affect signal is too weak to differentiate ambiguous inputs.
- **Best with `top_k=2` filter.** Raw output over-emits ~4-6 tags per utterance. The inference-time top-K filter is the production config.
- **TTS dataset.** Trained on ElevenLabs-synthesized audio. Real clinical recordings are out of distribution.
- **Tag taxonomy fixed.** 15 base tags. Out-of-taxonomy concepts won't be tagged.
- **English only.**

## See also

- ⚙️ [`YongkangZOU/evoxtral-realtime-sft`](https://huggingface.co/YongkangZOU/evoxtral-realtime-sft): the SFT-only baseline that this adapter was bootstrapped from.
- 🏗️ [Project repository](https://github.com/Tame-Your-Monkey/evoxtral-realtime): full pipeline, evaluation harness, Mode B hybrid serve (`serve_modal.py`), design docs.
- 🎙️ [Voxtral-Mini-4B-Realtime-2602](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602): required base model.
- 📄 [RAFT paper (Dong et al. 2023)](https://arxiv.org/abs/2304.06767): the Reward rAnked FineTuning method this adapter uses for Stage 2.

## License

Apache-2.0, matching the base Voxtral Realtime license.

## Citation

```bibtex
@software{evoxtral_realtime_2026,
  title  = {Evoxtral-Realtime: RAFT-polished backchannel adapter for Voxtral-Mini-4B-Realtime},
  author = {Yongkang Zou},
  year   = {2026},
  url    = {https://github.com/Tame-Your-Monkey/evoxtral-realtime}
}

@misc{voxtral_mini_realtime,
  author = {Mistral AI},
  title  = {Voxtral-Mini-4B-Realtime-2602},
  year   = {2026},
  url    = {https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602}
}

@misc{dong2023raft,
  title  = {RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment},
  author = {Dong, Hanze and Xiong, Wei and Goyal, Deepanshu and Pan, Rui and Diao, Shizhe and Zhang, Jipeng and Shum, Kashun and Zhang, Tong},
  year   = {2023},
  eprint = {2304.06767},
  url    = {https://arxiv.org/abs/2304.06767}
}
```