---
license: apache-2.0
base_model: mistralai/Voxtral-Mini-4B-Realtime-2602
tags:
- audio
- automatic-speech-recognition
- lora
- peft
- voxtral
- voxtral-realtime
- affect-tagging
- expressive-tags
- half-duplex
- elevenlabs-tags
- raft
- rejection-sampling
- rlhf
library_name: peft
pipeline_tag: automatic-speech-recognition
---

# Evoxtral-Realtime RL (Recipe I + RAFT — production default)

LoRA adapter on top of [`mistralai/Voxtral-Mini-4B-Realtime-2602`](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) that emits ElevenLabs-style expressive tags (`[whispers]`, `[sighs]`, `[laughs]`, `[pause]`, etc.) from audio. **This is the production default** for the half-duplex AI-therapist Mode B hybrid pipeline. RAFT-polished version of [`evoxtral-realtime-sft`](https://huggingface.co/YongkangZOU/evoxtral-realtime-sft).

## What changed vs SFT

This adapter starts from the [SFT checkpoint](https://huggingface.co/YongkangZOU/evoxtral-realtime-sft) and runs **Stage 2 RAFT** ([Reward rAnked FineTuning, Dong et al. 2023](https://arxiv.org/abs/2304.06767)):

1. **Generate:** sample N=4 completions per training input from the SFT model at temperature=0.7 (808 inputs × 4 = 3232 completions).
2. **Score:** rule-based reward `0.4 × wer_accuracy + 0.4 × tag_f1 + 0.2 × (1 − hallucination_rate)`.
3. **Curate:** keep the highest-reward completion for each input, then drop the bottom 10% of those by reward. ~727 of the 808 inputs remain.
4. **SFT-on-curated:** 1 epoch (46 steps) at lr=5e-5 from the SFT checkpoint.
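
The score and curate steps can be sketched as below. This is a minimal illustration, not the code in `rl_modal.py`: the helper names (`tag_set`, `reward`, `curate`), the set-based tag F1, the definition of hallucination rate as the share of predicted tags missing from the reference, and the floor rounding of the 10% cut are all assumptions.

```python
import re

TAG_RE = re.compile(r"\[[^\[\]]+\]")


def tag_set(text: str) -> set[str]:
    """Distinct bracketed tags in a string, lower-cased."""
    return {t.lower() for t in TAG_RE.findall(text)}


def reward(pred: str, ref: str, wer_accuracy: float = 0.0) -> float:
    """0.4 * wer_accuracy + 0.4 * tag_f1 + 0.2 * (1 - hallucination_rate)."""
    p, r = tag_set(pred), tag_set(ref)
    hit = len(p & r)
    precision = hit / len(p) if p else 0.0
    recall = hit / len(r) if r else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hit else 0.0
    # Assumption: hallucination rate = fraction of predicted tags not in the reference.
    hall = len(p - r) / len(p) if p else 0.0
    return 0.4 * wer_accuracy + 0.4 * f1 + 0.2 * (1.0 - hall)


def curate(candidates: dict[str, list[str]], refs: dict[str, str], drop_pct: int = 10):
    """Best-of-N per input, then drop the lowest-reward drop_pct percent."""
    best = []
    for sample_id, completions in candidates.items():
        scored = [(reward(c, refs[sample_id]), c) for c in completions]
        top_reward, top_text = max(scored, key=lambda x: x[0])
        best.append((top_reward, sample_id, top_text))
    best.sort(key=lambda x: x[0], reverse=True)
    # Assumption: floor rounding when computing how many inputs to keep.
    n_keep = len(best) * (100 - drop_pct) // 100
    return best[:n_keep]
```

`wer_accuracy` defaults to 0.0 because the adapter emits no transcript text. With 808 inputs and `drop_pct=10`, the floor rounding keeps 727, which is consistent with the curated count reported here, though the rounding rule actually used is not confirmed.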

Effect vs SFT alone: hallucination rate drops **4 pp** on raw output (61% → 57%) and **8 pp** once the `top_k=2` filter is applied (61% → 53%), slightly fewer tags are emitted on average, and Tag F1 / Recall stay ≈ flat. RAFT's gain is marginal here because the rule-based reward lacks an absolute anti-overemit term: it ranks by the *rate* of wrong tags, not their total count, so over-emitting fallback patterns survive curation. See the project's [`prior_work.md` Phase 4](https://github.com/Tame-Your-Monkey/evoxtral-realtime/blob/main/.claude/docs/prior_work.md) for the full diagnosis.

## Architecture: Moshi-style backchannel

This adapter is **tag-only**: it does NOT produce ASR text. Pair it with the frozen base model for ASR and merge the two outputs at inference:

```
audio ─┬─ base Voxtral-Mini-4B-Realtime-2602 ─→ ASR text (clean WER ~10%)
       └─ this adapter (LoRA + RAFT)        ─→ tag stream → top_k=2 filter

merged: "[whispers] [pause] Listen, I know you're in a meeting"
```

The dual-channel pattern is inspired by Moshi's parallel-stream design, adapted to Voxtral Realtime's element-wise audio-text fusion architecture. Reference Mode B implementation: [`serve_modal.py`](https://github.com/Tame-Your-Monkey/evoxtral-realtime/blob/main/training/scripts/serve_modal.py) (Modal-deployed FastAPI, two model instances on a single A100-40GB, parallel forward via `asyncio.gather`, top-K filter, JSON merged output).
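
The filter-and-merge step in the diagram can be sketched as follows. This is an illustration only: `filter_tags` and `merge` are hypothetical names, and ranking tags by emission order is an assumption (the reference server may rank by token probability or another score).

```python
import re

TAG_RE = re.compile(r"\[[^\[\]]+\]")


def filter_tags(tag_stream: str, top_k: int = 2) -> list[str]:
    """Keep at most top_k distinct tags, in the order they were emitted."""
    kept: list[str] = []
    for tag in TAG_RE.findall(tag_stream):
        if len(kept) >= top_k:
            break
        if tag not in kept:
            kept.append(tag)
    return kept


def merge(asr_text: str, tag_stream: str, top_k: int = 2) -> str:
    """Prefix the filtered tags to the ASR transcript from the base model."""
    tags = filter_tags(tag_stream, top_k)
    return " ".join(tags + [asr_text.strip()])


merged = merge(
    "Listen, I know you're in a meeting",
    "[whispers] [pause] [calm] [clears throat]",
)
```

Here `merged` reproduces the example string from the diagram; placing all tags before the text is a simplification.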

## Performance (50-sample test set, greedy)

| Metric | Base | SFT only | **RL (this) raw** | **RL + top_k=2 filter (production)** |
|---|---|---|---|---|
| Tag F1 | 22% | 28% | 28% | **29%** ⭐ |
| Tag Recall | 22% | 51% | 50% | 42% |
| Tag Precision | 100% | 34% | 37% | **47%** |
| Tag Hallucination | 0% | 61% | 57% | **53%** |
| WER (text from base) | 10% | n/a | n/a | 10% (unchanged) |

Production config = **this adapter + base for ASR + `top_k=2` inference filter**, i.e. the rightmost column of the table above.

## Quick start

```python
import torch
from transformers import VoxtralRealtimeForConditionalGeneration, AutoProcessor
from peft import PeftModel

processor = AutoProcessor.from_pretrained("mistralai/Voxtral-Mini-4B-Realtime-2602")
base = VoxtralRealtimeForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Mini-4B-Realtime-2602",
    dtype=torch.bfloat16,
    device_map="auto",
)
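tag_model = PeftModel.from_pretrained(base, "YongkangZOU/evoxtral-realtime-rl")
tag_model.eval()
# PeftModel injects the LoRA layers into `base` in place, so `base` and `tag_model`
# now share the adapted weights. Use `tag_model` for the tag stream. For clean ASR,
# either load a second, unadapted base instance (as serve_modal.py does) or run
# generation inside `with tag_model.disable_adapter():`.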
```

For end-to-end use (POST audio file → JSON with `text`, `tags_filtered`, `merged`), the project repo ships a Modal-deployed FastAPI server with parallel forward + top-K filter built in.
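
The request flow can be sketched as below, with stand-in functions in place of the two model calls. This is not the code in `serve_modal.py`: `transcribe_with_tags`, `fake_asr` and `fake_tagger` are hypothetical, and dispatching the blocking forwards with `asyncio.to_thread` is an assumption about how the parallel forward might be arranged.

```python
import asyncio
from typing import Callable


async def transcribe_with_tags(
    audio: bytes,
    run_asr: Callable[[bytes], str],
    run_tagger: Callable[[bytes], list[str]],
) -> dict:
    """Run both blocking model calls concurrently and build the JSON payload."""
    text, tags = await asyncio.gather(
        asyncio.to_thread(run_asr, audio),
        asyncio.to_thread(run_tagger, audio),
    )
    return {
        "text": text,
        "tags_filtered": tags,
        "merged": " ".join(tags + [text]),
    }


# Hypothetical stand-ins; real versions would wrap each model's generate() call.
def fake_asr(audio: bytes) -> str:
    return "Listen, I know you're in a meeting"


def fake_tagger(audio: bytes) -> list[str]:
    return ["[whispers]", "[pause]"]


response = asyncio.run(transcribe_with_tags(b"", fake_asr, fake_tagger))
```

The three keys mirror the `text`, `tags_filtered` and `merged` fields named above; any further fields in the real response are not shown.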

## Training details

**Stage 1 inheritance** — see the [SFT card](https://huggingface.co/YongkangZOU/evoxtral-realtime-sft) for: v1-style packed schema, tags-only target, LoRA r=16/α=64 attention-only, frozen audio path.

**Stage 2 RAFT additions:**

- **Method:** RAFT (rejection sampling + plain SFT). No critic, no KL penalty or PPO-style clipping, no learned reward model.
- **Generation:** N=4 × 808 train samples = 3232 completions, temperature=0.7, top_p=0.9, max_new_tokens=64. ~33 min on A100-40GB.
- **Reward function:** `0.4 × (1 − WER) + 0.4 × tag_f1 + 0.2 × (1 − hall_rate)` (rule-based; for this backchannel adapter the `(1 − WER)` term is a constant 0 because predictions contain no transcript text, so the reward effectively scores tag quality only).
- **Curated set:** 727 samples after bottom-10% reward filter.
- **SFT-on-curated:** 1 epoch (46 steps), lr=5e-5, cosine schedule, warmup=20, **gradient_checkpointing=False** (PeftModel.from_pretrained + checkpointing crashes on the in-place audio add — see project cheat-sheet).
- **Trainable:** 16.2 M of 4.5 B (0.36%). Slightly higher than SFT due to PeftModel.from_pretrained loading.
- **Hardware:** Modal A100-40GB, bf16, ~3 min runtime.
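
To make the schedule concrete, the sketch below evaluates a linear-warmup-plus-cosine learning rate with the numbers listed above. It assumes `warmup=20` means 20 optimizer steps and that the cosine decays to zero at step 46; `lr_at` is a hypothetical helper, not part of the training script.

```python
import math

PEAK_LR = 5e-5
WARMUP_STEPS = 20   # assumption: warmup is given in steps, not as a ratio
TOTAL_STEPS = 46


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at(s) for s in range(TOTAL_STEPS + 1)]
```

Under these assumptions nearly half of the 46 steps are warmup, so the peak rate of 5e-5 is reached only at step 20 and decays from there.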

## RAFT pitfalls discovered along the way

The RAFT pipeline (`rl_modal.py` in the project repo) needed five fixes vs the original Stage 2 design before it ran clean. Documented here for future RAFT-on-Voxtral-Realtime users:

1. **Audio pre-pad missing** — generation must pre-pad raw audio to `AUDIO_MAX_SAMPLES=240_480` to match the train/eval audio path.
2. **Mel mod-8 padding missing** — encoder reshape requires `T_mel % 8 == 0`.
3. **`max_new_tokens=512` excessive** for backchannel — tag-only outputs are ~5-10 tokens; reduced to 64.
4. **`num_delay_tokens` scalar tensor breaks `num_return_sequences > 1`** in HF generate's `_expand_inputs_for_generation`. Drop the key before calling generate.
5. **`PeftModel.from_pretrained` + `gradient_checkpointing=True` crashes** on the in-place audio add at `modeling_voxtral_realtime.py:1078`. PeftModel.from_pretrained doesn't auto-freeze base params (unlike `get_peft_model`), and the checkpointing hook combined with frozen embeddings makes `inputs_embeds` a leaf-with-grad. Disable gradient_checkpointing for RAFT.
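
Fixes 1, 2 and 4 amount to small input-preparation steps, sketched below in plain Python. The helper names are hypothetical, and right-padding with zeros and leaving over-length clips untouched are assumptions; the project's pipeline may pad on a different side or truncate.

```python
AUDIO_MAX_SAMPLES = 240_480
MEL_MULTIPLE = 8


def pad_audio(samples: list[float], target: int = AUDIO_MAX_SAMPLES) -> list[float]:
    """Fix 1: right-pad raw audio with zeros up to the training length.

    Assumption: clips already longer than target are returned unchanged.
    """
    missing = max(0, target - len(samples))
    return samples + [0.0] * missing


def mel_pad_frames(t_mel: int, multiple: int = MEL_MULTIPLE) -> int:
    """Fix 2: extra mel frames needed so that T_mel % multiple == 0."""
    return (-t_mel) % multiple


def strip_generate_inputs(inputs: dict) -> dict:
    """Fix 4: drop num_delay_tokens before generate() with num_return_sequences > 1."""
    return {k: v for k, v in inputs.items() if k != "num_delay_tokens"}
```

For example, a mel spectrogram with 1503 frames needs one more frame to satisfy the mod-8 constraint, and one with 1504 frames needs none.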

See the [hard-won facts cheat-sheet](https://github.com/Tame-Your-Monkey/evoxtral-realtime/blob/main/.claude/CLAUDE.md#hard-won-facts-about-voxtral-realtime-training) for the full set of Voxtral Realtime training pitfalls.

## Limitations

- **Default-emit fallback persists.** On uncertain audio, the model still emits `[calm] [pause] [clears throat]` as a default set. RAFT trims this slightly but doesn't eliminate it. This is a data-side limitation: the TTS-synthesized affect signal is too weak to differentiate ambiguous inputs.
- **Best with `top_k=2` filter.** Raw output over-emits ~4-6 tags per utterance. The inference-time top-K filter is part of the production config.
- **TTS dataset.** Trained on ElevenLabs-synthesized audio. Real clinical recordings are out of distribution.
- **Tag taxonomy fixed.** 15 base tags. Out-of-taxonomy concepts won't be tagged.
- **English only.**

## See also

- ⚙️ [`YongkangZOU/evoxtral-realtime-sft`](https://huggingface.co/YongkangZOU/evoxtral-realtime-sft) — the SFT-only baseline that this adapter was bootstrapped from.
- 🏗️ [Project repository](https://github.com/Tame-Your-Monkey/evoxtral-realtime) — full pipeline, evaluation harness, Mode B hybrid serve (`serve_modal.py`), design docs.
- 🎙️ [Voxtral-Mini-4B-Realtime-2602](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) — required base model.
- 📄 [RAFT paper (Dong et al. 2023)](https://arxiv.org/abs/2304.06767) — the Reward rAnked FineTuning method this adapter uses for Stage 2.

## License

Apache-2.0, matching the base Voxtral Realtime license.

## Citation

```bibtex
@software{evoxtral_realtime_2026,
  title  = {Evoxtral-Realtime: RAFT-polished backchannel adapter for Voxtral-Mini-4B-Realtime},
  author = {Yongkang Zou},
  year   = {2026},
  url    = {https://github.com/Tame-Your-Monkey/evoxtral-realtime}
}

@misc{voxtral_mini_realtime,
  author = {Mistral AI},
  title  = {Voxtral-Mini-4B-Realtime-2602},
  year   = {2026},
  url    = {https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602}
}

@misc{dong2023raft,
  title  = {RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment},
  author = {Dong, Hanze and Xiong, Wei and Goyal, Deepanshu and Pan, Rui and Diao, Shizhe and Zhang, Jipeng and Shum, Kashun and Zhang, Tong},
  year   = {2023},
  eprint = {2304.06767},
  url    = {https://arxiv.org/abs/2304.06767}
}
```