Voxtral Mini 4B Realtime 4bit (bfloat16)
This is a 4-bit quantized MLX conversion of mistralai/Voxtral-Mini-4B-Realtime-2602 that uses bfloat16 as the base (non-quantized) dtype.
Compared to the older mlx-community/Voxtral-Mini-4B-Realtime-2602-4bit checkpoint, this one uses the weight layout that runs faster in the mlx-audio runtime: tok_embeddings are quantized, and the non-quantized weights and the quantization scales are stored in bfloat16.
In local mlx-audio streaming benchmarks on a few real audio samples, it ran about 3x faster overall than the older mlx-community variant, with similar transcription output.
Runs via mlx-audio.
Which variant should you pick?
| Chip | Recommended | Why |
|---|---|---|
| M3 / M4+ | This repo (`-4bit`, bf16) | bf16 has a native ALU on M3/M4; same speed as fp16 with a wider exponent range (safer numerics). |
| M1 / M2 | iris-sfg/Voxtral-Mini-4B-Realtime-2602-4bit-fp16 | Metal on M1/M2 has no native bf16 ALU; bf16 ops fall back to a slower path. The fp16 variant stays on the fast GPU path. |
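The table's decision rule can be sketched as a small helper. This is a hypothetical sketch, not part of mlx-audio: it assumes macOS on Apple silicon, where `sysctl -n machdep.cpu.brand_string` reports a name such as "Apple M2 Pro", and it assumes chips newer than M4 follow the M3/M4 row. `pick_variant` and `current_chip_name` are illustrative names.

```python
import re
import subprocess

# Repo ids taken from the table above.
REPO_BF16 = "iris-sfg/Voxtral-Mini-4B-Realtime-2602-4bit"
REPO_FP16 = "iris-sfg/Voxtral-Mini-4B-Realtime-2602-4bit-fp16"


def pick_variant(chip_name: str) -> str:
    """Map an Apple M-series chip name (e.g. 'Apple M2 Pro') to the recommended repo."""
    match = re.search(r"\bM(\d+)\b", chip_name)
    if match is None:
        raise ValueError(f"not an Apple M-series chip name: {chip_name!r}")
    generation = int(match.group(1))
    # M1/M2 lack a native bf16 ALU, so they get the fp16 variant; M3 and newer get bf16.
    return REPO_FP16 if generation <= 2 else REPO_BF16


def current_chip_name() -> str:
    """Read the CPU brand string on macOS (assumption: Apple silicon reports 'Apple Mx ...')."""
    out = subprocess.run(
        ["sysctl", "-n", "machdep.cpu.brand_string"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()
```

On a Mac, `pick_variant(current_chip_name())` returns a repo id that can be passed straight to `load_model`.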
Only the non-quantized weights differ between the two repos (norms, biases, scales, some embeddings). The quantized mat-mul weights are bit-identical. Transcription output is byte-identical on a 20 s French clip at temperature 0 (verified locally).
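To make the "wider exponent range" point concrete, here is a pure-Python sketch that emulates both 16-bit formats with the standard `struct` module. fp16 uses struct's `'e'` half-precision format; bf16 is modeled as the upper 16 bits of a float32 with round-to-nearest-even (NaN handling omitted). This only illustrates the number formats (bf16: 8 exponent bits, 7 mantissa bits; fp16: 5 exponent bits, 10 mantissa bits); it is not how MLX or Metal perform the conversion.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision; values beyond the fp16 range become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Round x to bfloat16: keep the top 16 bits of its float32 pattern (round-to-nearest-even)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    lsb = (bits >> 16) & 1  # last bit that survives, used for ties-to-even
    rounded = (bits + 0x7FFF + lsb) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", rounded))[0]


# fp16 overflows above 65504 and flushes tiny values to zero; bf16 keeps both but with less precision.
for value in (65504.0, 1e5, 1e-8, 1 / 3):
    print(f"{value:>12.6g}  fp16={to_fp16(value):<12.6g} bf16={to_bf16(value):.6g}")
```

The trade-off shows up directly: bf16 keeps large and tiny magnitudes finite, while fp16 gives a closer value for ordinary numbers such as 1/3.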
Conversion
Source model: mistralai/Voxtral-Mini-4B-Realtime-2602
Local conversion command:

```bash
python -m mlx_audio.convert \
  --hf-path mistralai/Voxtral-Mini-4B-Realtime-2602 \
  --mlx-path /path/to/Voxtral-Mini-4B-Realtime-2602-4bit \
  --quantize \
  --q-group-size 64 \
  --q-bits 4 \
  --model-domain stt
```
Quantization config:
- bits: 4
- group size: 64
- mode: affine
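A generic sketch of what 4-bit affine quantization with group size 64 means: each run of 64 consecutive weights gets its own scale and bias, and each weight is stored as a 4-bit code q with w ≈ scale · q + bias. The plain min/max choice of scale and bias below is an assumption for illustration; MLX's own quantizer may choose them differently. Assuming one 16-bit scale and one 16-bit bias per group, the storage cost is (4 · 64 + 16 + 16) / 64 = 4.5 bits per weight.

```python
import random

BITS = 4
GROUP_SIZE = 64
MAX_CODE = 2**BITS - 1  # 4-bit codes range over 0..15


def quantize_group(group):
    """Min/max affine quantization of one group: w ≈ scale * q + bias."""
    lo, hi = min(group), max(group)
    scale = (hi - lo) / MAX_CODE if hi > lo else 1.0
    codes = [min(MAX_CODE, max(0, round((w - lo) / scale))) for w in group]
    return codes, scale, lo  # the group minimum serves as the bias


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights from 4-bit codes."""
    return [scale * q + bias for q in codes]


def quantize(weights):
    """Split a flat weight list into groups of GROUP_SIZE and quantize each one."""
    if len(weights) % GROUP_SIZE:
        raise ValueError("weight count must be a multiple of GROUP_SIZE")
    return [quantize_group(weights[i:i + GROUP_SIZE])
            for i in range(0, len(weights), GROUP_SIZE)]


def bits_per_weight(scale_bits=16, bias_bits=16):
    """Storage cost per weight, assuming one scale and one bias per group."""
    return (BITS * GROUP_SIZE + scale_bits + bias_bits) / GROUP_SIZE


# Demo on synthetic weights (not the real model's weights).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4 * GROUP_SIZE)]
groups = quantize(weights)
worst = 0.0
for g, (codes, scale, bias) in enumerate(groups):
    original = weights[g * GROUP_SIZE:(g + 1) * GROUP_SIZE]
    restored = dequantize_group(codes, scale, bias)
    worst = max(worst, max(abs(a - b) for a, b in zip(original, restored)))
print(f"groups={len(groups)}  worst abs error={worst:.2e}  bits/weight={bits_per_weight()}")
```

Per-element error is bounded by half a quantization step (scale / 2), which is why smaller groups give tighter scales at the cost of more scale/bias storage.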
Files
This repository intentionally contains only the MLX runtime artifacts needed for inference:
- model.safetensors
- model.safetensors.index.json
- config.json
- generation_config.json
- params.json
- processor_config.json
- tekken.json
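A quick way to confirm a local download contains every runtime artifact listed above is a small check over the directory. This is a hypothetical helper, not part of mlx-audio; it only tests that the files exist, not that their contents are valid.

```python
from pathlib import Path

# File list copied from the "Files" section of this card.
EXPECTED_FILES = (
    "model.safetensors",
    "model.safetensors.index.json",
    "config.json",
    "generation_config.json",
    "params.json",
    "processor_config.json",
    "tekken.json",
)


def missing_files(model_dir: str) -> list[str]:
    """Return the expected artifact names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

For example, `missing_files("Voxtral-Mini-4B-Realtime-2602-4bit")` returns an empty list when the local copy is complete.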
Usage
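```bash
pip install "mlx-audio[stt]"
```

```python
from mlx_audio.stt.utils import load_model

# Accepts a local directory or a Hugging Face repo id.
model = load_model("iris-sfg/Voxtral-Mini-4B-Realtime-2602-4bit")

# Transcribe a local audio file; the result exposes the transcript as .text.
result = model.generate("audio.wav")
print(result.text)
```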
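For checking latency locally, a minimal sketch can time the `model.generate` call from the usage example and report the real-time factor (RTF: processing time divided by audio duration; below 1.0 means faster than real time). It assumes a WAV input readable by Python's `wave` module and that `generate` returns once the full transcript is ready; `timed_generate` and `wav_duration_seconds` are hypothetical helper names. This measures a single offline call, not the streaming path used for the benchmark figures quoted earlier.

```python
import time
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def timed_generate(model, path: str):
    """Run model.generate(path) and return (text, wall seconds, real-time factor)."""
    duration = wav_duration_seconds(path)
    start = time.perf_counter()
    result = model.generate(path)
    elapsed = time.perf_counter() - start
    return result.text, elapsed, elapsed / duration
```

The first call typically includes one-time warm-up cost, so it is worth running the same file a few times and comparing the later timings.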
Notes
- Base model license remains Apache 2.0.
- Verify latency and decode behavior against your local benchmarks before publishing as a canonical variant.
Model tree for iris-sfg/Voxtral-Mini-4B-Realtime-2602-4bit: base model mistralai/Ministral-3-3B-Base-2512