+---
+license: apache-2.0
+base_model: mistralai/Voxtral-4B-TTS-2603
+pipeline_tag: text-to-speech
+library_name: executorch
+tags:
+- ExecuTorch
+- mlx
+- apple-silicon
+- tts
+- voxtral
+- on-device
+- text-to-speech
+---
+# Voxtral-4B-TTS-2603-ExecuTorch-MLX
+Pre-exported ExecuTorch artifacts for
+[Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603)
+with the **MLX backend** for Apple Silicon. The LM decoder and flow head use
+bf16 precision with 4-bit weight-only linear quantization and 8-bit embedding
+quantization. The codec decoder is exported unquantized and lowered natively to
+MLX.
+This repository is the Apple Silicon companion to the CUDA artifact repo:
+[younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-CUDA](https://huggingface.co/younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-CUDA).
+## Overview
+The pipeline has two stages: **export** (Python, once) and **inference**
+(C++ runner, repeated). This repo ships the export outputs so you can skip
+straight to inference on a locally built ExecuTorch MLX runner.
+The model has three components:
+1. **Mistral 4B LLM decoder** — autoregressive text to hidden states
+2. **Flow Matching Head** — hidden states to 37 audio codebook tokens per frame
+3. **Codec Decoder** — codebook tokens to 24 kHz mono waveform
+## Files
+| File | Size | Contents |
+|---|---:|---|
+| `model.pte` | 2.20 GiB | LM decoder, token embedding, audio embedding, semantic head, and flow velocity methods lowered to MLX |
+| `codec_decoder.pte` | 289 MiB | Native MLX codec decoder for waveform synthesis |
+The tokenizer and voice embeddings are **not included**. Download them from the
+base model so they match the upstream Voxtral release.
+## Performance
+Validated on Apple Silicon with `seed=42` and prompt
+`"Hello, how are you today?"`.
+| Config | Audio | Generate time | Generation RTF | Process wall | Notes |
+|---|---:|---:|---:|---:|---|
+| MLX bf16 + 4w linear + 8w embedding | 3.44 s | 3132 ms | 0.910465 | 5.19 s | first measured run |
+| MLX bf16 + 4w linear + 8w embedding | 3.44 s | 2634 ms | 0.765698 | 3.15 s | warm run |
+| MLX bf16 + 4w linear + 8w embedding | 3.44 s | 2607 ms | 0.757849 | 3.13 s | warm run |
+Across the three runs, generation RTF averages `0.811337` (`0.761774` over the
+two warm runs) and process wall time averages `3.82 s` (`3.14 s` over the two
+warm runs). WAV quality check: peak `0.42575`, clipped samples `0`. Apple Speech
+transcribed the generated sample as `Hello how are you today`.
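The RTF column is consistent with generation time divided by audio duration, so values below 1 mean audio is produced faster than it plays back. A minimal sketch that recomputes the column from the table, assuming that definition (the `rtf` helper is a hypothetical name, not part of the runner):

```bash
# rtf GENERATE_MS AUDIO_SECONDS
# Assumed definition: RTF = generation time / audio duration.
rtf() {
  awk -v ms="$1" -v s="$2" 'BEGIN { printf "%.6f\n", (ms / 1000) / s }'
}

rtf 3132 3.44   # first measured run
rtf 2634 3.44   # warm run
rtf 2607 3.44   # warm run
```

These print `0.910465`, `0.765698`, and `0.757849`, matching the table rows.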
+## Prerequisites
+- macOS on Apple Silicon.
+- ExecuTorch built from source with `EXECUTORCH_BUILD_MLX=ON`.
+- Tokenizer and voice embeddings from
+  [mistralai/Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603).
+```bash
+git clone https://github.com/pytorch/executorch ~/executorch
+cd ~/executorch
+./install_executorch.sh
+pip install -e . --no-build-isolation
+make voxtral_tts-mlx
+```
+The native codec artifacts were validated against ExecuTorch source commit:
+```text
+8ba124624c33fcf12223755d2060b2b7bc739ea8
+```
+## Download
+```bash
+pip install -U huggingface_hub  # the `hf` command needs a recent release
+# ExecuTorch MLX artifacts.
+hf download younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-MLX \
+    --local-dir voxtral_tts_mlx
+# Tokenizer + voice embeddings from the base model.
+# The glob is quoted so the shell (zsh by default on macOS) passes it to hf as-is.
+hf download mistralai/Voxtral-4B-TTS-2603 \
+    tekken.json "voice_embedding/*" \
+    --local-dir voxtral_tts_base
+```
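+## Run
+The runner path below is relative to the ExecuTorch checkout (`~/executorch` in
+the build step), while the `voxtral_tts_mlx` and `voxtral_tts_base` paths are
+relative to the directory where the downloads were run. Adjust the paths if the
+two locations differ.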
+```bash
+unset CPATH
+cmake-out/examples/models/voxtral_tts/voxtral_tts_runner \
+    --model voxtral_tts_mlx/model.pte \
+    --codec voxtral_tts_mlx/codec_decoder.pte \
+    --tokenizer voxtral_tts_base/tekken.json \
+    --voice voxtral_tts_base/voice_embedding/neutral_female.pt \
+    --text "Hello, how are you today?" \
+    --output output.wav \
+    --seed 42 \
+    --max_new_tokens 200
+```
+Output is 24 kHz mono 16-bit PCM. Listen with:
+```bash
+ffplay output.wav
+```
+## Streaming
+Add `--streaming` to emit codec output in chunks instead of one batch at the
+end. Pair it with `--speaker` to write raw `f32le` PCM to stdout, which can be
+piped into a player for live playback:
+```bash
+cmake-out/examples/models/voxtral_tts/voxtral_tts_runner \
+    --model voxtral_tts_mlx/model.pte \
+    --codec voxtral_tts_mlx/codec_decoder.pte \
+    --tokenizer voxtral_tts_base/tekken.json \
+    --voice voxtral_tts_base/voice_embedding/neutral_female.pt \
+    --text "Introducing real-time Voxtral TTS streaming on Apple Silicon with the ExecuTorch MLX backend." \
+    --seed 42 \
+    --max_new_tokens 200 \
+    --streaming \
+    --speaker \
+  | ffplay -f f32le -sample_rate 24000 -ch_layout mono -nodisp -autoexit -
+```
+With ALSA's `aplay` (a Linux tool, not available on macOS), the equivalent is
+`... | aplay -f FLOAT_LE -r 24000 -c 1`.
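Because the streamed output is headerless, its duration follows from the byte count alone: `f32le` uses 4 bytes per sample, at 24000 samples per second on one channel. A small sketch, assuming the stream was redirected to a file instead of a player and that stdout carries only PCM (`pcm_seconds` and `capture.f32le` are hypothetical names):

```bash
# pcm_seconds FILE: duration in seconds of a raw f32le, 24 kHz, mono capture.
pcm_seconds() {
  bytes=$(wc -c < "$1")
  # 4 bytes per float32 sample * 24000 samples per second * 1 channel
  awk -v b="$bytes" 'BEGIN { printf "%.2f\n", b / (4 * 24000) }'
}
```

For example, after replacing the `| ffplay ...` stage of the streaming command with `> capture.f32le`, `pcm_seconds capture.f32le` reports the length of the generated audio.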
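+## Re-export
+To regenerate the artifacts, run the export script from the ExecuTorch source
+root against a local copy of the base model: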
+```bash
+python examples/models/voxtral_tts/export_voxtral_tts.py \
+    --model-path ~/models/Voxtral-4B-TTS-2603 \
+    --backend mlx \
+    --dtype bf16 \
+    --qlinear 4w \
+    --qembedding 8w \
+    --output-dir ./voxtral_tts_exports_mlx_4w
+```
+`--qembedding 8w` auto-selects `--qembedding-group-size=128`. `--qlinear-codec`
+is not yet validated for MLX, so this export keeps the codec unquantized.
+## Checksums
+```text
+75597b9b364defaef5db7ade0b77cc11e523e958764d19344e4aa1412ffefa41  model.pte
+53cc5f0acbe2f7e252aba719effad26c756c1d025c80c62ef295fba52837398c  codec_decoder.pte
+```
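To check a download against these digests, something like the following may help. It assumes the values are SHA-256 (they are 64 hex characters long) and that the files sit in the directory passed as the first argument; `sha256_check` and `verify_artifacts` are hypothetical helper names.

```bash
# Pick whichever SHA-256 tool is installed (shasum ships with macOS).
if command -v shasum >/dev/null 2>&1; then
  sha256_check() { shasum -a 256 -c; }
else
  sha256_check() { sha256sum -c; }
fi

# verify_artifacts DIR: check both .pte files in DIR against the published digests.
verify_artifacts() {
  ( cd "$1" && sha256_check ) <<'EOF'
75597b9b364defaef5db7ade0b77cc11e523e958764d19344e4aa1412ffefa41  model.pte
53cc5f0acbe2f7e252aba719effad26c756c1d025c80c62ef295fba52837398c  codec_decoder.pte
EOF
}
```

Call it as `verify_artifacts voxtral_tts_mlx`. Each file should be reported as `OK`, and the function returns non-zero on a mismatch or a missing file.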