---
license: apache-2.0
base_model: mistralai/Voxtral-4B-TTS-2603
pipeline_tag: text-to-speech
library_name: executorch
tags:
- ExecuTorch
- mlx
- apple-silicon
- tts
- voxtral
- on-device
- text-to-speech
---

# Voxtral-4B-TTS-2603-ExecuTorch-MLX

Pre-exported ExecuTorch artifacts for [Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603) with the **MLX backend** for Apple Silicon. The LM decoder and flow head use bf16 precision with 4-bit weight-only linear quantization and 8-bit embedding quantization. The codec decoder is exported unquantized and lowered natively to MLX.

This repository is the Apple Silicon companion to the CUDA artifact repo: [younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-CUDA](https://huggingface.co/younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-CUDA).

## Overview

The pipeline has two stages: **export** (Python, once) and **inference** (C++ runner, repeated). This repo ships the export outputs so you can skip straight to inference on a locally built ExecuTorch MLX runner.

The model has three components:

1. **Mistral 4B LLM decoder**: autoregressive text to hidden states
2. **Flow Matching Head**: hidden states to 37 audio codebook tokens per frame
3. **Codec Decoder**: codebook tokens to 24 kHz mono waveform

## Files

| File | Size | What |
|---|---:|---|
| `model.pte` | 2.20 GiB | LM decoder, token embedding, audio embedding, semantic head, and flow velocity methods lowered to MLX |
| `codec_decoder.pte` | 289 MiB | Native MLX codec decoder for waveform synthesis |

The tokenizer and voice embeddings are **not included**. Download them from the base model so they match the upstream Voxtral release.

## Performance

Validated on Apple Silicon with `seed=42` and prompt `"Hello, how are you today?"`.


| Config | Audio | Generate time | Generation RTF | Process wall | Notes |
|---|---:|---:|---:|---:|---|
| MLX bf16 + 4w linear + 8w embedding | 3.44 s | 2932 ms | 0.852326 | 4.20 s | refreshed after MLX indexing fix |
| MLX bf16 + 4w linear + 8w embedding | 3.44 s | 3132 ms | 0.910465 | 5.19 s | first measured run |
| MLX bf16 + 4w linear + 8w embedding | 3.44 s | 2634 ms | 0.765698 | 3.15 s | warm run |
| MLX bf16 + 4w linear + 8w embedding | 3.44 s | 2607 ms | 0.757849 | 3.13 s | warm run |

Latest WAV quality check: peak `0.425764`, clipped samples `0`. Apple Speech transcribed the original generated sample as `Hello how are you today`.

## Prerequisites

- macOS on Apple Silicon.
- ExecuTorch built from source with `EXECUTORCH_BUILD_MLX=ON`.
- Tokenizer and voice embeddings from [mistralai/Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603).

```bash
git clone https://github.com/pytorch/executorch ~/executorch
cd ~/executorch
./install_executorch.sh
pip install -e . --no-build-isolation
make voxtral_tts-mlx
```

The native codec artifacts were validated against ExecuTorch source commit:

```text
ba5b038400299a383dbe93ab394a30f42a953cc1
```

## Download

```bash
pip install huggingface_hub

# ExecuTorch MLX artifacts.
hf download younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-MLX \
  --local-dir voxtral_tts_mlx

# Tokenizer + voice embeddings from the base model.
# The glob is quoted so the shell (zsh on macOS) passes it through unexpanded.
hf download mistralai/Voxtral-4B-TTS-2603 \
  tekken.json "voice_embedding/*" \
  --local-dir voxtral_tts_base
```

## Run

```bash
unset CPATH
cmake-out/examples/models/voxtral_tts/voxtral_tts_runner \
  --model voxtral_tts_mlx/model.pte \
  --codec voxtral_tts_mlx/codec_decoder.pte \
  --tokenizer voxtral_tts_base/tekken.json \
  --voice voxtral_tts_base/voice_embedding/neutral_female.pt \
  --text "Hello, how are you today?" \
  --output output.wav \
  --seed 42 \
  --max_new_tokens 200
```

Output is 24 kHz mono 16-bit PCM.
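
The stated output format can be spot-checked straight from the WAV header. The following is a minimal sketch, not part of the runner: `wav_format` is a hypothetical helper, and it assumes a canonical RIFF header with the `fmt ` chunk first (channels at byte 22, sample rate at byte 24, bits per sample at byte 34) and a little-endian host such as Apple Silicon.

```bash
# Hypothetical helper: print sample rate, channel count, and bit depth of a WAV file.
# Assumes a canonical 44-byte RIFF/WAVE header and a little-endian host.
wav_format() {
  local f="$1" ch rate bits
  ch="$(od -A n -t u2 -j 22 -N 2 "$f" | tr -d '[:space:]')"    # channel count
  rate="$(od -A n -t u4 -j 24 -N 4 "$f" | tr -d '[:space:]')"  # sample rate in Hz
  bits="$(od -A n -t u2 -j 34 -N 2 "$f" | tr -d '[:space:]')"  # bits per sample
  echo "${rate} Hz, ${ch} ch, ${bits}-bit"
}

# Inspect the runner's output only if it exists in the current directory.
if [ -f output.wav ]; then
  wav_format output.wav
fi
```

If the header reports something other than 24000 Hz, 1 channel, and 16 bits, the file either uses a non-canonical header layout or was not produced by the command above.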
Listen with:

```bash
ffplay output.wav
```

## Streaming

Add `--streaming` to emit codec output in chunks instead of one batch at the end. Pair it with `--speaker` to pipe raw `f32le` PCM to stdout for live playback:

```bash
cmake-out/examples/models/voxtral_tts/voxtral_tts_runner \
  --model voxtral_tts_mlx/model.pte \
  --codec voxtral_tts_mlx/codec_decoder.pte \
  --tokenizer voxtral_tts_base/tekken.json \
  --voice voxtral_tts_base/voice_embedding/neutral_female.pt \
  --text "Introducing real-time Voxtral TTS streaming on Apple Silicon with the ExecuTorch MLX backend." \
  --seed 42 \
  --max_new_tokens 200 \
  --streaming \
  --speaker \
  | ffplay -f f32le -sample_rate 24000 -ch_layout mono -nodisp -autoexit -
```

For `aplay` instead: `... | aplay -f FLOAT_LE -r 24000 -c 1`.

## Re-export

```bash
python examples/models/voxtral_tts/export_voxtral_tts.py \
  --model-path ~/models/Voxtral-4B-TTS-2603 \
  --backend mlx \
  --dtype bf16 \
  --qlinear 4w \
  --qembedding 8w \
  --output-dir ./voxtral_tts_exports_mlx_4w
```

`--qembedding 8w` auto-selects `--qembedding-group-size=128`. `--qlinear-codec` is not yet validated for MLX, so this export keeps the codec unquantized.

## Checksums

```text
904131ac1a1e3552ea4ada566c19eb57d654e662f93f906456aa1f8633825688  model.pte
162178ce94732db05bb74d7240a97f2c5a898b8819a29b5d59ebf076aeda8891  codec_decoder.pte
```