Add MLX model card
README.md
ADDED
@@ -0,0 +1,167 @@
---
license: apache-2.0
base_model: mistralai/Voxtral-4B-TTS-2603
pipeline_tag: text-to-speech
library_name: executorch
tags:
- ExecuTorch
- mlx
- apple-silicon
- tts
- voxtral
- on-device
- text-to-speech
---

# Voxtral-4B-TTS-2603-ExecuTorch-MLX

Pre-exported ExecuTorch artifacts for
[Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603)
with the **MLX backend** for Apple Silicon. The LM decoder and flow head use
bf16 precision with 4-bit weight-only linear quantization and 8-bit embedding
quantization. The codec decoder is exported unquantized and lowered natively to
MLX.

This repository is the Apple Silicon companion to the CUDA artifact repo:
[younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-CUDA](https://huggingface.co/younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-CUDA).

## Overview

The pipeline has two stages: **export** (Python, once) and **inference**
(C++ runner, repeated). This repo ships the export outputs so you can skip
straight to inference on a locally built ExecuTorch MLX runner.

The model has three components:

1. **Mistral 4B LLM decoder**: autoregressive, text to hidden states
2. **Flow Matching Head**: hidden states to 37 audio codebook tokens per frame
3. **Codec Decoder**: codebook tokens to 24 kHz mono waveform

## Files

| File | Size | Contents |
|---|---:|---|
| `model.pte` | 2.20 GiB | LM decoder, token embedding, audio embedding, semantic head, and flow velocity methods lowered to MLX |
| `codec_decoder.pte` | 289 MiB | Native MLX codec decoder for waveform synthesis |

The tokenizer and voice embeddings are **not included**. Download them from the
base model so they match the upstream Voxtral release.

## Performance

Validated on Apple Silicon with `seed=42` and prompt
`"Hello, how are you today?"`.

| Config | Audio | Generation time | Generation RTF | Process wall | Notes |
|---|---:|---:|---:|---:|---|
| MLX bf16 + 4w linear + 8w embedding | 3.44 s | 3132 ms | 0.910465 | 5.19 s | first measured run |
| MLX bf16 + 4w linear + 8w embedding | 3.44 s | 2634 ms | 0.765698 | 3.15 s | warm run |
| MLX bf16 + 4w linear + 8w embedding | 3.44 s | 2607 ms | 0.757849 | 3.13 s | warm run |

Average generation RTF: `0.811337` (`0.761774` warm-run average). Average
process wall time: `3.82 s` (`3.14 s` warm-run average). WAV quality check:
peak `0.42575`, clipped samples `0`. Apple Speech transcribed the generated
sample as `Hello how are you today`.

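The RTF column is consistent with generation time divided by audio duration, so values below 1.0 mean audio is produced faster than it plays back. A minimal sketch of that arithmetic, assuming this definition; the `rtf` helper name is hypothetical and not part of the runner:

```bash
# Real-time factor, assumed here to be generation time / audio duration.
# Both arguments are in milliseconds; the result is printed with six decimals.
rtf() {
  awk -v gen_ms="$1" -v audio_ms="$2" \
    'BEGIN { printf "%.6f\n", gen_ms / audio_ms }'
}

rtf 3132 3440   # first measured run from the table (3.44 s of audio)
rtf 2634 3440   # first warm run
```

Feeding the third row (`rtf 2607 3440`) reproduces the last RTF value in the table in the same way.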
## Prerequisites

- macOS on Apple Silicon.
- ExecuTorch built from source with `EXECUTORCH_BUILD_MLX=ON`.
- Tokenizer and voice embeddings from
  [mistralai/Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603).

```bash
git clone https://github.com/pytorch/executorch ~/executorch
cd ~/executorch

./install_executorch.sh
pip install -e . --no-build-isolation
make voxtral_tts-mlx
```

The native codec artifacts were validated against ExecuTorch source commit:

```text
8ba124624c33fcf12223755d2060b2b7bc739ea8
```

## Download

```bash
# The hf CLI ships with recent huggingface_hub releases.
pip install -U huggingface_hub

# ExecuTorch MLX artifacts.
hf download younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-MLX \
  --local-dir voxtral_tts_mlx

# Tokenizer + voice embeddings from the base model.
# Quote the glob so the shell (zsh on macOS) passes it to hf unexpanded.
hf download mistralai/Voxtral-4B-TTS-2603 \
  tekken.json "voice_embedding/*" \
  --local-dir voxtral_tts_base
```

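Before launching the runner, it can help to confirm that both downloads produced the files the run command refers to. The sketch below assumes the directory layout used in this README; `check_voxtral_files` is a hypothetical helper, not something shipped with ExecuTorch:

```bash
# Hypothetical pre-flight check: report every input file that is missing.
# $1 = MLX artifact dir, $2 = base-model dir, $3 = voice name (optional).
check_voxtral_files() {
  local mlx_dir="$1" base_dir="$2" voice="${3:-neutral_female}"
  local missing=0 f
  for f in \
    "$mlx_dir/model.pte" \
    "$mlx_dir/codec_decoder.pte" \
    "$base_dir/tekken.json" \
    "$base_dir/voice_embedding/$voice.pt"; do
    if [ ! -f "$f" ]; then
      echo "missing: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

With the directories from the commands above, this would be called as `check_voxtral_files voxtral_tts_mlx voxtral_tts_base`; a non-zero exit status means at least one file was not found.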
## Run

```bash
unset CPATH

cmake-out/examples/models/voxtral_tts/voxtral_tts_runner \
  --model voxtral_tts_mlx/model.pte \
  --codec voxtral_tts_mlx/codec_decoder.pte \
  --tokenizer voxtral_tts_base/tekken.json \
  --voice voxtral_tts_base/voice_embedding/neutral_female.pt \
  --text "Hello, how are you today?" \
  --output output.wav \
  --seed 42 \
  --max_new_tokens 200
```

Output is 24 kHz mono 16-bit PCM. Listen with:

```bash
ffplay output.wav
```

## Streaming

Add `--streaming` to emit codec output in chunks instead of one batch at the
end. Pair it with `--speaker` to pipe raw `f32le` PCM to stdout for live
playback:

```bash
cmake-out/examples/models/voxtral_tts/voxtral_tts_runner \
  --model voxtral_tts_mlx/model.pte \
  --codec voxtral_tts_mlx/codec_decoder.pte \
  --tokenizer voxtral_tts_base/tekken.json \
  --voice voxtral_tts_base/voice_embedding/neutral_female.pt \
  --text "Introducing real-time Voxtral TTS streaming on Apple Silicon with the ExecuTorch MLX backend." \
  --seed 42 \
  --max_new_tokens 200 \
  --streaming \
  --speaker \
  | ffplay -f f32le -sample_rate 24000 -ch_layout mono -nodisp -autoexit -
```

With ALSA's `aplay` (Linux) in place of `ffplay`:
`... | aplay -f FLOAT_LE -r 24000 -c 1`.

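Because the stream is raw 32-bit float mono at 24 kHz, one second of audio is 24000 × 4 = 96000 bytes. The sketch below converts a byte count into a duration under exactly those assumptions; `pcm_seconds` is a hypothetical helper name:

```bash
# Duration of a raw f32le mono stream at 24 kHz, given its size in bytes.
# 4 bytes per sample * 24000 samples per second = 96000 bytes per second.
pcm_seconds() {
  awk -v bytes="$1" -v rate=24000 -v bytes_per_sample=4 \
    'BEGIN { printf "%.2f\n", bytes / (rate * bytes_per_sample) }'
}

pcm_seconds 96000    # exactly one second of audio
pcm_seconds 330240   # the 3.44 s clip length from the Performance table
```

If the runner output is redirected to a file instead of a player, `pcm_seconds "$(wc -c < out.f32)"` gives the captured duration; the file name `out.f32` is only an example.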
## Re-export

To regenerate the artifacts, run the export script from the root of the
ExecuTorch checkout, pointing `--model-path` at a local copy of the base model:

```bash
python examples/models/voxtral_tts/export_voxtral_tts.py \
  --model-path ~/models/Voxtral-4B-TTS-2603 \
  --backend mlx \
  --dtype bf16 \
  --qlinear 4w \
  --qembedding 8w \
  --output-dir ./voxtral_tts_exports_mlx_4w
```

`--qembedding 8w` auto-selects `--qembedding-group-size=128`. `--qlinear-codec`
is not yet validated for MLX, so this export keeps the codec unquantized.

## Checksums

```text
75597b9b364defaef5db7ade0b77cc11e523e958764d19344e4aa1412ffefa41  model.pte
53cc5f0acbe2f7e252aba719effad26c756c1d025c80c62ef295fba52837398c  codec_decoder.pte
```
