younghan-meta committed · Commit e4e9882 · verified · 1 Parent(s): 39913d6

Add MLX model card

Files changed (1)
  1. README.md +167 -0
README.md ADDED
@@ -0,0 +1,167 @@
+ ---
+ license: apache-2.0
+ base_model: mistralai/Voxtral-4B-TTS-2603
+ pipeline_tag: text-to-speech
+ library_name: executorch
+ tags:
+ - ExecuTorch
+ - mlx
+ - apple-silicon
+ - tts
+ - voxtral
+ - on-device
+ - text-to-speech
+ ---
+
+ # Voxtral-4B-TTS-2603-ExecuTorch-MLX
+
+ Pre-exported ExecuTorch artifacts for
+ [Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603)
+ with the **MLX backend** for Apple Silicon. The LM decoder and flow head use
+ bf16 precision with 4-bit weight-only linear quantization and 8-bit embedding
+ quantization. The codec decoder is exported unquantized and lowered natively to
+ MLX.
+
+ This repository is the Apple Silicon companion to the CUDA artifact repo:
+ [younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-CUDA](https://huggingface.co/younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-CUDA).
+
+ ## Overview
+
+ The pipeline has two stages: **export** (Python, run once) and **inference**
+ (C++ runner, run repeatedly). This repo ships the export outputs so you can skip
+ straight to inference on a locally built ExecuTorch MLX runner.
+
+ The model has three components:
+
+ 1. **Mistral 4B LLM decoder**: autoregressive; maps text to hidden states
+ 2. **Flow Matching Head**: maps hidden states to 37 audio codebook tokens per frame
+ 3. **Codec Decoder**: maps codebook tokens to a 24 kHz mono waveform
+
+ ## Files
+
+ | File | Size | Contents |
+ |---|---:|---|
+ | `model.pte` | 2.20 GiB | LM decoder, token embedding, audio embedding, semantic head, and flow velocity methods lowered to MLX |
+ | `codec_decoder.pte` | 289 MiB | Native MLX codec decoder for waveform synthesis |
+
+ The tokenizer and voice embeddings are **not included**. Download them from the
+ base model so they match the upstream Voxtral release.
+
+ ## Performance
+
+ Validated on Apple Silicon with `seed=42` and prompt
+ `"Hello, how are you today?"`. Generation RTF (real-time factor) is generation
+ time divided by audio duration, so values below 1 are faster than real time.
+
+ | Config | Audio | Generate time | Generation RTF | Process wall | Notes |
+ |---|---:|---:|---:|---:|---|
+ | MLX bf16 + 4w linear + 8w embedding | 3.44 s | 3132 ms | 0.910465 | 5.19 s | first measured run |
+ | MLX bf16 + 4w linear + 8w embedding | 3.44 s | 2634 ms | 0.765698 | 3.15 s | warm run |
+ | MLX bf16 + 4w linear + 8w embedding | 3.44 s | 2607 ms | 0.757849 | 3.13 s | warm run |
+
+ Average generation RTF over all three runs: `0.811337` (`0.761774` warm-run
+ average). Average process wall time: `3.82 s` (`3.14 s` warm-run average).
+ WAV quality check: peak `0.42575`, clipped samples `0`. Apple Speech
+ transcribed the generated sample as `Hello how are you today`.
+
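The RTF figures above can be reproduced from the table. The following is a minimal Python sketch, not part of the runner: `real_time_factor` is a hypothetical helper name, and it assumes the rounded `3.44 s` audio length from the table, so the last digits may differ slightly from values computed on the exact sample count.

```python
from statistics import mean

# Values copied from the table above.
AUDIO_SECONDS = 3.44              # rounded audio duration (assumption: exact length may differ)
GENERATE_MS = [3132, 2634, 2607]  # first measured run, then two warm runs


def real_time_factor(generate_ms: float, audio_seconds: float) -> float:
    """Generation time divided by audio duration; below 1.0 is faster than real time."""
    return (generate_ms / 1000.0) / audio_seconds


rtfs = [real_time_factor(ms, AUDIO_SECONDS) for ms in GENERATE_MS]
print(f"all runs:  {mean(rtfs):.4f}")
print(f"warm runs: {mean(rtfs[1:]):.4f}")
```

Reporting the warm-run average separately keeps one-time startup cost in the first run from skewing the steady-state number.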
+ ## Prerequisites
+
+ - macOS on Apple Silicon.
+ - ExecuTorch built from source with `EXECUTORCH_BUILD_MLX=ON`.
+ - Tokenizer and voice embeddings from
+   [mistralai/Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603).
+
+ ```bash
+ git clone https://github.com/pytorch/executorch ~/executorch
+ cd ~/executorch
+
+ ./install_executorch.sh
+ pip install -e . --no-build-isolation
+ make voxtral_tts-mlx
+ ```
+
+ The native codec artifacts were validated against ExecuTorch source commit:
+
+ ```text
+ 8ba124624c33fcf12223755d2060b2b7bc739ea8
+ ```
+
+ ## Download
+
+ ```bash
+ pip install huggingface_hub
+
+ # ExecuTorch MLX artifacts.
+ hf download younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-MLX \
+   --local-dir voxtral_tts_mlx
+
+ # Tokenizer + voice embeddings from the base model.
+ # The glob is quoted so the shell (zsh by default on macOS) does not expand it.
+ hf download mistralai/Voxtral-4B-TTS-2603 \
+   tekken.json "voice_embedding/*" \
+   --local-dir voxtral_tts_base
+ ```
+
+ ## Run
+
+ ```bash
+ unset CPATH
+
+ cmake-out/examples/models/voxtral_tts/voxtral_tts_runner \
+   --model voxtral_tts_mlx/model.pte \
+   --codec voxtral_tts_mlx/codec_decoder.pte \
+   --tokenizer voxtral_tts_base/tekken.json \
+   --voice voxtral_tts_base/voice_embedding/neutral_female.pt \
+   --text "Hello, how are you today?" \
+   --output output.wav \
+   --seed 42 \
+   --max_new_tokens 200
+ ```
+
+ Output is 24 kHz mono 16-bit PCM. Listen with:
+
+ ```bash
+ ffplay output.wav
+ ```
+
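Since the output is 16-bit PCM, a quick sanity check like the peak and clipped-sample figures quoted under Performance can be sketched with the Python standard library. This is an illustration only: `wav_peak_and_clipped` is a hypothetical helper, and it assumes that "clipped" means a sample sitting at the int16 full-scale limits, which may not be how the figures in this card were measured.

```python
import array
import sys
import wave


def wav_peak_and_clipped(path: str) -> tuple[float, int]:
    """Return (peak amplitude in [0, 1], count of full-scale samples) for a 16-bit PCM WAV."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM")
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    if not samples:
        return 0.0, 0
    peak = max(abs(s) for s in samples) / 32768.0
    # Assumption: a sample counts as clipped when it hits either int16 limit.
    clipped = sum(1 for s in samples if s in (-32768, 32767))
    return peak, clipped
```

Calling `wav_peak_and_clipped("output.wav")` after the run above returns a `(peak, clipped)` pair; a peak well below 1.0 with zero clipped samples suggests the waveform has headroom.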
+ ## Streaming
+
+ Add `--streaming` to emit codec output in chunks instead of one batch at the
+ end. Pair it with `--speaker` to pipe raw `f32le` PCM to stdout for live
+ playback:
+
+ ```bash
+ cmake-out/examples/models/voxtral_tts/voxtral_tts_runner \
+   --model voxtral_tts_mlx/model.pte \
+   --codec voxtral_tts_mlx/codec_decoder.pte \
+   --tokenizer voxtral_tts_base/tekken.json \
+   --voice voxtral_tts_base/voice_embedding/neutral_female.pt \
+   --text "Introducing real-time Voxtral TTS streaming on Apple Silicon with the ExecuTorch MLX backend." \
+   --seed 42 \
+   --max_new_tokens 200 \
+   --streaming \
+   --speaker \
+   | ffplay -f f32le -sample_rate 24000 -ch_layout mono -nodisp -autoexit -
+ ```
+
+ For `aplay` instead: `... | aplay -f FLOAT_LE -r 24000 -c 1`.
+
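If the raw stream is captured to a file rather than played live, it can be wrapped into a WAV afterwards. The sketch below assumes the stream is exactly what the text above describes (little-endian float32, mono, 24 kHz) and that samples are nominally in [-1, 1]; `f32le_to_wav` is a hypothetical helper, not something the runner provides.

```python
import array
import sys
import wave


def f32le_to_wav(raw: bytes, wav_path: str, sample_rate: int = 24000) -> int:
    """Write raw little-endian float32 mono PCM as a 16-bit WAV; return the sample count."""
    usable = len(raw) - len(raw) % 4  # drop any partial trailing sample
    floats = array.array("f", raw[:usable])
    if sys.byteorder == "big":
        floats.byteswap()  # input is little-endian
    # Clamp to [-1, 1] (assumed nominal range) and scale to int16.
    ints = array.array("h", (int(max(-1.0, min(1.0, x)) * 32767) for x in floats))
    if sys.byteorder == "big":
        ints.byteswap()  # WAV sample data is little-endian
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(ints.tobytes())
    return len(ints)
```

For example, redirecting the runner's stdout to a file such as `stream.f32` and then calling `f32le_to_wav(open("stream.f32", "rb").read(), "stream.wav")` would produce a file playable with the same `ffplay` command used for `output.wav`.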
+ ## Re-export
+
+ ```bash
+ python examples/models/voxtral_tts/export_voxtral_tts.py \
+   --model-path ~/models/Voxtral-4B-TTS-2603 \
+   --backend mlx \
+   --dtype bf16 \
+   --qlinear 4w \
+   --qembedding 8w \
+   --output-dir ./voxtral_tts_exports_mlx_4w
+ ```
+
+ `--qembedding 8w` auto-selects `--qembedding-group-size=128`. `--qlinear-codec`
+ is not yet validated for MLX, so this export keeps the codec unquantized.
+
+ ## Checksums
+
+ ```text
+ 75597b9b364defaef5db7ade0b77cc11e523e958764d19344e4aa1412ffefa41  model.pte
+ 53cc5f0acbe2f7e252aba719effad26c756c1d025c80c62ef295fba52837398c  codec_decoder.pte
+ ```
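
The digests above are 64 hex characters, which matches SHA-256; assuming that is the algorithm used, a downloaded copy can be checked with a short Python sketch. `sha256_of` and `verify` are hypothetical helper names, and the directory passed in would be whatever `--local-dir` was used for the download.

```python
import hashlib
from pathlib import Path

# Digests copied from the checksum listing above (assumed to be SHA-256).
EXPECTED_SHA256 = {
    "model.pte": "75597b9b364defaef5db7ade0b77cc11e523e958764d19344e4aa1412ffefa41",
    "codec_decoder.pte": "53cc5f0acbe2f7e252aba719effad26c756c1d025c80c62ef295fba52837398c",
}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-GiB artifacts are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory: str, expected: dict[str, str] = EXPECTED_SHA256) -> dict[str, bool]:
    """Map each expected file name to True if it exists and its digest matches."""
    root = Path(directory)
    return {
        name: (root / name).is_file() and sha256_of(root / name) == want
        for name, want in expected.items()
    }
```

Calling `verify("voxtral_tts_mlx")` returns a per-file result, so a truncated or missing download shows up as `False` for that file rather than as an exception.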