---
license: apache-2.0
base_model: mistralai/Voxtral-4B-TTS-2603
pipeline_tag: text-to-speech
library_name: executorch
tags:
- ExecuTorch
- mlx
- apple-silicon
- tts
- voxtral
- on-device
- text-to-speech
---

# Voxtral-4B-TTS-2603-ExecuTorch-MLX

Pre-exported ExecuTorch artifacts for
[Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603)
with the **MLX backend** for Apple Silicon. The LM decoder and flow head use
bf16 precision with 4-bit weight-only linear quantization and 8-bit embedding
quantization. The codec decoder is exported unquantized and lowered natively to
MLX.

This repository is the Apple Silicon companion to the CUDA artifact repo:
[younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-CUDA](https://huggingface.co/younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-CUDA).

## Overview

The pipeline has two stages: **export** (Python, once) and **inference**
(C++ runner, repeated). This repo ships the export outputs so you can skip
straight to inference on a locally built ExecuTorch MLX runner.

The model has three components:

1. **Mistral 4B LLM decoder**: autoregressively maps text tokens to hidden states
2. **Flow Matching Head**: maps hidden states to 37 audio codebook tokens per frame
3. **Codec Decoder**: decodes codebook tokens into a 24 kHz mono waveform

## Files

| File | Size | What |
|---|---:|---|
| `model.pte` | 2.20 GiB | LM decoder, token embedding, audio embedding, semantic head, and flow velocity methods lowered to MLX |
| `codec_decoder.pte` | 289 MiB | Native MLX codec decoder for waveform synthesis |

The tokenizer and voice embeddings are **not included**. Download them from the
base model so they match the upstream Voxtral release.

## Performance

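Validated on Apple Silicon with `seed=42` and the prompt
`"Hello, how are you today?"`. Generation RTF (real-time factor) is generate
time divided by audio duration, so values below 1 are faster than real time.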

| Config | Audio | Generate time | Generation RTF | Process wall | Notes |
|---|---:|---:|---:|---:|---|
| MLX bf16 + 4w linear + 8w embedding | 3.44 s | 2932 ms | 0.852326 | 4.20 s | refreshed after MLX indexing fix |
| MLX bf16 + 4w linear + 8w embedding | 3.44 s | 3132 ms | 0.910465 | 5.19 s | first measured run |
| MLX bf16 + 4w linear + 8w embedding | 3.44 s | 2634 ms | 0.765698 | 3.15 s | warm run |
| MLX bf16 + 4w linear + 8w embedding | 3.44 s | 2607 ms | 0.757849 | 3.13 s | warm run |

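The RTF column can be reproduced from the audio and generate-time columns. A
minimal sketch, assuming RTF is generate time over audio duration (which
matches the figures in the table); `real_time_factor` is an illustrative name,
not part of the runner:

```python
def real_time_factor(generate_ms: float, audio_s: float) -> float:
    """Return generate time divided by audio duration (both in seconds)."""
    return (generate_ms / 1000.0) / audio_s


# (generate time in ms, audio duration in s), copied from the table above.
runs = {
    "refreshed after MLX indexing fix": (2932, 3.44),
    "first measured run": (3132, 3.44),
    "warm run 1": (2634, 3.44),
    "warm run 2": (2607, 3.44),
}

for note, (generate_ms, audio_s) in runs.items():
    print(f"{note}: RTF {real_time_factor(generate_ms, audio_s):.6f}")
```
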
Latest WAV quality check: peak `0.425764`, clipped samples `0`. Apple Speech
transcribed the original generated sample as `Hello how are you today`.
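
A check of this kind can be scripted with the Python standard library. This is
a sketch, not the script used for the numbers above: it assumes the peak is the
largest absolute sample relative to full scale (32768 for 16-bit PCM) and that
a clipped sample is one sitting at an int16 limit. `wav_peak_and_clipped` is a
hypothetical helper name.

```python
import array
import sys
import wave


def wav_peak_and_clipped(path: str) -> tuple[float, int]:
    """Return (peak relative to full scale, clipped sample count) for a 16-bit PCM WAV."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        frames = wf.readframes(wf.getnframes())

    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    if not samples:
        return 0.0, 0

    # Assumption: full scale is 32768, clipping means hitting an int16 limit.
    peak = max(abs(s) for s in samples) / 32768.0
    clipped = sum(1 for s in samples if s in (-32768, 32767))
    return peak, clipped
```

For the file produced in the Run section, call
`wav_peak_and_clipped("output.wav")`.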

## Prerequisites

- macOS on Apple Silicon.
- ExecuTorch built from source with `EXECUTORCH_BUILD_MLX=ON`.
- Tokenizer and voice embeddings from
  [mistralai/Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603).

```bash
git clone https://github.com/pytorch/executorch ~/executorch
cd ~/executorch

./install_executorch.sh
pip install -e . --no-build-isolation
make voxtral_tts-mlx
```

The native codec artifacts were validated against ExecuTorch source commit:

```text
ba5b038400299a383dbe93ab394a30f42a953cc1
```

## Download

```bash
# The `hf` CLI ships with recent huggingface_hub releases, so upgrade if needed.
pip install -U huggingface_hub

# ExecuTorch MLX artifacts.
hf download younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-MLX \
    --local-dir voxtral_tts_mlx

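# Tokenizer + voice embeddings from the base model.
# The glob is quoted so zsh (the macOS default shell) passes it through to hf
# instead of failing with "no matches found".
hf download mistralai/Voxtral-4B-TTS-2603 \
    tekken.json "voice_embedding/*" \
    --local-dir voxtral_tts_base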
```

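## Run

The relative paths below assume you run from the ExecuTorch checkout
(`~/executorch` above) and that the downloaded directories sit alongside
`cmake-out/`. Adjust the paths if your layout differs.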

```bash
unset CPATH

cmake-out/examples/models/voxtral_tts/voxtral_tts_runner \
    --model voxtral_tts_mlx/model.pte \
    --codec voxtral_tts_mlx/codec_decoder.pte \
    --tokenizer voxtral_tts_base/tekken.json \
    --voice voxtral_tts_base/voice_embedding/neutral_female.pt \
    --text "Hello, how are you today?" \
    --output output.wav \
    --seed 42 \
    --max_new_tokens 200
```

Output is 24 kHz mono 16-bit PCM. Listen with `ffplay` (part of FFmpeg), or use
the built-in macOS player via `afplay output.wav` instead:

```bash
ffplay output.wav
```

## Streaming

Add `--streaming` to emit codec output in chunks instead of one batch at the
end. Pair it with `--speaker` to pipe raw `f32le` PCM to stdout for live
playback:

```bash
cmake-out/examples/models/voxtral_tts/voxtral_tts_runner \
    --model voxtral_tts_mlx/model.pte \
    --codec voxtral_tts_mlx/codec_decoder.pte \
    --tokenizer voxtral_tts_base/tekken.json \
    --voice voxtral_tts_base/voice_embedding/neutral_female.pt \
    --text "Introducing real-time Voxtral TTS streaming on Apple Silicon with the ExecuTorch MLX backend." \
    --seed 42 \
    --max_new_tokens 200 \
    --streaming \
    --speaker \
  | ffplay -f f32le -sample_rate 24000 -ch_layout mono -nodisp -autoexit -
```

`aplay` is part of ALSA on Linux and does not ship with macOS. Where it is
available, the equivalent is: `... | aplay -f FLOAT_LE -r 24000 -c 1`.
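
To keep a copy of streamed audio, the raw stream can be redirected to a file
(for example `> stream.f32` in place of the `ffplay` pipe) and converted
afterwards. The sketch below assumes the stream is mono 24 kHz `f32le`, as
described above. `f32le_to_wav` and the file names are illustrative and not
part of the runner.

```python
import array
import sys
import wave


def f32le_to_wav(raw: bytes, wav_path: str, sample_rate: int = 24000) -> int:
    """Write mono float32 little-endian PCM bytes as a 16-bit PCM WAV file.

    Returns the number of samples written.
    """
    usable = len(raw) - (len(raw) % 4)  # drop a trailing partial sample, if any
    floats = array.array("f")
    floats.frombytes(raw[:usable])
    if sys.byteorder == "big":
        floats.byteswap()  # input is little-endian

    # Clamp to [-1, 1] and scale to the int16 range.
    pcm = array.array(
        "h", (int(round(max(-1.0, min(1.0, s)) * 32767)) for s in floats)
    )
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian

    with wave.open(wav_path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm.tobytes())
    return len(pcm)


# Example (file names are placeholders):
#   with open("stream.f32", "rb") as f:
#       f32le_to_wav(f.read(), "stream.wav")
```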

## Re-export

```bash
python examples/models/voxtral_tts/export_voxtral_tts.py \
    --model-path ~/models/Voxtral-4B-TTS-2603 \
    --backend mlx \
    --dtype bf16 \
    --qlinear 4w \
    --qembedding 8w \
    --output-dir ./voxtral_tts_exports_mlx_4w
```

`--qembedding 8w` auto-selects `--qembedding-group-size=128`. `--qlinear-codec`
is not yet validated for MLX, so this export keeps the codec unquantized.

## Checksums

```text
904131ac1a1e3552ea4ada566c19eb57d654e662f93f906456aa1f8633825688  model.pte
162178ce94732db05bb74d7240a97f2c5a898b8819a29b5d59ebf076aeda8891  codec_decoder.pte
```
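
To check downloaded files against these digests, a minimal Python sketch
follows. It assumes the artifacts are in `voxtral_tts_mlx/` as in the download
step; `sha256_of` and `verify_artifacts` are illustrative helper names.

```python
import hashlib
import os

# Digests copied from the checksum list above.
EXPECTED = {
    "model.pte": "904131ac1a1e3552ea4ada566c19eb57d654e662f93f906456aa1f8633825688",
    "codec_decoder.pte": "162178ce94732db05bb74d7240a97f2c5a898b8819a29b5d59ebf076aeda8891",
}


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-GiB artifacts are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifacts(directory: str, expected: dict = EXPECTED) -> dict:
    """Map each expected file name to True if present with a matching digest."""
    results = {}
    for name, want in expected.items():
        path = os.path.join(directory, name)
        results[name] = os.path.isfile(path) and sha256_of(path) == want
    return results


# Example: print(verify_artifacts("voxtral_tts_mlx"))
```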