Qwen3-TTS 12Hz 0.6B Base — GGUF (CrispASR)

GGUF / ggml conversions of Qwen/Qwen3-TTS-12Hz-0.6B-Base for use with the qwen3-tts backend in CrispStrobe/CrispASR.

Qwen3-TTS 12Hz 0.6B Base is Qwen's multilingual voice-cloning TTS model:

10 supported languages: zh en ja ko de fr ru pt es it
discrete multi-codebook LM architecture with a separate 12 Hz tokenizer / codec
runtime voice cloning from (ref_audio, ref_text) or pre-baked voice-pack GGUFs
Apache-2.0 licence

This repo contains the talker / code-predictor / speaker-encoder model. It must be used together with the separate tokenizer / codec GGUF from cstr/qwen3-tts-tokenizer-12hz-GGUF.

Files

File	Size	Notes
`qwen3-tts-12hz-0.6b-base.gguf`	1.7 GB	F16, reference baseline
`qwen3-tts-12hz-0.6b-base-q8_0.gguf`	940 MB	Q8_0, recommended quantised talker
`qwen3-tts-12hz-0.6b-base-q4_k.gguf`	508 MB	Q4_K, smallest — quality regression is significant; only ship if disk space is the binding constraint (see Quantisation Notes)

Quick Start

Build CrispASR:

git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc) --target crispasr

Download the talker + tokenizer:

huggingface-cli download cstr/qwen3-tts-0.6b-base-GGUF \
    qwen3-tts-12hz-0.6b-base-q8_0.gguf --local-dir .

huggingface-cli download cstr/qwen3-tts-tokenizer-12hz-GGUF \
    qwen3-tts-tokenizer-12hz.gguf --local-dir .

Voice clone from a reference WAV:

./build/bin/crispasr \
    --backend qwen3-tts \
    -m qwen3-tts-12hz-0.6b-base-q8_0.gguf \
    --codec-model qwen3-tts-tokenizer-12hz.gguf \
    --voice clone.wav \
    --ref-text "Exact transcript of clone.wav" \
    --tts "Hello there" \
    --tts-output hello.wav

Use a baked voice-pack GGUF:

./build/bin/crispasr \
    --backend qwen3-tts \
    -m qwen3-tts-12hz-0.6b-base-q8_0.gguf \
    --codec-model qwen3-tts-tokenizer-12hz.gguf \
    --voice my-voice-pack.gguf \
    --tts "Hello there" \
    --tts-output hello.wav

When --voice points to a .wav, --ref-text is required. When --voice points to a .gguf, it is treated as a baked voice pack.

Quantisation Notes

Current CrispASR validation status:

qwen3-tts-12hz-0.6b-base.gguf
- reference baseline
qwen3-tts-12hz-0.6b-base-q8_0.gguf
- recommended quantised deployment
- end-to-end synthesis is audibly good in current CrispASR testing
- intermediate activations still drift measurably from F16 in strict tensor diffs
qwen3-tts-12hz-0.6b-base-q4_k.gguf
- smallest variant; loads end-to-end
- content fidelity is significantly worse than Q8_0. In a fixed- seed back-to-back A/B (JFK voice prompt, "Hello world, this is a quick speed benchmark for the qwen three TTS pipeline.", 8 s output), the F16 talker rendered the first ~5 words correctly, Q8_0 rendered the first ~4 words, and Q4_K produced audio that a strong ASR (Qwen3-ASR-0.6B) could not classify as English. Q4_K may recover when given more frames (longer warmup) or longer prompts, but for short utterances it is not a drop-in replacement for Q8_0.
- not numerically faithful to F16 in strict tensor diffs.
- other lower-bit talker quants (q6_k, q5_k) are similarly experimental and not currently shipped — convert with crispasr-quantize if you need them.

If fidelity matters more than memory:

use qwen3-tts-12hz-0.6b-base.gguf
keep the companion tokenizer / codec at F16

If you want the best currently-tested size / quality trade-off:

use qwen3-tts-12hz-0.6b-base-q8_0.gguf
still keep qwen3-tts-tokenizer-12hz.gguf at F16

If disk space is the binding constraint:

use qwen3-tts-12hz-0.6b-base-q4_k.gguf
still keep qwen3-tts-tokenizer-12hz.gguf at F16 (quantising the codec hurts earlier than quantising the talker — runtime_ref_codes is sensitive)

How this was made

The upstream HF safetensors checkpoint was converted to GGUF F16 by models/convert-qwen3-tts-to-gguf.py.
Quantised variants are produced with CrispASR's GGUF quantiser.
Inference is implemented in src/qwen3_tts.cpp, using ggml graphs for the talker / code-predictor path and the companion tokenizer GGUF for codec encode/decode.

Reference implementation

Architecture and behaviour were checked against the official Qwen release:

upstream model card: Qwen/Qwen3-TTS-12Hz-0.6B-Base
upstream repository: QwenLM/Qwen3-TTS
technical report: Qwen3-TTS Technical Report

The CrispASR runtime is a clean C++ / ggml re-implementation for this repo's backend stack.

Companion tokenizer / codec GGUF: cstr/qwen3-tts-tokenizer-12hz-GGUF
Upstream Base model: Qwen/Qwen3-TTS-12Hz-0.6B-Base
C++ runtime: CrispStrobe/CrispASR

License

Apache-2.0, inherited from the base model.

Downloads last month: 419

GGUF

Model size

0.9B params

Architecture

qwen3tts

Hardware compatibility

8-bit

View +1 variant

Model tree for cstr/qwen3-tts-0.6b-base-GGUF

Base model

Qwen/Qwen3-TTS-12Hz-0.6B-Base

Quantized

(16)

this model

Paper for cstr/qwen3-tts-0.6b-base-GGUF

Qwen3-TTS Technical Report

Paper • 2601.15621 • Published Jan 22 • 76

cstr
/

qwen3-tts-0.6b-base-GGUF