---
license: apache-2.0
base_model: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
pipeline_tag: text-to-speech
tags:
  - text-to-speech
  - tts
  - qwen3-tts
  - tensorrt
  - tensorrt-edge-llm
  - onnx
language:
  - en
  - zh
  - ja
  - ko
  - de
  - fr
  - ru
  - pt
  - es
  - it
---

# Qwen3-TTS-12Hz-1.7B-CustomVoice — TensorRT-Edge-LLM ONNX (FP16)

ONNX export of **[Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice)**
for use with **[NVIDIA TensorRT-Edge-LLM](https://github.com/NVIDIA/TensorRT-Edge-LLM) v0.7.1**. Weights are stored in FP16.

> ⚠️ **These files are NOT a plug-and-play model.**
> They are the *intermediate ONNX* consumed by Edge-LLM's engine builder. The Talker/CodePredictor
> graphs contain Edge-LLM's custom `AttentionPlugin` op and a runtime-bound `lm_head`, so they **will
> not run in stock ONNX Runtime or generic TensorRT**. You must use TensorRT-Edge-LLM: build engines
> on your device, then run its `qwen3_tts_inference`. See **Usage**.

## Contents
```
llm/             # Talker (TalkerCausalLM) — model.onnx + model.onnx.data (FP16, ~2.83 GB) + sidecars
code_predictor/  # CodePredictor (residual RVQ codebooks) — model.onnx + .data + lm_heads / codec_embeddings / ...
code2wav/        # Code2Wav vocoder — model.onnx + .data
```
Keep each `model.onnx` in the same directory as its `model.onnx.data` (external weights) and its
sidecar files (`*.safetensors`, `tokenizer.json`).
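
The layout rule above can be checked before building. The following is a minimal sketch, not part of TensorRT-Edge-LLM: `check_onnx_dir` is a hypothetical helper name, and it only checks for `model.onnx` and `model.onnx.data`, since the exact set of sidecar files per directory is not listed here.

```bash
# Hypothetical helper (not an Edge-LLM tool): confirm that an export
# directory still has model.onnx and its external-weights file together.
check_onnx_dir() {
  dir=$1
  status=0
  for f in model.onnx model.onnx.data; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      status=1
    fi
  done
  # 0 if both files are present, 1 otherwise
  return "$status"
}
```

From the repository root, something like `for d in llm code_predictor code2wav; do check_onnx_dir "$d"; done` would report any directory where the pair has been split up.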

## Requirements
- TensorRT-Edge-LLM **v0.7.1** (version matters — the ONNX op/loader conventions are version-specific).
- An NVIDIA GPU with CUDA/TensorRT supported by Edge-LLM.

## Usage (build engines on your device, then run)
```bash
# Build TensorRT-Edge-LLM v0.7.1, then point at its plugin:
export EDGELLM_PLUGIN_PATH=$PWD/build/libNvInfer_edgellm_plugin.so

# Build the 3 engines from this ONNX. Engines are specific to the GPU they are built on; expect ~5 min:
./build/examples/llm/llm_build          --onnxDir llm            --engineDir engines/talker         --maxInputLen 4096 --maxKVCacheCapacity 4096 --maxBatchSize 1
./build/examples/llm/llm_build          --onnxDir code_predictor --engineDir engines/code_predictor --maxInputLen 4096 --maxKVCacheCapacity 4096 --maxBatchSize 1
./build/examples/multimodal/audio_build --onnxDir code2wav       --engineDir engines/code2wav

# Run inference (input.json: speaker + messages; see Edge-LLM TTS docs):
./build/examples/omni/qwen3_tts_inference \
  --talkerEngineDir engines/talker --code2wavEngineDir engines/code2wav/code2wav \
  --tokenizerDir llm --inputFile input.json --outputAudioDir out
# -> out/audio_req0.wav (24 kHz)
```
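
Since the commands above rely on `EDGELLM_PLUGIN_PATH`, it may help to confirm the variable points at a real file before starting a build. This is a small preflight sketch under that assumption; `check_edgellm_plugin` is a hypothetical name, and the function only tests that the file exists, not that it is a loadable plugin.

```bash
# Hypothetical preflight helper: succeed only if EDGELLM_PLUGIN_PATH is set
# and names an existing regular file.
check_edgellm_plugin() {
  if [ -z "${EDGELLM_PLUGIN_PATH:-}" ]; then
    echo "EDGELLM_PLUGIN_PATH is not set" >&2
    return 1
  fi
  if [ ! -f "$EDGELLM_PLUGIN_PATH" ]; then
    echo "plugin not found: $EDGELLM_PLUGIN_PATH" >&2
    return 1
  fi
}
```

Running `check_edgellm_plugin || exit 1` at the top of a build script stops early with a clear message instead of failing inside the builder.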

**Speakers:** `ryan, serena, aiden, vivian, dylan, eric, uncle_fu, ono_anna, sohee`
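
A speaker name can be checked against this list before it is written into `input.json`. The sketch below assumes names must match exactly as spelled above (lowercase, with underscores); `is_known_speaker` is a hypothetical helper, not an Edge-LLM command.

```bash
# Hypothetical helper: return 0 if the argument is one of the speakers
# listed in this model card, 1 otherwise. Exact-match is an assumption.
is_known_speaker() {
  case "$1" in
    ryan|serena|aiden|vivian|dylan|eric|uncle_fu|ono_anna|sohee) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_known_speaker "$SPEAKER" || echo "unknown speaker: $SPEAKER" >&2` flags a typo before an inference run is started.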

## Notes
- These files can be regenerated from the base model with the Edge-LLM export tools:
  `python -m llm_loader.export_all_cli Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice <out_dir>`.
- Some GPU architectures may require runtime or kernel adjustments in Edge-LLM to produce correct
  output. Verify the generated audio (e.g. transcribe it) before relying on it.
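
Before transcribing, a cheap first check is whether the output file's header reports the expected 24 kHz sample rate. The sketch below is not a substitute for listening to or transcribing the audio. It assumes a canonical PCM WAV header (the `fmt ` chunk comes first, so the sample rate sits at byte offset 24) and a little-endian host; `wav_sample_rate` is a hypothetical helper name.

```bash
# Hypothetical helper: print the sample rate stored in a WAV header.
# Assumes the canonical layout where the 4-byte little-endian sample rate
# is at byte offset 24, and that the host itself is little-endian.
wav_sample_rate() {
  od -An -t u4 -j 24 -N 4 "$1" | tr -d ' \n'
}
```

With the run shown in Usage, `wav_sample_rate out/audio_req0.wav` would be expected to print `24000`; any other value suggests the file was not written as described.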

## License & attribution
Apache-2.0, inherited from the base model **Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice** (© Alibaba / Qwen).
This repository redistributes an ONNX conversion of those weights. Please cite Qwen.