---
license: apache-2.0
library_name: transformers
pipeline_tag: text-to-speech
tags:
- text-to-speech
- voice-cloning
- custom_code
- moss-tts
- moss-tts-local
- arxiv:2603.18090
language:
- zh
- yue
- en
- ar
- cs
- da
- de
- nl
- es
- fr
- fi
- el
- he
- hi
- hu
- ja
- it
- ko
- mk
- ms
- ru
- fa
- pl
- pt
- sv
- ro
- sw
- tl
- th
- tr
- vi
---
# MOSS-TTS Family
# MOSS-TTS-Local-Transformer-v1.5
**MOSS-TTS-Local-Transformer-v1.5** is continued from [MOSS-TTS-Local-Transformer-v1.0](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer). It preserves the main 1.0 capabilities, including zero-shot voice cloning, long-form speech generation, token-level duration control, Pinyin/IPA pronunciation control, multilingual synthesis, and code-switching. For the full 1.0 feature walkthrough, input schema, and evaluation tables, please refer to the [MOSS-TTS-Local-Transformer-v1.0 README](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer).
Compared with [MOSS-TTS-Local-Transformer-v1.0](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer), v1.5 focuses on the following improvements:
- **Higher-fidelity stereo audio modeling**: v1.5 uses [MOSS-Audio-Tokenizer-v2](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer-v2) as the audio tokenizer, supporting native 48 kHz stereo input and output for richer spatial detail and more natural perceived audio quality. Since the codec output is stereo, save the `[channels, samples]` tensor returned by `processor.decode(...)` directly.
- **Stronger multilingual synthesis with language tags**: when the `language` field is omitted, v1.5 may improve some languages and regress slightly on others compared with 1.0. When the language is specified, v1.5 is stronger than 1.0 on almost all supported languages. Set the tag when building the user message, for example `processor.build_user_message(text=text_fr, language="French")`.
- **More stable voice cloning**: v1.5 improves speaker similarity and reduces cloning variance, making repeated generations more consistent.
- **Better long-reference, short-text cloning**: v1.5 handles scenarios where the reference audio is much longer than the target text more reliably than 1.0.
- **More stable punctuation-following prosody**: v1.5 follows punctuation-driven pauses more closely, especially in long sentences.
- **Explicit pause control**: v1.5 supports inline pause markers such as `"[pause 3.2s]"`. For example, `我今天学习了一首中国的古诗,它的名字是[pause 3.2s]静夜思!` inserts an explicit 3.2s pause before `静夜思`.
## Supported Languages
MOSS-TTS Local Transformer v1.5 supports **31 languages**. It keeps the 20 languages supported by [MOSS-TTS-Local-Transformer-v1.0](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer) and extends multilingual continued training to additional languages including Cantonese, Dutch, Finnish, Hindi, Macedonian, Malay, Romanian, Swahili, Tagalog, Thai, and Vietnamese.
| Language | Code | Flag | Language | Code | Flag | Language | Code | Flag |
|---|---|---|---|---|---|---|---|---|
| Chinese | zh | 🇨🇳 | Cantonese | yue | 🇭🇰 | English | en | 🇺🇸 |
| Arabic | ar | 🇸🇦 | Czech | cs | 🇨🇿 | Danish | da | 🇩🇰 |
| Dutch | nl | 🇳🇱 | Finnish | fi | 🇫🇮 | French | fr | 🇫🇷 |
| German | de | 🇩🇪 | Greek | el | 🇬🇷 | Hebrew | he | 🇮🇱 |
| Hindi | hi | 🇮🇳 | Hungarian | hu | 🇭🇺 | Italian | it | 🇮🇹 |
| Japanese | ja | 🇯🇵 | Korean | ko | 🇰🇷 | Macedonian | mk | 🇲🇰 |
| Malay | ms | 🇲🇾 | Persian (Farsi) | fa | 🇮🇷 | Polish | pl | 🇵🇱 |
| Portuguese | pt | 🇵🇹 | Romanian | ro | 🇷🇴 | Russian | ru | 🇷🇺 |
| Spanish | es | 🇪🇸 | Swahili | sw | 🇹🇿 | Swedish | sv | 🇸🇪 |
| Tagalog | tl | 🇵🇭 | Thai | th | 🇹🇭 | Turkish | tr | 🇹🇷 |
| Vietnamese | vi | 🇻🇳 | | | | | | |
## Quick Start
### Environment Setup
We recommend a clean, isolated Python environment with **Transformers 5.0.0**, or a recent Transformers version with Qwen3 support, to avoid dependency conflicts.
```bash
conda create -n moss-tts python=3.12 -y
conda activate moss-tts
```
Install all required dependencies:
```bash
git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]"
```
#### (Optional) Install FlashAttention 2
For better speed and lower GPU memory usage, you can install FlashAttention 2 if your hardware supports it.
```bash
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[flash-attn]" --no-build-isolation
```
If your machine has limited RAM and many CPU cores, you can cap build parallelism:
```bash
MAX_JOBS=4 pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[flash-attn]" --no-build-isolation
```
Notes:
- Dependencies are managed in `pyproject.toml`, which currently pins `torch==2.9.1+cu128` and `torchaudio==2.9.1+cu128`.
- If FlashAttention 2 fails to build on your machine, you can skip it and use the default attention backend.
- FlashAttention 2 is only available on supported GPUs and is typically used with `torch.float16` or `torch.bfloat16`.
### Basic Usage
> Tip: MOSS-TTS-Local-Transformer-v1.5 uses a fixed 12-codebook RVQ depth. Do not set `n_vq_for_inference` to a value different from `config.n_vq`.
MOSS-TTS-Local-Transformer-v1.5 provides the standard Hugging Face `AutoProcessor` and `AutoModel` interface. The examples below cover:
1. Direct generation with language tags
2. Voice cloning
3. Duration control
4. Explicit pause control with `[pause X.Ys]`
```python
from pathlib import Path
from tqdm import tqdm
import importlib.util
import torch
import torchaudio
from transformers import AutoModel, AutoProcessor
# Disable the broken cuDNN SDPA backend on some CUDA/PyTorch combinations.
torch.backends.cuda.enable_cudnn_sdp(False)
# Keep these enabled as fallbacks.
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)
pretrained_model_name_or_path = "OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32
def resolve_attn_implementation() -> str:
# Prefer FlashAttention 2 when package + device conditions are met.
if (
device == "cuda"
and importlib.util.find_spec("flash_attn") is not None
and dtype in {torch.float16, torch.bfloat16}
):
major, _ = torch.cuda.get_device_capability()
if major >= 8:
return "flash_attention_2"
# CUDA fallback: use PyTorch SDPA kernels.
if device == "cuda":
return "sdpa"
# CPU fallback.
return "eager"
attn_implementation = resolve_attn_implementation()
print(f"[INFO] Using attn_implementation={attn_implementation}")
processor = AutoProcessor.from_pretrained(
pretrained_model_name_or_path,
trust_remote_code=True,
)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)
text_zh = "亲爱的你,愿你的每一天都值得被记住,也值得被珍惜。"
text_en = "We stand on the threshold of the AI era, where intelligence becomes an extension of human creativity."
text_fr = "Bonjour, je voudrais essayer une voix francaise naturelle et stable."
text_pause = "我今天学习了一首中国的古诗,它的名字是[pause 3.2s]静夜思!"
# Use remote demo audio to avoid requiring local assets.
ref_audio_zh = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_zh.wav"
ref_audio_en = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_en.m4a"
conversations = [
# Direct TTS. Language tags are recommended in v1.5 when the language is known.
[processor.build_user_message(text=text_zh, language="Chinese")],
[processor.build_user_message(text=text_en, language="English")],
[processor.build_user_message(text=text_fr, language="French")],
# Explicit pause control. Use [pause X.Ys], such as [pause 3.2s].
[processor.build_user_message(text=text_pause, language="Chinese")],
# Voice cloning with a reference audio.
[processor.build_user_message(text=text_zh, reference=[ref_audio_zh], language="Chinese")],
[processor.build_user_message(text=text_en, reference=[ref_audio_en], language="English")],
# Duration control. At 12.5 frames per second, 125 frames is about 10 seconds.
[processor.build_user_message(text=text_en, tokens=125, language="English")],
]
model = AutoModel.from_pretrained(
pretrained_model_name_or_path,
trust_remote_code=True,
attn_implementation=attn_implementation,
torch_dtype=dtype,
).to(device)
model.eval()
batch_size = 1
save_dir = Path("inference_root_moss_tts_local_v1_5")
save_dir.mkdir(exist_ok=True, parents=True)
sample_idx = 0
with torch.no_grad():
for start in tqdm(range(0, len(conversations), batch_size)):
batch_conversations = conversations[start : start + batch_size]
batch = processor(batch_conversations, mode="generation")
input_ids = batch["input_ids"].to(device)
attention_mask = batch["attention_mask"].to(device)
outputs = model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
max_new_tokens=4096,
do_sample=True,
audio_temperature=1.7,
audio_top_p=0.8,
audio_top_k=25,
audio_repetition_penalty=1.0,
)
for message in processor.decode(outputs):
if message is None:
continue
audio = message.audio_codes_list[0]
out_path = save_dir / f"sample{sample_idx}.wav"
sample_idx += 1
# MOSS-TTS Local v1.5 codec returns stereo audio as [channels, samples].
# Save the two-channel tensor directly.
torchaudio.save(str(out_path), audio, processor.model_config.sampling_rate)
```
## Generation Parameters
| Parameter | Recommended | Description |
|---|---:|---|
| `audio_temperature` | `1.7` | Sampling temperature for audio RVQ layers. |
| `audio_top_p` | `0.8` | Nucleus sampling cutoff for audio RVQ layers. |
| `audio_top_k` | `25` | Top-k sampling cutoff for audio RVQ layers. |
| `audio_repetition_penalty` | `1.0` | Penalty for repeated acoustic token patterns. |
| `n_vq_for_inference` | `12` | Fixed by this release. Values other than `config.n_vq` are rejected. |
## Notes
- This repository uses Hugging Face remote code. Load it with `trust_remote_code=True`.
- The MOSS-TTS-Local-Transformer-v1.5 codec is stereo. `processor.decode(...)` returns audio tensors shaped as `[channels, samples]`, so save them directly with `torchaudio.save(path, audio, sampling_rate)`.
- Audio encoding and decoding use `OpenMOSS-Team/MOSS-Audio-Tokenizer-v2`.
- The model configuration sets `sampling_rate` to 48000 and `n_vq` to 12.
- If FlashAttention 2 is unavailable, the example falls back to SDPA on CUDA and eager attention on CPU.
## SGLang Usage
You can serve MOSS-TTS-Local-Transformer-v1.5 with [SGLang-Omni](https://github.com/sgl-project/sglang-omni), which exposes an OpenAI-compatible `/v1/audio/speech` API for reference-less synthesis, zero-shot voice cloning, streaming, duration control, and language/style hints.
See the [MOSS-TTS-Local cookbook](https://github.com/sgl-project/sglang-omni/blob/main/docs/cookbook/moss_tts_local.md) for installation, full API details, deployment config, benchmarking, and limitations.
### Install and Serve
Install `sglang-omni` by following the [SGLang-Omni installation guide](https://sgl-project.github.io/sglang-omni/get_started/installation.html), then download and serve the model:
```bash
hf download OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5
sgl-omni serve \
--model-path OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5 \
--port 8000
```
A matching config file is available in SGLang-Omni at `examples/configs/moss_tts_local.yaml`.
### Basic Speech
```bash
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input": "SGLang-Omni is a great project!"}' \
--output output.wav
```
### Voice Cloning
Provide a reference clip and its transcript for better speaker similarity. `audio_path` may be a local path readable by the server, an HTTP(S) URL, or a base64 data URI.
```bash
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "SGLang-Omni is a great project!",
"references": [{
"audio_path": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav"
}]
}' \
--output output.wav
```
`ref_audio` and `ref_text` are accepted as shorthand for `references[0].audio_path` and `references[0].text`.
### Streaming
Set `"stream": true`, `"response_format": "pcm"`, and `"stream_format": "audio"` to receive raw 48 kHz PCM chunks. Pipe the stream through `ffmpeg` to write a playable WAV file:
```bash
curl -N -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "Get the trust fund to the bank early.",
"ref_audio": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
"stream": true,
"response_format": "pcm",
"stream_format": "audio"
}' \
| ffmpeg -f s16le -ar 48000 -ac 1 -i pipe:0 output_stream.wav
```
### Duration, Markup, and Language
Duration can be guided with an inline `${token:N}` prefix or with `token_count` / `duration_tokens`. Inline markup such as `[pause 0.5s]`, Pinyin, and IPA is passed through unchanged. Use `language` to hint the target language and `instructions` for free-form style guidance.
```bash
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "${token:150}今天天气不错 [pause 0.5s] 就该出去晒晒太阳。",
"ref_audio": "https://huggingface.co/datasets/zhaochenyang20/seed-tts-eval-mini/resolve/main/en/prompt-wavs/common_voice_en_10119832.wav",
"language": "Chinese"
}' \
--output output_markup.wav
```
## More Usage
MOSS-TTS-Local-Transformer-v1.5 is API-compatible with MOSS-TTS-Local-Transformer-v1.0. For continuation with prefix audio, detailed `UserMessage` and `AssistantMessage` fields, generation hyperparameters, Pinyin/IPA preprocessing examples, and evaluation results, see the [MOSS-TTS-Local-Transformer-v1.0](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer).
## Citation
If you use this model, please cite the [MOSS-TTS Technical Report](https://arxiv.org/abs/2603.18090).