Instructions to use bosonai/higgs-tts-3-4b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use bosonai/higgs-tts-3-4b with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-to-speech", model="bosonai/higgs-tts-3-4b")# Load model directly from transformers import AutoModelForSeq2SeqLM model = AutoModelForSeq2SeqLM.from_pretrained("bosonai/higgs-tts-3-4b", dtype="auto") - Notebooks
- Google Colab
- Kaggle
AGENTS.md — Higgs Audio v3 TTS (4B)
Operational guide for AI coding agents. This file is self-contained: you can act on it even before cloning this repo. For model background, benchmarks, the full language list, and citation, see the model card README — don't duplicate that narrative here.
Higgs Audio v3 TTS is a 4B-parameter, conversational text-to-speech model: expressive, low-latency, 100+ languages, zero-shot voice cloning, and inline control over emotion / prosody / pauses / sound effects mid-utterance.
Step 0 — Pick the right path (read this first)
Choose by constraint, not by habit:
| Goal | Use | Entry point |
|---|---|---|
| Just hear it / try preset voices & avatars | Live Demo | https://boson.ai/workspace/avatar |
| Integrate quickly, no GPU, your own voice | Hosted API | https://docs.boson.ai/models/higgs-audio-tts/overview |
| Data privacy, custom testing, full control (NVIDIA GPU) | Self-host (SGLang-Omni) | https://lmsys.org/blog/2026-06-04-higgs-audio-v3-tts/ |
| Run locally on a Mac (Apple Silicon, no NVIDIA GPU) | Self-host (MLX-Audio) | https://github.com/Blaizzy/mlx-audio |
| Node-based UI / visual workflow | ComfyUI (community) | https://github.com/Saganaki22/Higgs_v3-TTS-ComfyUI |
| Inspect weights / config / tokenizer | Model card (this repo) | https://huggingface.co/bosonai/higgs-audio-v3-tts-4b |
Deep dive on everything: Technical blog → https://boson.ai/blog/higgs-audio-v3-tts
Path A — Hosted API (fastest, no GPU)
Authoritative docs: https://docs.boson.ai/models/higgs-audio-tts/overview Get an API key, full field reference, and Python/TypeScript SDK examples there. An agent cannot invent a key — if
BOSON_API_KEYis unset, stop and point the user to this page.
export BOSON_API_KEY=bai-xxxx # obtain from https://docs.boson.ai (key format: bai-...)
Basic synthesis:
curl https://api.boson.ai/v1/audio/speech \
-H "Authorization: Bearer $BOSON_API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"higgs-audio-v3-tts","input":"Hello, this is a test."}' \
--output out.mp3
Request fields:
| Field | Notes |
|---|---|
model |
"higgs-audio-v3-tts" |
input |
text to synthesize (required) |
voice |
preset speaker, e.g. "jake" |
ref_audio + ref_text |
URL/base64 clip + its transcript → voice cloning |
response_format |
"mp3" (default) or "pcm" (use pcm for low-latency streaming) |
stream |
true for SSE streaming |
Verify exact field names/limits against the API docs before shipping — the hosted API evolves independently of these weights.
Path B — Self-host with SGLang-Omni
B0 — Preflight: confirm hardware first (do this before pulling anything)
Performance numbers are benchmarked on 1× H100 (80 GB). The model is also confirmed to run on 1× A100 40 GB — so ~40 GB VRAM is a known-good floor. Smaller GPUs are untested (no data, not "won't work"). Before deploying:
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv # GPU present? how much VRAM?
docker --version && docker info | grep -i runtime # Docker + NVIDIA runtime ready?
df -h . # disk for the ~4B weights + image
Rules for the agent:
- No NVIDIA GPU → stop this path. On an Apple Silicon Mac, use Path C (MLX-Audio); for a node-based UI, see Path D (ComfyUI); otherwise use Path A (hosted API).
- ≥ 40 GB VRAM (e.g. A100 40 GB, H100) → known-good; proceed.
- 24 GB (e.g. RTX 4090) → reported to work, not officially verified. The ~4B weights fit,
but expect to lower concurrency /
max_new_tokensand watch for OOM at theservestep. - < 24 GB VRAM → untested. It may still run (4B model), but no one has verified it. Warn the
user, and be ready to lower concurrency /
max_new_tokensif you hit OOM at theservestep. - Don't assume a VRAM number — confirm against the SGLang-Omni cookbook / blog before promising a given GPU will work: https://lmsys.org/blog/2026-06-04-higgs-audio-v3-tts/
B1 — Install & serve
# 1. Container
docker pull lmsysorg/sglang-omni:dev
docker run -it --gpus all --shm-size 32g --ipc host --network host --privileged \
lmsysorg/sglang-omni:dev /bin/zsh
# 2. Engine
git clone git@github.com:sgl-project/sglang-omni.git && cd sglang-omni
uv venv .venv -p 3.12 && source .venv/bin/activate
uv pip install -v -e .
# 3. Weights
hf download bosonai/higgs-audio-v3-tts-4b
# 4. Serve (OpenAI-compatible audio endpoint)
sgl-omni serve --model-path bosonai/higgs-audio-v3-tts-4b --port 8000
Call the local server:
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input": "Hello, how are you?"}' \
--output output.wav
Recommended sampling (voice cloning): temperature: 0.8, top_k: 50, max_new_tokens: 1024.
Cookbook reference: https://sgl-project.github.io/sglang-omni/cookbook/higgs_tts.html
Path C — Apple Silicon Mac via MLX-Audio (no NVIDIA GPU)
For Macs there is no CUDA / Docker path — use MLX-Audio, an Apple-MLX-native TTS library that runs the model directly on M-series GPUs: https://github.com/Blaizzy/mlx-audio
Hardware (first-hand, measured): confirmed on an M1 / 32 GB, with a peak memory footprint of only ~9–12 GB — comfortably within reach of typical Apple Silicon laptops, no discrete GPU needed.
pip install mlx-audio # requires Apple Silicon (M1/M2/M3/M4) + macOS
Drive the model through MLX-Audio's CLI / Python API per its README — see
https://github.com/Blaizzy/mlx-audio for the exact generate command and supported flags.
Mac-only. On Linux/NVIDIA use Path B; with no local accelerator at all, use Path A.
Path D — ComfyUI node-based UI (community)
A community integration exposes the model as ComfyUI nodes (text-to-speech in a visual, node-based workflow), with a drag-and-drop workflow file for immediate use:
- Repo: https://github.com/Saganaki22/Higgs_v3-TTS-ComfyUI (by Saganaki22)
Third-party, not maintained by Boson. Follow that repo's README for install/usage, and verify it against the version of the weights you intend to run. Surfaced in the model's HF discussions: https://huggingface.co/bosonai/higgs-audio-v3-tts-4b/discussions/4
Control tags — how to write target text
Embed tags directly in the input text to steer emotion, prosody, style, and sound effects.
Format is always <|category:tag|>, with two placements:
- Sentence-level (emotion / style / prosody speed·pitch·expressive) → put at the sentence start.
- Inline (sfx, and prosody
pause/long_pause) → insert at the exact spot in the sentence. sfxgotcha:<|sfx:cough|>Ahem, ...— tag first, onomatopoeia attached, no space.
<|emotion:elation|>Welcome aboard, we are thrilled to have you here!
<|emotion:elation|><|sfx:laughter|>Haha, welcome, we're so happy you're here!
Hello there <|prosody:pause|> and welcome to the show.
Full 43-tag catalog + rules + examples → PROMPTING.md. Only recognized tags work — anything else degrades output or gets read literally.
For chat formatting, use chat_template.jinja from the model repo (and the API docs);
do not hand-assemble the chat prompt — go through the template.
Language codes
Only the ISO codes listed in README's supported-languages section are reliable. Codes outside that
list fall back / degrade. → see the model card README (## Supported Languages).
Repo contents (what's actually here)
This repo (https://huggingface.co/bosonai/higgs-audio-v3-tts-4b) is weights + config, not an
inference codebase:
config.json,model.safetensors(.index.json)— model weights & shapechat_template.jinja— authoritative prompt/chat formatting; respect ittokenizer.json,tokenizer_config.json— tokenizerREADME.md— HuggingFace model card (capabilities, benchmarks, languages, citation)LICENSE— see red line below
Do / Don't
- ✅ Use
chat_template.jinjafor prompt construction; use the OpenAI-compatible/v1/audio/speechshape. - ✅ Use
pcm+streamfor real-time / conversational latency. - ❌ Don't use commercially. License is research & non-commercial
(
boson-higgs-audio-v3-research-and-non-commercial-license) — see LICENSE. - ❌ Don't hardcode the 100-language claim as "any code works" — validate against the supported list.
Pointers (don't duplicate — link)
All on the model card: https://huggingface.co/bosonai/higgs-audio-v3-tts-4b/blob/main/README.md
- Benchmarks / WER-CER tables → README
## Evaluation Benchmarks - Full language list → README
## Supported Languages - Control-token catalog → README
## Control Tokens - Citation → README
## Citation