@@ -18,7 +18,9 @@ Choose by constraint, not by habit:
 |------|-----|-------------|
 | Just hear it / try preset voices & avatars | **Live Demo** | https://boson.ai/workspace/avatar |
 | Integrate quickly, no GPU, your own voice | **Hosted API** | https://docs.boson.ai/models/higgs-audio-tts/overview |
-| Data privacy, custom testing, full control | **Self-host (SGLang-Omni)** | https://lmsys.org/blog/2026-06-04-higgs-audio-v3-tts/ |
 | Inspect weights / config / tokenizer | **Model card (this repo)** | https://huggingface.co/bosonai/higgs-audio-v3-tts-4b |
 Deep dive on everything: **Technical blog** → https://boson.ai/blog/higgs-audio-v3-tts
@@ -76,7 +78,8 @@ df -h .                                                                 # disk f
 ```
 Rules for the agent:
-- **No NVIDIA GPU** → stop. Self-host is not viable; steer the user to **Path A (hosted API)**.
 - **≥ 40 GB VRAM (e.g. A100 40 GB, H100)** → known-good; proceed.
 - **24 GB (e.g. RTX 4090)** → *reported* to work, **not officially verified**. The ~4B weights fit,
   but expect to lower concurrency / `max_new_tokens` and watch for OOM at the `serve` step.
@@ -120,6 +123,38 @@ Cookbook reference: https://sgl-project.github.io/sglang-omni/cookbook/higgs_tts
 ---
 ## Control tags — how to write target text
 Embed tags directly in the `input` text to steer emotion, prosody, style, and sound effects.

 |------|-----|-------------|
 | Just hear it / try preset voices & avatars | **Live Demo** | https://boson.ai/workspace/avatar |
 | Integrate quickly, no GPU, your own voice | **Hosted API** | https://docs.boson.ai/models/higgs-audio-tts/overview |
+| Data privacy, custom testing, full control (NVIDIA GPU) | **Self-host (SGLang-Omni)** | https://lmsys.org/blog/2026-06-04-higgs-audio-v3-tts/ |
+| Run locally on a Mac (Apple Silicon, no NVIDIA GPU) | **Self-host (MLX-Audio)** | https://github.com/Blaizzy/mlx-audio |
+| Node-based UI / visual workflow | **ComfyUI (community)** | https://github.com/Saganaki22/Higgs_v3-TTS-ComfyUI |
 | Inspect weights / config / tokenizer | **Model card (this repo)** | https://huggingface.co/bosonai/higgs-audio-v3-tts-4b |
 Deep dive on everything: **Technical blog** → https://boson.ai/blog/higgs-audio-v3-tts
 ```
 Rules for the agent:
+- **No NVIDIA GPU** → stop this path. On an **Apple Silicon Mac**, use **Path C (MLX-Audio)**;
+  for a node-based UI, see **Path D (ComfyUI)**; otherwise use **Path A (hosted API)**.
 - **≥ 40 GB VRAM (e.g. A100 40 GB, H100)** → known-good; proceed.
 - **24 GB (e.g. RTX 4090)** → *reported* to work, **not officially verified**. The ~4B weights fit,
   but expect to lower concurrency / `max_new_tokens` and watch for OOM at the `serve` step.
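
The routing rules above can be sketched as a small shell function. This is a minimal sketch, not part of the repository: `choose_path` and its output labels are hypothetical names, and the MiB cut-offs are assumptions set slightly below the nominal 24 GB and 40 GB, because `nvidia-smi` reports a little under the marketed size (an RTX 4090 shows roughly 24564 MiB). Cards below 24 GB are not covered by the rules shown here, so the sketch reports them as unspecified rather than guessing. Path D (ComfyUI) is a UI preference, not a hardware tier, so it is left out.

```bash
# Hypothetical helper: map (OS, CPU arch, total NVIDIA VRAM in MiB) to a path label.
# Thresholds are assumptions, set just under 40 GiB and 24 GiB (see note above).
choose_path() {
  local os="$1" arch="$2" vram_mib="${3:-0}"   # empty or missing VRAM means "no NVIDIA GPU"
  if [ "$vram_mib" -ge 40000 ]; then
    echo "B-known-good"          # >= 40 GB: known-good, proceed with self-host
  elif [ "$vram_mib" -ge 24000 ]; then
    echo "B-unverified"          # 24 GB class: reported to work, not officially verified
  elif [ "$vram_mib" -gt 0 ]; then
    echo "unspecified"           # smaller NVIDIA card: not covered by the rules above
  elif [ "$os" = "Darwin" ] && [ "$arch" = "arm64" ]; then
    echo "C"                     # Apple Silicon Mac: MLX-Audio
  else
    echo "A"                     # no local accelerator: hosted API
  fi
}

# Example wiring (commented out; assumes nvidia-smi is on PATH when a GPU exists):
# choose_path "$(uname -s)" "$(uname -m)" \
#   "$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits 2>/dev/null | head -n1)"
```

On a multi-GPU host the commented wiring only looks at the first device; whether that is the right choice depends on how the server is launched.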
 ---
+## Path C — Apple Silicon Mac via MLX-Audio (no NVIDIA GPU)
+For Macs there is **no CUDA / Docker path** — use **MLX-Audio**, an Apple-MLX-native TTS library
+that runs the model directly on M-series GPUs: https://github.com/Blaizzy/mlx-audio
+**Hardware (first-hand, measured):** confirmed on an **M1 / 32 GB**, with a peak memory footprint of
+**~9–12 GB**; no discrete GPU is needed.
+```bash
+pip install mlx-audio          # requires Apple Silicon (M1/M2/M3/M4) + macOS
+```
+Drive the model through MLX-Audio's CLI / Python API per its README — see
+https://github.com/Blaizzy/mlx-audio for the exact `generate` command and supported flags.
+> Mac-only. On Linux/NVIDIA use **Path B**; with no local accelerator at all, use **Path A**.
+---
+## Path D — ComfyUI node-based UI (community)
+A community integration exposes the model as ComfyUI nodes (text-to-speech in a visual,
+node-based workflow), with a drag-and-drop workflow file for immediate use:
+- **Repo:** https://github.com/Saganaki22/Higgs_v3-TTS-ComfyUI (by Saganaki22)
+> **Third-party, not maintained by Boson.** Follow that repo's README for install/usage, and verify
+> it against the version of the weights you intend to run. Surfaced in the model's HF discussions:
+> https://huggingface.co/bosonai/higgs-audio-v3-tts-4b/discussions/4
+---
 ## Control tags — how to write target text
 Embed tags directly in the `input` text to steer emotion, prosody, style, and sound effects.