---
license: mit
language:
- zh
- en
pipeline_tag: text-to-speech
tags:
- comfyui
- audiodit
- transformers
---

# LongCat-AudioDiT-3.5B: FP8 Weight-Only Quantized (ComfyUI)

**FP8 quantized version of [meituan-longcat/LongCat-AudioDiT-3.5B](https://huggingface.co/meituan-longcat/LongCat-AudioDiT-3.5B).**

[**Original Model**](https://huggingface.co/meituan-longcat/LongCat-AudioDiT-3.5B) | [**Paper**](https://github.com/meituan-longcat/LongCat-AudioDiT/blob/main/LongCat-AudioDiT.pdf) | [**GitHub (Original)**](https://github.com/meituan-longcat/LongCat-AudioDiT) | [**ComfyUI Node**](https://github.com/Saganaki22/ComfyUI-LongCat-AudioDIT-TTS)

---

![Screenshot 2026-03-30 210100](https://cdn-uploads.huggingface.co/production/uploads/63473b59e5c0717e6737b872/oU6qjLAgmxPXEjsCVCXyg.png)

## What is this?

This is a weight-only FP8 quantization of LongCat-AudioDiT-3.5B, a state-of-the-art diffusion-based zero-shot TTS model by Meituan that operates directly in the waveform latent space. The quantization reduces the on-disk size from ~14 GB to ~4 GB and inference VRAM from ~20 GB to ~8 GB, with no perceptible quality loss in practice.

| | Original (3.5B FP32) | This (3.5B FP8) |
|---|---|---|
| **Weight dtype** | float32 | float8_e4m3fn |
| **Activation dtype** | bfloat16 | bfloat16 |
| **Scale** | n/a | per-tensor float32 |
| **File size** | ~14 GB | ~4 GB |
| **VRAM (inference)** | ~20 GB | ~8 GB |
| **Extra dependencies** | none | none |

---

## Quantization Details

**What is quantized:** All linear weight matrices in the DiT transformer backbone. All other weights (VAE, embeddings, layer norms, text encoder) remain in bfloat16.

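As a rough sanity check on the sizes in the table above, the arithmetic can be sketched as follows. This is a back-of-the-envelope estimate, not a measurement: it assumes exactly 3.5e9 parameters (taken from the model name) and decimal gigabytes, and the names `PARAMS`, `BYTES_PER_PARAM`, and `size_gb` are illustrative only.

```python
# Back-of-the-envelope checkpoint sizes for a 3.5B-parameter model.
# Assumptions: exactly 3.5e9 parameters, and 1 GB = 1e9 bytes.
PARAMS = 3.5e9

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float8_e4m3fn": 1}


def size_gb(n_params: float, dtype: str) -> float:
    """Return the storage size in GB if every parameter used `dtype`."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


fp32_gb = size_gb(PARAMS, "float32")       # every weight in float32
bf16_gb = size_gb(PARAMS, "bfloat16")      # every weight in bfloat16
fp8_gb = size_gb(PARAMS, "float8_e4m3fn")  # every weight in FP8 (lower bound)

print(f"float32:         {fp32_gb:.1f} GB")
print(f"bfloat16:        {bf16_gb:.1f} GB")
print(f"fp8 lower bound: {fp8_gb:.1f} GB")
```

Under these assumptions, the ~4 GB file sits just above the all-FP8 lower bound, which is consistent with only the DiT linear weights being stored in FP8 while the remaining weights stay in bfloat16. The per-tensor scales add one float32 value per quantized tensor, which is negligible at this scale.
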
**Method: Per-tensor symmetric FP8**

Each weight matrix is quantized using a per-tensor scale factor stored in `fp8_scales.json`:

````
scale  = max(abs(W_fp32)) / FP8_MAX   # FP8_MAX = 448.0 for float8_e4m3fn
W_fp8  = round(W_fp32 / scale)        # quantize: round to the nearest float8_e4m3fn value
W_bf16 = W_fp8.to(bfloat16) * scale   # dequantize at inference
````

**No external quantization library required.** Dequantization is handled automatically by the [ComfyUI-LongCat-AudioDIT-TTS](https://github.com/Saganaki22/ComfyUI-LongCat-AudioDIT-TTS) node pack at load time. The model is fully compatible with the original `audiodit` inference code once dequantized.

**File layout:**
- `model.safetensors`: FP8 weight tensors
- `fp8_scales.json`: per-tensor float32 scale factors for dequantization

---

## Hardware Requirements

- **GPU:** NVIDIA GPU with CUDA support
- **VRAM:** ~8 GB
- **Native FP8 tensor cores:** Ada Lovelace, Hopper, or Blackwell (RTX 4090, H100, RTX 5090, etc.), recommended for full speed
- **Older GPUs (Ampere and below):** Will load and run correctly. Weights are dequantized to bfloat16 on all hardware, so you still get the reduced VRAM footprint even without native FP8 cores.

---

## Usage: ComfyUI (Recommended)

The easiest way to use this model is with **[ComfyUI-LongCat-AudioDIT-TTS](https://github.com/Saganaki22/ComfyUI-LongCat-AudioDIT-TTS)**, which has native support for this FP8 model with zero extra setup.

### Installation

1. Install the ComfyUI node via **ComfyUI Manager** (search `LongCat-AudioDiT`) or manually:
   ```bash
   cd ComfyUI/custom_nodes
   git clone https://github.com/Saganaki22/ComfyUI-LongCat-AudioDIT-TTS.git
   ```
2. The model **auto-downloads on first use**: select `LongCat-AudioDiT-3.5B-fp8` from the model dropdown in any LongCat node.
3. Or download manually:
   ```bash
   huggingface-cli download drbaph/LongCat-AudioDiT-3.5B-fp8 --local-dir ComfyUI/models/audiodit/LongCat-AudioDiT-3.5B-fp8
   ```

### Available Nodes

- **LongCat AudioDiT TTS**: Zero-shot text-to-speech
- **LongCat AudioDiT Voice Clone TTS**: Voice cloning from reference audio
- **LongCat AudioDiT Multi-Speaker TTS**: Multi-speaker conversation synthesis

### Recommended Settings

- `dtype`: `auto` or `bf16` (FP8 weights are dequantized to BF16 at load time)
- `guidance_method`: `cfg` for TTS, `apg` for voice cloning
- `steps`: `16` (balanced), `32` (higher quality)
- `keep_model_loaded`: `True` for repeated use

---

## About LongCat-AudioDiT

LongCat-AudioDiT is a non-autoregressive diffusion-based TTS model from Meituan that achieves state-of-the-art zero-shot voice cloning performance on the Seed benchmark. Unlike previous methods that rely on mel-spectrograms, it operates directly in the waveform latent space using only a Wav-VAE and a DiT backbone. The 3.5B variant achieves **0.818 SIM on Seed-ZH** and **0.797 SIM on Seed-Hard**, surpassing both open-source and closed-source competitors.

---

## License

This model inherits the [MIT License](https://github.com/meituan-longcat/LongCat-AudioDiT/blob/main/LICENSE) from [meituan-longcat/LongCat-AudioDiT-3.5B](https://huggingface.co/meituan-longcat/LongCat-AudioDiT-3.5B). The FP8 quantization was produced by [drbaph](https://huggingface.co/drbaph) and is released under the same license.