---
license: mit
language:
- zh
- en
pipeline_tag: text-to-speech
tags:
- comfyui
- audiodit
- transformers
---

# LongCat-AudioDiT-3.5B — FP8 Weight-Only Quantized (ComfyUI)

**FP8 quantized version of [meituan-longcat/LongCat-AudioDiT-3.5B](https://huggingface.co/meituan-longcat/LongCat-AudioDiT-3.5B).**

[**Original Model**](https://huggingface.co/meituan-longcat/LongCat-AudioDiT-3.5B) | [**Paper**](https://github.com/meituan-longcat/LongCat-AudioDiT/blob/main/LongCat-AudioDiT.pdf) | [**GitHub (Original)**](https://github.com/meituan-longcat/LongCat-AudioDiT) | [**ComfyUI Node**](https://github.com/Saganaki22/ComfyUI-LongCat-AudioDIT-TTS)

---


![Screenshot 2026-03-30 210100](https://cdn-uploads.huggingface.co/production/uploads/63473b59e5c0717e6737b872/oU6qjLAgmxPXEjsCVCXyg.png)

## What is this?

This is a weight-only FP8 quantization of LongCat-AudioDiT-3.5B, a state-of-the-art diffusion-based zero-shot TTS model by Meituan that operates directly in the waveform latent space. The quantization reduces the on-disk size from ~14 GB to ~4 GB and inference VRAM from ~20 GB to ~8 GB, with no perceptible quality loss in practice.

| | Original (3.5B FP32) | This (3.5B FP8) |
|---|---|---|
| **Weight dtype** | float32 | float8_e4m3fn |
| **Activation dtype** | bfloat16 | bfloat16 |
| **Scale** | — | per-tensor float32 |
| **File size** | ~14 GB | ~4 GB |
| **VRAM (inference)** | ~20 GB | ~8 GB |
| **Extra dependencies** | none | none |
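
The file sizes in the table follow from the parameter count and bytes per weight. A rough back-of-the-envelope sketch, assuming every one of the 3.5B parameters is stored at a single dtype (the real FP8 checkpoint keeps some tensors in bfloat16, which is why it lands nearer ~4 GB than 3.5 GB):

```python
PARAMS = 3.5e9  # parameter count implied by the model name

def weights_gb(params: float, bytes_per_param: int) -> float:
    """Nominal weight storage in GB (1 GB = 1e9 bytes), ignoring file metadata."""
    return params * bytes_per_param / 1e9

fp32_gb = weights_gb(PARAMS, 4)  # float32: 4 bytes per weight
bf16_gb = weights_gb(PARAMS, 2)  # bfloat16: 2 bytes per weight
fp8_gb = weights_gb(PARAMS, 1)   # float8_e4m3fn: 1 byte per weight

print(f"fp32 ~{fp32_gb} GB, bf16 ~{bf16_gb} GB, fp8 ~{fp8_gb} GB")
```

Inference VRAM is higher than these figures because activations and the non-quantized components also occupy memory.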

---

## Quantization Details

**What is quantized:** All linear weight matrices in the DiT transformer backbone. Everything else (VAE, embeddings, layer norms, text encoder) remains in bfloat16.

**Method: Per-tensor symmetric FP8**

Each weight matrix is quantized using a per-tensor scale factor stored in `fp8_scales.json`:
````
scale  = max(abs(W_fp32)) / FP8_MAX           # FP8_MAX = 448.0, the largest finite float8_e4m3fn value
W_fp8  = (W_fp32 / scale).to(float8_e4m3fn)   # quantize: the cast rounds to the nearest FP8 value
W_bf16 = W_fp8.to(bfloat16) * scale           # dequantize at inference
````
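
To make the round trip concrete, here is a minimal pure-Python sketch of the same scheme. It is not the code used to produce this checkpoint: `round_to_e4m3fn`, `quantize`, and `dequantize` are illustrative names, plain lists stand in for tensors, and the rounding helper emulates a `float8_e4m3fn` cast (round to nearest, ties to even) under the assumption that out-of-range values saturate at ±448.

```python
import math

FP8_MAX = 448.0  # largest finite float8_e4m3fn value

def round_to_e4m3fn(x: float) -> float:
    """Round x to the nearest float8_e4m3fn value (ties to even), saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), FP8_MAX)
    # frexp gives a = m * 2**ex with 0.5 <= m < 1, so floor(log2(a)) == ex - 1.
    # Below the smallest normal exponent (-6), values are subnormal and share one spacing.
    e = max(math.frexp(a)[1] - 1, -6)
    step = 2.0 ** (e - 3)  # 3 mantissa bits -> 8 steps per power of two
    return sign * round(a / step) * step

def quantize(weights):
    amax = max(abs(w) for w in weights)
    scale = amax / FP8_MAX if amax > 0 else 1.0  # all-zero guard is a choice of this sketch
    return [round_to_e4m3fn(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.5, -0.25, 0.1, -0.03]
q, scale = quantize(weights)
restored = dequantize(q, scale)
print(q, restored)
```

With 3 mantissa bits, the relative rounding error for values in the normal range is at most 1/16 (6.25%), which is why the largest-magnitude weight sets the scale: it maps exactly onto ±448 and everything else uses the remaining range.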

**No external quantization library required.** Dequantization is handled automatically by the [ComfyUI-LongCat-AudioDIT-TTS](https://github.com/Saganaki22/ComfyUI-LongCat-AudioDIT-TTS) node pack at load time. The model is fully compatible with the original `audiodit` inference code once dequantized.

**File layout:**
- `model.safetensors` — FP8 weight tensors
- `fp8_scales.json` — per-tensor float32 scale factors for dequantization
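
The two files are meant to be combined at load time. The sketch below shows the idea only: it assumes `fp8_scales.json` maps tensor names to float scales, which this card does not specify, and the tensor names, the `dequantize_all` helper, and the use of plain lists in place of real tensors are all hypothetical. The node pack's actual loader may differ.

```python
import json

# Hypothetical contents of fp8_scales.json: tensor name -> float32 scale.
# The real key names and layout may differ.
scales = json.loads('{"blocks.0.attn.qkv.weight": 0.001953125}')

# Stand-in for tensors read from model.safetensors (lists instead of real tensors).
tensors = {
    "blocks.0.attn.qkv.weight": [448.0, -224.0, 16.0],  # FP8-coded values
    "blocks.0.norm.weight": [1.0, 1.0, 1.0],            # not quantized, no scale entry
}

def dequantize_all(tensors, scales):
    """Multiply quantized tensors by their scale; pass other tensors through unchanged."""
    out = {}
    for name, values in tensors.items():
        scale = scales.get(name)
        out[name] = values if scale is None else [v * scale for v in values]
    return out

restored = dequantize_all(tensors, scales)
print(restored)
```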

---

## Hardware Requirements

- **GPU:** NVIDIA GPU with CUDA support
- **VRAM:** ~8 GB
- **Native FP8 tensor cores:** Ada Lovelace, Hopper, or Blackwell (RTX 4090, H100, RTX 5090, etc.), recommended for full speed
- **Older GPUs (Ampere and below):** Will load and run correctly. Dequantization to bfloat16 happens on all hardware, so you still get the reduced VRAM footprint even without native FP8 cores.
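
If you are unsure which category your GPU falls into, the CUDA compute capability is a quick check. The helper below is a sketch: `has_native_fp8` is a hypothetical name, and the 8.9 threshold reflects that Ada Lovelace reports compute capability 8.9, Hopper 9.0, and Blackwell higher. With PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` pair to pass in.

```python
def has_native_fp8(major: int, minor: int) -> bool:
    """True if the compute capability is Ada Lovelace (8.9) or newer."""
    return (major, minor) >= (8, 9)

# Example capabilities: RTX 3090 is 8.6, RTX 4090 is 8.9, H100 is 9.0.
for name, cap in [("RTX 3090", (8, 6)), ("RTX 4090", (8, 9)), ("H100", (9, 0))]:
    print(name, has_native_fp8(*cap))
```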

---

## Usage — ComfyUI (Recommended)

The easiest way to use this model is with **[ComfyUI-LongCat-AudioDIT-TTS](https://github.com/Saganaki22/ComfyUI-LongCat-AudioDIT-TTS)**, which has native support for this FP8 model with zero extra setup.

### Installation

1. Install the ComfyUI node via **ComfyUI Manager** (search `LongCat-AudioDiT`) or manually:
   ```bash
   cd ComfyUI/custom_nodes
   git clone https://github.com/Saganaki22/ComfyUI-LongCat-AudioDIT-TTS.git
   ```

2. The model **auto-downloads on first use** — select `LongCat-AudioDiT-3.5B-fp8` from the model dropdown in any LongCat node.

3. Or download manually:
   ```bash
   huggingface-cli download drbaph/LongCat-AudioDiT-3.5B-fp8 --local-dir ComfyUI/models/audiodit/LongCat-AudioDiT-3.5B-fp8
   ```

### Available Nodes

- **LongCat AudioDiT TTS** — Zero-shot text-to-speech
- **LongCat AudioDiT Voice Clone TTS** — Voice cloning from reference audio
- **LongCat AudioDiT Multi-Speaker TTS** — Multi-speaker conversation synthesis

### Recommended Settings

- `dtype`: `auto` or `bf16` — FP8 weights are dequantized to BF16 at load time
- `guidance_method`: `cfg` for TTS, `apg` for voice cloning
- `steps`: `16` (balanced), `32` (higher quality)
- `keep_model_loaded`: `True` for repeated use

---

## About LongCat-AudioDiT

LongCat-AudioDiT is a non-autoregressive diffusion-based TTS model from Meituan that achieves state-of-the-art zero-shot voice cloning performance on the Seed benchmark. Unlike previous methods relying on mel-spectrograms, it operates directly in the waveform latent space using only a Wav-VAE and a DiT backbone.

The 3.5B variant achieves **0.818 SIM on Seed-ZH** and **0.797 SIM on Seed-Hard**, surpassing both open-source and closed-source competitors.

---

## License

This model inherits the [MIT License](https://github.com/meituan-longcat/LongCat-AudioDiT/blob/main/LICENSE) from [meituan-longcat/LongCat-AudioDiT-3.5B](https://huggingface.co/meituan-longcat/LongCat-AudioDiT-3.5B).

The FP8 quantization was produced by [drbaph](https://huggingface.co/drbaph) and is released under the same license.