license: mit
language:
- zh
- en
pipeline_tag: text-to-speech
tags:
- comfyui
- audiodit
- transformers
LongCat-AudioDiT-3.5B — FP8 Weight-Only Quantized (ComfyUI)
FP8 quantized version of meituan-longcat/LongCat-AudioDiT-3.5B.
Original Model | Paper | GitHub (Original) | ComfyUI Node
What is this?
This is a weight-only FP8 quantization of LongCat-AudioDiT-3.5B, a state-of-the-art diffusion-based zero-shot TTS model by Meituan that operates directly in the waveform latent space. The quantization cuts the on-disk size from ~14 GB to ~4 GB and reduces inference VRAM usage from ~20 GB to ~8 GB, with no perceptible quality loss in practice.
| | Original (3.5B FP32) | This (3.5B FP8) |
|---|---|---|
| Weight dtype | float32 | float8_e4m3fn |
| Activation dtype | bfloat16 | bfloat16 |
| Scale | — | per-tensor float32 |
| File size | ~14 GB | ~4 GB |
| VRAM (inference) | ~20 GB | ~8 GB |
| Extra dependencies | none | none |
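The file sizes in the table follow roughly from the parameter count times the bytes per weight. The sketch below is only a back-of-the-envelope estimate: it assumes exactly 3.5 billion parameters (taken from the model name) and ignores the components that stay in higher precision, which is presumably why the FP8 file is closer to 4 GB than 3.5 GB.

```python
# Back-of-the-envelope weight storage for a 3.5B-parameter model.
PARAMS = 3.5e9  # approximate parameter count, assumed from the model name

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float8_e4m3fn": 1}


def weight_gb(num_params: float, dtype: str) -> float:
    """Return raw weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>14}: {weight_gb(PARAMS, dtype):.1f} GB")
```

The bfloat16 row is included because the weights are dequantized to bfloat16 before use, so it gives a rough sense of the in-memory weight footprint.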
Quantization Details
What is quantized: All linear weight matrices in the DiT transformer backbone. Weights outside those linear layers (VAE, embeddings, layer norms, text encoder) remain in bfloat16.
Method: Per-tensor symmetric FP8
Each weight matrix is quantized using a per-tensor scale factor stored in fp8_scales.json:
scale = max(abs(W_fp32)) / FP8_MAX # FP8_MAX = 448.0 for float8_e4m3fn
W_fp8 = round(W_fp32 / scale) # quantize: round to the nearest float8_e4m3fn value
W_bf16 = W_fp8.to(bfloat16) * scale # dequantize at inference
No external quantization library required. Dequantization is handled automatically by the ComfyUI-LongCat-AudioDIT-TTS node pack at load time. The model is fully compatible with the original audiodit inference code once dequantized.
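To make the per-tensor recipe above concrete, here is a minimal pure-Python sketch of the round trip. It is not the code used to produce this checkpoint: it simulates float8_e4m3fn rounding (3 mantissa bits, minimum normal exponent -6, saturation at 448) on plain lists so that it runs without any tensor library, and the function names are made up for illustration.

```python
import math

FP8_MAX = 448.0  # largest finite float8_e4m3fn value


def round_to_e4m3fn(x: float) -> float:
    """Round x to the nearest float8_e4m3fn value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), FP8_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so the leading bit is at e - 1.
    # Values below 2**-6 are subnormal and all share the smallest step size.
    exp = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exp - 3)  # spacing of FP8 values in this binade (3 mantissa bits)
    return math.copysign(min(round(mag / step) * step, FP8_MAX), x)


def quantize(weights):
    """Per-tensor symmetric quantization: returns (fp8_values, scale)."""
    scale = max(abs(w) for w in weights) / FP8_MAX
    if scale == 0.0:  # all-zero tensor: any scale works
        scale = 1.0
    return [round_to_e4m3fn(w / scale) for w in weights], scale


def dequantize(fp8_values, scale):
    """Recover approximate weights by multiplying with the stored scale."""
    return [v * scale for v in fp8_values]


if __name__ == "__main__":
    w = [0.02, -0.5, 0.1234, 0.0]
    q, s = quantize(w)
    print(q, s, dequantize(q, s))
```

Because the scale maps the largest magnitude onto 448, that element survives the round trip almost exactly, while the others pick up a relative error of at most about 6% (half a step out of eight steps per binade).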
File layout:
- model.safetensors: FP8 weight tensors
- fp8_scales.json: per-tensor float32 scale factors for dequantization
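Since the two files only work as a pair, a quick sanity check is to confirm that every FP8 tensor has a matching scale. The sketch below assumes fp8_scales.json is a flat JSON object mapping tensor names to scale values; the actual schema is not documented here, and the helper names and tensor names are hypothetical.

```python
import json


def load_scales(path: str) -> dict[str, float]:
    """Read the scale file, ASSUMING a flat {tensor_name: scale} JSON object."""
    with open(path, encoding="utf-8") as f:
        scales = json.load(f)
    return {name: float(value) for name, value in scales.items()}


def missing_scales(fp8_tensor_names: list[str], scales: dict[str, float]) -> list[str]:
    """Return FP8 tensor names that have no usable (positive) scale entry."""
    return [name for name in fp8_tensor_names if scales.get(name, 0.0) <= 0.0]
```

In practice the list of tensor names would come from the safetensors file itself; an empty result from missing_scales means every quantized tensor can be dequantized.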
Hardware Requirements
- GPU: NVIDIA GPU with CUDA support
- VRAM: ~8 GB
- Native FP8 tensor cores: Ada Lovelace, Hopper, or Blackwell (RTX 4090, H100, RTX 5090, etc.), recommended for full speed
- Older GPUs (Ampere and below): Will load and run correctly. Dequantization to bfloat16 happens on all hardware, so you still get the reduced VRAM footprint even without native FP8 cores.
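To see which of the two GPU categories above a card falls into, you can compare its CUDA compute capability against 8.9, the Ada Lovelace value and the lowest one with FP8 tensor cores. This is a small illustrative helper, not part of the node pack; the function name is hypothetical.

```python
# Ada Lovelace reports compute capability 8.9 and Hopper 9.0; Ampere is 8.0/8.6.
FP8_MIN_CAPABILITY = (8, 9)


def has_native_fp8(capability: tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability has FP8 tensor cores."""
    return tuple(capability) >= FP8_MIN_CAPABILITY


if __name__ == "__main__":
    for name, cap in [("RTX 3090", (8, 6)), ("RTX 4090", (8, 9)), ("H100", (9, 0))]:
        print(f"{name}: native FP8 = {has_native_fp8(cap)}")
```

With PyTorch installed, torch.cuda.get_device_capability() returns the (major, minor) tuple for the current device and can be passed straight to this function.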
Usage — ComfyUI (Recommended)
The easiest way to use this model is with ComfyUI-LongCat-AudioDIT-TTS, which has native support for this FP8 model with zero extra setup.
Installation
- Install the ComfyUI node via ComfyUI Manager (search LongCat-AudioDiT) or manually:
cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-LongCat-AudioDIT-TTS.git
- The model auto-downloads on first use: select LongCat-AudioDiT-3.5B-fp8 from the model dropdown in any LongCat node. Or download manually:
huggingface-cli download drbaph/LongCat-AudioDiT-3.5B-fp8 --local-dir ComfyUI/models/audiodit/LongCat-AudioDiT-3.5B-fp8
Available Nodes
- LongCat AudioDiT TTS — Zero-shot text-to-speech
- LongCat AudioDiT Voice Clone TTS — Voice cloning from reference audio
- LongCat AudioDiT Multi-Speaker TTS — Multi-speaker conversation synthesis
Recommended Settings
- dtype: auto or bf16 (FP8 weights are dequantized to BF16 at load time)
- guidance_method: cfg for TTS, apg for voice cloning
- steps: 16 (balanced), 32 (higher quality)
- keep_model_loaded: True for repeated use
About LongCat-AudioDiT
LongCat-AudioDiT is a non-autoregressive diffusion-based TTS model from Meituan that achieves state-of-the-art zero-shot voice cloning performance on the Seed benchmark. Unlike previous methods relying on mel-spectrograms, it operates directly in the waveform latent space using only a Wav-VAE and a DiT backbone.
The 3.5B variant achieves 0.818 SIM on Seed-ZH and 0.797 SIM on Seed-Hard, surpassing both open-source and closed-source competitors.
License
This model inherits the MIT License from meituan-longcat/LongCat-AudioDiT-3.5B.
The FP8 quantization was produced by drbaph and is released under the same license.
