license: mit
language:
  - zh
  - en
pipeline_tag: text-to-speech
tags:
  - comfyui
  - audiodit
  - transformers

LongCat-AudioDiT-3.5B — FP8 Weight-Only Quantized (ComfyUI)

FP8 quantized version of meituan-longcat/LongCat-AudioDiT-3.5B.

Original Model | Paper | GitHub (Original) | ComfyUI Node



What is this?

This is a weight-only FP8 quantization of LongCat-AudioDiT-3.5B, a state-of-the-art diffusion-based zero-shot TTS model by Meituan that operates directly in the waveform latent space. Quantization shrinks the checkpoint from ~14 GB to ~4 GB on disk and lowers inference VRAM from ~20 GB to ~8 GB, with no perceptible quality loss in practice.

| | Original (3.5B FP32) | This (3.5B FP8) |
|---|---|---|
| Weight dtype | float32 | float8_e4m3fn |
| Activation dtype | bfloat16 | bfloat16 |
| Scale | n/a | per-tensor float32 |
| File size | ~14 GB | ~4 GB |
| VRAM (inference) | ~20 GB | ~8 GB |
| Extra dependencies | none | none |
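
The file-size figures can be sanity-checked with simple arithmetic. The sketch below assumes exactly 3.5e9 parameters and ignores file metadata and the tensors that stay in bfloat16, so it is only a rough estimate and not a measurement of the actual checkpoint.

```python
# Back-of-the-envelope weight storage per dtype.
# Assumption: exactly 3.5e9 parameters, all stored in a single dtype.
PARAMS = 3.5e9
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float8_e4m3fn": 1}


def weight_size_gb(dtype: str, params: float = PARAMS) -> float:
    """Approximate storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>14}: ~{weight_size_gb(dtype):.1f} GB")
```

The float8 estimate lands slightly below the ~4 GB in the table, which is plausible given that the non-quantized tensors are still stored in bfloat16.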

Quantization Details

What is quantized: All linear-layer weight matrices in the DiT transformer backbone. Everything else (VAE, embeddings, layer norms, text encoder) remains in bfloat16.

Method: Per-tensor symmetric FP8

Each weight matrix is quantized using a per-tensor scale factor stored in fp8_scales.json:

scale = max(abs(W_fp32)) / FP8_MAX          # FP8_MAX = 448.0 for float8_e4m3fn
W_fp8 = (W_fp32 / scale).to(float8_e4m3fn)  # quantize (rounds to the nearest FP8 value)
W_bf16 = W_fp8.to(bfloat16) * scale         # dequantize at inference

No external quantization library required. Dequantization is handled automatically by the ComfyUI-LongCat-AudioDIT-TTS node pack at load time. The model is fully compatible with the original audiodit inference code once dequantized.
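
As a rough illustration of the formulas above, here is a pure-Python simulation of the quantize/dequantize round trip. It is not the node pack's code: `round_to_e4m3fn`, `quantize`, and `dequantize` are hypothetical helpers, the rounding emulates float8_e4m3fn (4 exponent bits, 3 mantissa bits, maximum 448) with standard-library floats, and the all-zero guard is an added assumption.

```python
import math

FP8_MAX = 448.0  # largest finite float8_e4m3fn value


def round_to_e4m3fn(x: float) -> float:
    """Round x to the nearest float8_e4m3fn value, saturating at +/-FP8_MAX."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), FP8_MAX)
    # frexp gives a = m * 2**ex with 0.5 <= m < 1, so the binade exponent is ex - 1.
    _, ex = math.frexp(a)
    e = max(ex - 1, -6)    # -6 is the smallest normal exponent; smaller values are subnormal
    step = 2.0 ** (e - 3)  # 3 mantissa bits give 8 steps per binade
    return sign * min(round(a / step) * step, FP8_MAX)


def quantize(weights):
    """Per-tensor symmetric quantization of a flat list of floats."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / FP8_MAX if max_abs > 0 else 1.0  # guard for all-zero tensors
    return [round_to_e4m3fn(w / scale) for w in weights], scale


def dequantize(w_fp8, scale):
    """Map simulated FP8 values back to the original range."""
    return [q * scale for q in w_fp8]


weights = [0.02, -0.5, 0.125, 0.9, -0.003]
w_fp8, scale = quantize(weights)
restored = dequantize(w_fp8, scale)
for w, r in zip(weights, restored):
    print(f"{w:+.4f} -> {r:+.4f}")
```

With 3 mantissa bits, a value in the normal range is reproduced to within about 1/16 of its magnitude, and the largest weight in the tensor maps to FP8_MAX and is recovered almost exactly.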

File layout:

  • model.safetensors — FP8 weight tensors
  • fp8_scales.json — per-tensor float32 scale factors for dequantization
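
To check that the two files agree, one option is to read the safetensors header (an 8-byte little-endian length followed by a JSON header) and compare it against the scales file. This is a sketch under assumptions: the card does not document the structure of fp8_scales.json, so a flat `{tensor_name: scale}` mapping is assumed here, and both helper names are hypothetical.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        return json.loads(f.read(header_len).decode("utf-8"))


def fp8_tensors_missing_scales(model_path, scales_path):
    """List FP8 tensors in the checkpoint that have no entry in the scales file.

    Assumes fp8_scales.json is a flat {tensor_name: scale} mapping (not confirmed).
    """
    header = read_safetensors_header(model_path)
    scales = json.loads(Path(scales_path).read_text())
    fp8_names = [
        name
        for name, info in header.items()
        if name != "__metadata__" and info["dtype"] == "F8_E4M3"
    ]
    return [name for name in fp8_names if name not in scales]


# Example run against the manual download location, if it is present.
model_dir = Path("ComfyUI/models/audiodit/LongCat-AudioDiT-3.5B-fp8")
model_file = model_dir / "model.safetensors"
scales_file = model_dir / "fp8_scales.json"
if model_file.exists() and scales_file.exists():
    print(fp8_tensors_missing_scales(model_file, scales_file))
```

An empty list would mean every FP8 tensor has a matching scale. If the real JSON is nested or keyed differently, the lookup needs to be adapted.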

Hardware Requirements

  • GPU: NVIDIA GPU with CUDA support
  • VRAM: ~8 GB
  • Native FP8 tensor cores: Ada Lovelace, Hopper, or Blackwell (RTX 4090, H100, RTX 5090, etc.), recommended for full speed
  • Older GPUs (Ampere and below): Will load and run correctly. Dequantization to bfloat16 happens on all hardware, so you still get the reduced VRAM footprint even without native FP8 cores.
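
To see which category a GPU falls into, a quick check of the CUDA compute capability is usually enough. The sketch below uses the rule of thumb that capability 8.9 (Ada Lovelace) and newer have FP8 tensor cores. `has_native_fp8` and `describe_current_gpu` are hypothetical helpers, and PyTorch is imported only if it is installed.

```python
def has_native_fp8(major: int, minor: int) -> bool:
    """True for CUDA compute capability 8.9 (Ada Lovelace) and newer."""
    return (major, minor) >= (8, 9)


def describe_current_gpu() -> str:
    """Report the first visible CUDA device and whether it has FP8 tensor cores."""
    try:
        import torch  # optional dependency, only needed for the live check
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    major, minor = torch.cuda.get_device_capability()
    name = torch.cuda.get_device_name()
    kind = "native FP8 tensor cores" if has_native_fp8(major, minor) else "no native FP8 tensor cores"
    return f"{name} (sm_{major}{minor}): {kind}"


print(describe_current_gpu())
```

For example, an RTX 3090 reports capability 8.6 and falls into the "older GPUs" case, while an RTX 4090 reports 8.9.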

Usage — ComfyUI (Recommended)

The easiest way to use this model is with ComfyUI-LongCat-AudioDIT-TTS, which has native support for this FP8 model with zero extra setup.

Installation

  1. Install the ComfyUI node via ComfyUI Manager (search LongCat-AudioDiT) or manually:
   cd ComfyUI/custom_nodes
   git clone https://github.com/Saganaki22/ComfyUI-LongCat-AudioDIT-TTS.git
  2. The model auto-downloads on first use. Select LongCat-AudioDiT-3.5B-fp8 from the model dropdown in any LongCat node.

  3. Or download manually:

   huggingface-cli download drbaph/LongCat-AudioDiT-3.5B-fp8 --local-dir ComfyUI/models/audiodit/LongCat-AudioDiT-3.5B-fp8

Available Nodes

  • LongCat AudioDiT TTS — Zero-shot text-to-speech
  • LongCat AudioDiT Voice Clone TTS — Voice cloning from reference audio
  • LongCat AudioDiT Multi-Speaker TTS — Multi-speaker conversation synthesis

Recommended Settings

  • dtype: auto or bf16 — FP8 weights are dequantized to BF16 at load time
  • guidance_method: cfg for TTS, apg for voice cloning
  • steps: 16 (balanced), 32 (higher quality)
  • keep_model_loaded: True for repeated use

About LongCat-AudioDiT

LongCat-AudioDiT is a non-autoregressive diffusion-based TTS model from Meituan that achieves state-of-the-art zero-shot voice cloning performance on the Seed benchmark. Unlike previous methods relying on mel-spectrograms, it operates directly in the waveform latent space using only a Wav-VAE and a DiT backbone.

The 3.5B variant achieves 0.818 SIM on Seed-ZH and 0.797 SIM on Seed-Hard, surpassing both open-source and closed-source competitors.


License

This model inherits the MIT License from meituan-longcat/LongCat-AudioDiT-3.5B.

The FP8 quantization was produced by drbaph and is released under the same license.