Duplicate from drbaph/LongCat-AudioDiT-3.5B-fp8

license: mit
language:
  - zh
  - en
pipeline_tag: text-to-speech
tags:
  - comfyui
  - audiodit
  - transformers

LongCat-AudioDiT-3.5B — FP8 Weight-Only Quantized (ComfyUI)

FP8 quantized version of meituan-longcat/LongCat-AudioDiT-3.5B.

Original Model | Paper | GitHub (Original) | ComfyUI Node

What is this?

This is a weight-only FP8 quantization of LongCat-AudioDiT-3.5B, a state-of-the-art diffusion-based zero-shot TTS model by Meituan that operates directly in the waveform latent space. Quantization cuts the on-disk size from ~14 GB to ~4 GB and reduces inference VRAM usage from ~20 GB to ~8 GB, with no perceptible quality loss in practice.

	Original (3.5B FP32)	This (3.5B FP8)
Weight dtype	float32	float8_e4m3fn
Activation dtype	bfloat16	bfloat16
Scale	—	per-tensor float32
File size	~14 GB	~4 GB
VRAM (inference)	~20 GB	~8 GB
Extra dependencies	none	none
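
The file-size row follows from the number of bytes stored per parameter. A rough check, assuming exactly 3.5 billion parameters and treating every parameter as stored in a single dtype (the real checkpoint keeps some tensors in bfloat16, so it is slightly larger); `weight_gb` is a hypothetical helper name used only for illustration:

```python
# Rough weight-storage estimate; assumes 3.5e9 parameters, all in one dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float8_e4m3fn": 1}

def weight_gb(num_params: float, dtype: str) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>14}: {weight_gb(3.5e9, dtype):.1f} GB")
```

The bfloat16 figure (about 7 GB) is the size the weights would occupy after dequantization, which is consistent with the ~8 GB inference VRAM figure once activations and the non-quantized components are added.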

Quantization Details

What is quantized: All linear weight matrices in the DiT transformer backbone. Everything else (VAE, embeddings, layer norms, text encoder) remains in bfloat16.

Method: Per-tensor symmetric FP8

Each weight matrix is quantized using a per-tensor scale factor stored in fp8_scales.json:

scale  = max(abs(W_fp32)) / FP8_MAX          # FP8_MAX = 448.0 for float8_e4m3fn
W_fp8  = (W_fp32 / scale).to(float8_e4m3fn)  # quantize: the cast rounds to the nearest FP8 value
W_bf16 = W_fp8.to(bfloat16) * scale          # dequantize at load time

No external quantization library required. Dequantization is handled automatically by the ComfyUI-LongCat-AudioDIT-TTS node pack at load time. The model is fully compatible with the original audiodit inference code once dequantized.
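
To make the scale arithmetic concrete, the formulas above can be emulated in pure Python. This is only an illustrative sketch, not the code used to produce this checkpoint or the node pack's loader: `round_to_e4m3`, `quantize`, and `dequantize` are hypothetical names, plain lists stand in for tensors, and the FP8 cast is emulated by rounding to the 3-bit mantissa grid of float8_e4m3fn. In PyTorch the cast itself would be `tensor.to(torch.float8_e4m3fn)`.

```python
import math

FP8_MAX = 448.0  # largest finite float8_e4m3fn value

def round_to_e4m3(x: float) -> float:
    """Round x to the nearest float8_e4m3fn-representable value (emulated)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), FP8_MAX)      # safety clamp; scaled weights stay within range
    _, exp = math.frexp(a)        # a = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)          # -6 is the smallest normal exponent
    step = 2.0 ** (e - 3)         # 3 mantissa bits: 8 steps per power of two
    return sign * round(a / step) * step  # round() breaks ties to even

def quantize(weights):
    """Per-tensor symmetric quantization: returns (fp8_values, scale)."""
    scale = max(abs(w) for w in weights) / FP8_MAX
    if scale == 0.0:              # all-zero tensor: any scale works
        scale = 1.0
    return [round_to_e4m3(w / scale) for w in weights], scale

def dequantize(fp8_values, scale):
    """Multiply the stored FP8 values by the per-tensor scale."""
    return [v * scale for v in fp8_values]

weights = [0.5, -1.0, 0.25, 2.0]
fp8_values, scale = quantize(weights)
restored = dequantize(fp8_values, scale)
print(fp8_values, scale, restored)
```

Because the scale maps the largest-magnitude weight to 448, the full FP8 range is used for every tensor; the trade-off of a single per-tensor scale is that one outlier weight reduces the resolution left for all the others.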

File layout:

model.safetensors — FP8 weight tensors
fp8_scales.json — per-tensor float32 scale factors for dequantization

Hardware Requirements

GPU: NVIDIA GPU with CUDA support
VRAM: ~8 GB
Native FP8 tensor cores: Ada Lovelace, Hopper, or Blackwell (RTX 4090, H100, RTX 5090, etc.); recommended for full speed
Older GPUs (Ampere and below): Will load and run correctly. Dequantization to bfloat16 happens on all hardware, so you still get the reduced VRAM footprint even without native FP8 cores.
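
One way to tell which case applies is to compare the GPU's CUDA compute capability against 8.9, the value reported by Ada Lovelace cards (Hopper reports 9.0, and Blackwell reports 10.0 or higher). A minimal sketch; `has_native_fp8` is a hypothetical helper and not part of the node pack:

```python
def has_native_fp8(capability: tuple[int, int]) -> bool:
    """True if a (major, minor) compute capability includes FP8 tensor cores."""
    return capability >= (8, 9)

# Example capabilities: RTX 3090 is 8.6, RTX 4090 is 8.9, H100 is 9.0.
for name, cap in [("RTX 3090 (Ampere)", (8, 6)),
                  ("RTX 4090 (Ada Lovelace)", (8, 9)),
                  ("H100 (Hopper)", (9, 0))]:
    print(name, has_native_fp8(cap))
```

With PyTorch installed, the tuple for the current device can be obtained from `torch.cuda.get_device_capability()`.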

Usage — ComfyUI (Recommended)

The easiest way to use this model is with ComfyUI-LongCat-AudioDIT-TTS, which has native support for this FP8 model with zero extra setup.

Installation

Install the ComfyUI node via ComfyUI Manager (search LongCat-AudioDiT) or manually:

   cd ComfyUI/custom_nodes
   git clone https://github.com/Saganaki22/ComfyUI-LongCat-AudioDIT-TTS.git

The model auto-downloads on first use — select LongCat-AudioDiT-3.5B-fp8 from the model dropdown in any LongCat node.
Or download manually:

   huggingface-cli download drbaph/LongCat-AudioDiT-3.5B-fp8 --local-dir ComfyUI/models/audiodit/LongCat-AudioDiT-3.5B-fp8

Available Nodes

LongCat AudioDiT TTS — Zero-shot text-to-speech
LongCat AudioDiT Voice Clone TTS — Voice cloning from reference audio
LongCat AudioDiT Multi-Speaker TTS — Multi-speaker conversation synthesis

Recommended Settings

dtype: auto or bf16 — FP8 weights are dequantized to BF16 at load time
guidance_method: cfg for TTS, apg for voice cloning
steps: 16 (balanced), 32 (higher quality)
keep_model_loaded: True for repeated use
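
The settings above can be summarized as a small lookup. This is only a sketch for illustration: the keys mirror the node option names listed above, while the function name and the task labels `"tts"` and `"voice_clone"` are hypothetical and do not correspond to any API of the node pack.

```python
def recommended_settings(task: str, high_quality: bool = False) -> dict:
    """Return the recommended settings for 'tts' or 'voice_clone'."""
    if task not in ("tts", "voice_clone"):
        raise ValueError(f"unknown task: {task!r}")
    return {
        "dtype": "auto",                                  # or "bf16"
        "guidance_method": "cfg" if task == "tts" else "apg",
        "steps": 32 if high_quality else 16,              # 16 balanced, 32 higher quality
        "keep_model_loaded": True,                        # useful for repeated runs
    }

print(recommended_settings("voice_clone", high_quality=True))
```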

About LongCat-AudioDiT

LongCat-AudioDiT is a non-autoregressive diffusion-based TTS model from Meituan that achieves state-of-the-art zero-shot voice cloning performance on the Seed benchmark. Unlike previous methods relying on mel-spectrograms, it operates directly in the waveform latent space using only a Wav-VAE and a DiT backbone.

The 3.5B variant achieves 0.818 SIM on Seed-ZH and 0.797 SIM on Seed-Hard, surpassing both open-source and closed-source competitors.

License

This model inherits the MIT License from meituan-longcat/LongCat-AudioDiT-3.5B.

The FP8 quantization was produced by drbaph and is released under the same license.