LongCat-AudioDiT Env-TTS: 5000-step (no-augmentation ablation)
Fine-tune of meituan-longcat/LongCat-AudioDiT-1B for the three-stream env-tts task: given a reference environment audio, a reference speaker audio, and three text streams (env caption / speaker caption / target speech text), generate target speech that places the target text in the referenced environment with the referenced speaker timbre.
Ablation: no augmentation. This checkpoint was trained with spk-audio augmentation (noise + RIR) disabled; otherwise it follows the same recipe as ChristianYang/LongCat-AudioDiT-Env-TTS-1B-8000Step (which trains with noise + RIR augmentation). Compare the two side by side to isolate the effect of spk-reference augmentation.
Differences from the base model
The transformer adds six learnable boundary tokens (three latent-space, three text-space):
```
latent sequence : [<boe>   z_env        <bos>   z_spk        <bon>   z_target]
text sequence   : [<boe_t> env_text_emb <bos_t> spk_text_emb <bon_t> target_text_emb]
```
encode_multistream_text(env, spk, target, drop_env_text=…, drop_spk_text=…, drop_target_text=…) is the new entry point for text encoding. AudioDiTModel.forward(...) also accepts a pre-assembled prompt_latent in place of prompt_audio, so the inference path can feed the boundary-tokenized three-stream prompt directly.
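The layout above can be illustrated with a toy sketch. It is not the repo's implementation: the real model works on tensors inside its custom code, and `assemble_stream`, the 2-dimensional vectors, and the placeholder values for the boundary tokens and target frames are all assumptions made for illustration. The sketch only shows how one boundary token precedes each of the three segments and where each segment ends up in the flat sequence.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def assemble_stream(
    boundaries: Sequence[Vector],
    segments: Sequence[Sequence[Vector]],
    names: Sequence[str] = ("env", "spk", "target"),
) -> Tuple[List[Vector], Dict[str, Tuple[int, int]]]:
    """Place one boundary token before each segment (hypothetical helper).

    Returns the flat sequence and, per segment name, the half-open
    [start, end) index range that segment occupies in the sequence.
    """
    if not (len(boundaries) == len(segments) == len(names)):
        raise ValueError("need exactly one boundary token per segment")
    sequence: List[Vector] = []
    spans: Dict[str, Tuple[int, int]] = {}
    for name, boundary, segment in zip(names, boundaries, segments):
        sequence.append(list(boundary))  # <boe>, <bos> or <bon>
        start = len(sequence)
        sequence.extend(list(frame) for frame in segment)
        spans[name] = (start, len(sequence))
    return sequence, spans


# Toy data: stand-in values for the learnable tokens and the latents.
dim = 2
boe, bos, bon = [1.0] * dim, [2.0] * dim, [3.0] * dim
z_env = [[0.1] * dim for _ in range(4)]     # 4 environment frames
z_spk = [[0.2] * dim for _ in range(3)]     # 3 speaker frames
z_target = [[0.0] * dim for _ in range(5)]  # 5 placeholder target frames

latent_seq, latent_spans = assemble_stream([boe, bos, bon], [z_env, z_spk, z_target])
print(len(latent_seq))  # 15 = 3 boundary tokens + 4 + 3 + 5 frames
print(latent_spans)     # {'env': (1, 5), 'spk': (6, 9), 'target': (10, 15)}
```

The text sequence follows the same pattern with `<boe_t>`, `<bos_t>`, `<bon_t>` and the three text embeddings in place of the latents.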
Training summary
| Field | Value |
|---|---|
| Steps | 5000 |
| Effective batch | 16 × grad_accum 4 × 1 GPU = 64 rows / step |
| Learning rate | cosine 5e-5 (warmup 250) |
| AdamW | β₁=0.9, β₂=0.999, wd=0.01 |
| EMA | disabled |
| LoRA | r=32, alpha=32, target = attn + ffn |
| Full-train | boundary tokens + AdaLN + text_conv + latent_embed + latent_cond_embedder + input_embed + output_proj + time_embed |
| Audio filter | target duration ∈ [3, 45] s; clip_peak_threshold = 2 (effectively no clipping filter) |
| RMS normalize | each of the three streams independently, to -23 dBFS (target_rms=0.0708) |
| Augmentation | disabled: no noise / RIR on spk_audio (this is the ablation) |
| Data | ChristianYang/Env-TTS-Clean |
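A few of the numbers in the table can be checked with a short sketch. The effective batch and the dBFS-to-RMS conversion follow directly from the table. The schedule shape is an assumption: the table only says "cosine 5e-5 (warmup 250)", so linear warmup and decay to zero are guesses, and all function names here are hypothetical rather than taken from the training repo.

```python
import math

MICRO_BATCH, GRAD_ACCUM, NUM_GPUS = 16, 4, 1
PEAK_LR, WARMUP_STEPS, TOTAL_STEPS = 5e-5, 250, 5000
TARGET_DBFS = -23.0


def effective_batch(micro_batch: int, grad_accum: int, num_gpus: int) -> int:
    """Rows contributing to one optimizer step."""
    return micro_batch * grad_accum * num_gpus


def lr_at(step: int, peak: float = PEAK_LR, warmup: int = WARMUP_STEPS,
          total: int = TOTAL_STEPS) -> float:
    """Linear warmup to `peak`, then cosine decay to 0 (both assumed)."""
    if step < warmup:
        return peak * step / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))


def dbfs_to_rms(dbfs: float) -> float:
    """Linear RMS amplitude for a level in dB relative to full scale 1.0."""
    return 10.0 ** (dbfs / 20.0)


def rms_normalize(samples, target_rms):
    """Scale one stream to `target_rms`; empty or silent input is returned unchanged."""
    if not samples:
        return []
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    if rms == 0.0:
        return list(samples)
    gain = target_rms / rms
    return [x * gain for x in samples]


print(effective_batch(MICRO_BATCH, GRAD_ACCUM, NUM_GPUS))  # 64
print(round(dbfs_to_rms(TARGET_DBFS), 4))                  # 0.0708
print(lr_at(125), lr_at(250), lr_at(5000))                 # mid-warmup, peak, end
```

Normalizing each stream independently means the env, spk, and target audio each get their own gain, so a quiet environment recording and a loud speaker recording both land at the same RMS level before encoding.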
Evaluation
Generate-then-score on the held-out test split (n = 995), four-metric suite:
| Metric | Value |
|---|---|
| WER (char-level CER, Qwen3-ASR transcription) ↓ | 0.103 |
| Speaker similarity (ReDimNet embedding cosine) ↑ | 0.702 |
| CLAP env-similarity (laion/larger_clap_general audio cosine) ↑ | 0.711 |
| Production-Quality gap (audiobox PQ(pred) − PQ(GT)) | 0.366 |
For reference, the un-tuned base model on the same rows scores WER 0.638 / spk 0.601 / CLAP 0.541.
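For readers unfamiliar with the two metric families, here is a minimal sketch of a character-level error rate and an embedding cosine similarity. It is not the scoring code used for the table: the text normalization, the ASR step, and the ReDimNet and CLAP embedding extraction are not shown, and the function names are hypothetical.

```python
import math
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance over substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character edits divided by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


# One wrong character out of 11 in the reference.
print(char_error_rate("hello world", "hallo world"))
# Identical directions give 1.0, orthogonal directions give 0.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]), cosine_similarity([1.0, 0.0], [0.0, 3.0]))
```

In a generate-then-score setup of this kind, the hypothesis would be the ASR transcript of the generated audio, and the vectors would be embeddings of the generated and reference audio.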
How to load
The model uses custom code in this repo, so pass trust_remote_code=True:
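```python
from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True is required because the model class lives in this repo.
model = AutoModel.from_pretrained(
    "ChristianYang/LongCat-AudioDiT-Env-TTS-1B-ablation",
    trust_remote_code=True,
).cuda().eval()  # .cuda() needs a CUDA-capable GPU

# The text encoder's tokenizer is named in the model config.
tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder_model)
```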
For end-to-end env-tts inference (three-stream prompt + ASR fallback for missing
env/spk text) see the training repo's tasks/inference.py.
License
Inherits the original meituan-longcat/LongCat-AudioDiT-1B license.