LongCat-AudioDiT Env-TTS: 5000-step (no-augmentation ablation)

Fine-tune of meituan-longcat/LongCat-AudioDiT-1B for the three-stream env-tts task: given a reference environment audio, a reference speaker audio, and three text streams (env caption / speaker caption / target speech text), generate target speech that places the target text in the referenced environment with the referenced speaker timbre.

Ablation: no augmentation. This checkpoint is trained with spk-audio augmentation (noise + RIR) disabled; otherwise it follows the same recipe as ChristianYang/LongCat-AudioDiT-Env-TTS-1B-8000Step (which trains with noise + RIR augmentation). Compare the two side by side to isolate the effect of spk-reference augmentation.

Differences from the base model

The transformer adds six learnable boundary tokens (three latent-space, three text-space):

latent sequence : [<boe>  z_env  <bos>  z_spk  <bon>  z_target]
text sequence   : [<boe_t> env_text_emb <bos_t> spk_text_emb <bon_t> target_text_emb]

encode_multistream_text(env, spk, target, drop_env_text=…, drop_spk_text=…, drop_target_text=…) is the new entry-point. AudioDiTModel.forward(...) also accepts a pre-assembled prompt_latent (replaces prompt_audio) so the inference path can feed the boundary-tokenized three-stream prompt directly.
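
To make the layout concrete, here is a minimal sketch of how each stream is prefixed by its boundary token and then concatenated. It works on plain Python lists of placeholder labels rather than tensors. The helper name `interleave_with_boundaries` and the placeholder frame labels are hypothetical and not part of the model's code, where the boundary tokens are learned embeddings.

```python
from typing import List, Sequence


def interleave_with_boundaries(
    streams: Sequence[Sequence[str]],
    boundaries: Sequence[str],
) -> List[str]:
    """Prefix each stream with its boundary token and concatenate.

    Hypothetical helper for illustration only; the real model operates on
    embedding tensors, not strings.
    """
    if len(streams) != len(boundaries):
        raise ValueError("need exactly one boundary token per stream")
    sequence: List[str] = []
    for boundary, stream in zip(boundaries, streams):
        sequence.append(boundary)  # boundary token goes first
        sequence.extend(stream)    # then the stream's own frames/tokens
    return sequence


# Placeholder latent frames: 2 env frames, 3 speaker frames, 4 target frames.
z_env = [f"z_env[{i}]" for i in range(2)]
z_spk = [f"z_spk[{i}]" for i in range(3)]
z_target = [f"z_target[{i}]" for i in range(4)]

latent_sequence = interleave_with_boundaries(
    [z_env, z_spk, z_target], ["<boe>", "<bos>", "<bon>"]
)
print(latent_sequence)
```

The text sequence follows the same pattern with `<boe_t>`, `<bos_t>`, `<bon_t>` and the three text embeddings in place of the latents.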

Training summary

| Field | Value |
| --- | --- |
| Steps | 5000 |
| Effective batch | 16 × grad_accum 4 × 1 GPU = 64 rows / step |
| Learning rate | cosine 5e-5 (warmup 250) |
| AdamW | β₁=0.9, β₂=0.999, wd=0.01 |
| EMA | disabled |
| LoRA | r=32, alpha=32, target = attn + ffn |
| Full-train | boundary tokens + AdaLN + text_conv + latent_embed + latent_cond_embedder + input_embed + output_proj + time_embed |
| Audio filter | target duration ∈ [3, 45] s; clip_peak_threshold = 2 (≈ no clipping filter) |
| RMS normalize | three streams, each normalized independently to -23 dBFS (target_rms=0.0708) |
| Augmentation | disabled: no noise / RIR on spk_audio (this is the ablation) |
| Data | ChristianYang/Env-TTS-Clean |
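
The RMS row can be checked by hand: -23 dBFS corresponds to a linear amplitude of 10^(-23/20) ≈ 0.0708, which matches target_rms. The sketch below shows that conversion and a per-stream normalization on toy sample lists. The function names are illustrative assumptions, not the training repo's API, and the sketch applies no peak limiting after scaling.

```python
import math
from typing import Dict, List


def dbfs_to_linear(level_dbfs: float) -> float:
    """Convert a dBFS level to a linear amplitude (full scale = 1.0)."""
    return 10.0 ** (level_dbfs / 20.0)


def rms(samples: List[float]) -> float:
    """Root-mean-square of a non-empty list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def rms_normalize(samples: List[float], target_rms: float, eps: float = 1e-8) -> List[float]:
    """Scale samples so their RMS equals target_rms.

    Empty or (near-)silent input is returned unchanged to avoid dividing by zero.
    """
    if not samples:
        return []
    current = rms(samples)
    if current < eps:
        return list(samples)
    gain = target_rms / current
    return [s * gain for s in samples]


target_rms = dbfs_to_linear(-23.0)
print(round(target_rms, 4))  # 0.0708

# Toy waveforms; each stream gets its own gain (assumed meaning of "independently").
streams: Dict[str, List[float]] = {
    "env": [0.5, -0.5, 0.25, -0.25],
    "spk": [0.01, -0.02, 0.03, -0.01],
    "target": [0.9, -0.8, 0.7, -0.6],
}
normalized = {name: rms_normalize(wave, target_rms) for name, wave in streams.items()}
```

Because each stream is scaled separately, a quiet speaker reference and a loud environment clip end up at the same level before they enter the model.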

Evaluation

Generate-then-score on the held-out test split (n = 995), four-metric suite:

| Metric | Value |
| --- | --- |
| WER (char-level CER, Qwen3-ASR transcription) ↓ | 0.103 |
| Speaker similarity (ReDimNet embedding cosine) ↑ | 0.702 |
| CLAP env-similarity (laion/larger_clap_general audio cosine) ↑ | 0.711 |
| Production-Quality gap (audiobox PQ(pred) − PQ(GT)) ↓ | 0.366 |

For reference, the un-tuned base model on the same rows scores WER 0.638 / spk 0.601 / CLAP 0.541.
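
For readers unfamiliar with the first metric, a character-level error rate is the Levenshtein edit distance between the reference text and the ASR transcript, divided by the reference length. The sketch below is a generic implementation of that definition. It is not the evaluation script used here: any text normalization (case, punctuation, whitespace) and the way scores are aggregated over the 995 rows are not specified above and are left out.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters (insert, delete, substitute = 1 each)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r
                curr[j - 1] + 1,     # insert h
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("abcd", "abxd"))  # 0.25 (one substitution over four characters)
```

Note that this quantity can exceed 1.0 when the transcript is much longer than the reference.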

How to load

The model uses custom code in this repo, so pass trust_remote_code=True:

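# Requires a CUDA-capable GPU; drop .cuda() to stay on CPU.
from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True lets transformers import the custom model class shipped in this repo.
model = AutoModel.from_pretrained(
    "ChristianYang/LongCat-AudioDiT-Env-TTS-1B-ablation",
    trust_remote_code=True,
).cuda().eval()

# The tokenizer belongs to the text encoder named in the model config.
tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder_model)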

For end-to-end env-tts inference (three-stream prompt + ASR fallback for missing env/spk text) see the training repo's tasks/inference.py.

License

Inherits the original meituan-longcat/LongCat-AudioDiT-1B license.
