Whisper Large v3 Turbo - Audio Captioner (Intermediate Checkpoint)
This is an intermediate checkpoint and is NOT fully trained. It is released to enable further fine-tuning, experimentation, and community collaboration. Expect imperfect outputs, especially on non-speech environmental sounds.
Model Description
This model is a fine-tuned version of openai/whisper-large-v3-turbo (809M parameters) trained for audio captioning — generating natural language descriptions of audio content rather than transcription.
- Architecture: Whisper encoder-decoder (32 encoder layers, 4 decoder layers, d_model=1280)
- Parameters: 809M total (637M encoder, 172M decoder); all parameters were trained (nothing frozen)
- Precision: BF16
- Max audio length: 30 seconds
- Max caption length: 448 tokens
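Because a single pass covers at most 30 seconds of audio, longer recordings need to be split before captioning. The following is a minimal sketch of computing window boundaries, assuming non-overlapping windows and a 16 kHz sample rate; the helper name `window_bounds` is hypothetical and not part of the repository.

```python
def window_bounds(num_samples, sample_rate=16000, window_seconds=30.0):
    """Return (start, end) sample indices of consecutive windows covering a clip."""
    window = int(sample_rate * window_seconds)
    if num_samples < 0 or window < 1:
        raise ValueError("num_samples must be >= 0 and the window at least one sample")
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


# A 70-second clip at 16 kHz gives two full windows and a 10-second remainder.
bounds = window_bounds(70 * 16000)
print(bounds)
```

Each `audio[start:end]` slice can then be captioned on its own. Whether per-window captions combine into a good description of the whole recording is not something this card evaluates.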
Training
Datasets (~3.1M audio-caption pairs)
| Dataset | Pairs | Size | Caption Key |
|---|---|---|---|
| mitermix/audioset-with-grounded-captions | 1,760k | 207 GB | comprehensive_caption |
| laion/captioned-ai-music-snippets | 236k | 49 GB | comprehensive_caption |
| freesound | 372k | 91 GB | caption |
| laion/laions_got_talent_clean_with_captions | 178k | 36 GB | detailed_caption |
| laion/majestrino-data | 327k | 64 GB | detailed_caption |
| TTS-AGI/majestrino-unified-detailed-captions-temporal | 175k | 37 GB | caption |
| synthetic-vocal-bursts | 72k | 23 GB | caption |
Training Phases
The model was trained across multiple phases. This checkpoint is from the final phase:
| Phase | Epochs | LR | Schedule | Warmup | Data | Steps |
|---|---|---|---|---|---|---|
| 1–4 | 1 | 1e-5 | linear | 5% | 895k pairs (subset) | ~154k |
| 5 | 1 | 4e-5 | linear | 5% | 3.13M pairs (full) | 48,836 |
| 6 (this checkpoint) | 3 | 5e-5 | linear | 5% | 3.13M pairs (full) | 146,508 |
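The linear schedule with 5% warmup can be sketched as a function of the training step. This assumes the usual shape (a linear ramp from zero to the peak rate, then a linear decay to zero) and a warmup length equal to the step count times the ratio, rounded up; `linear_lr` is a hypothetical helper, not code from the repository.

```python
import math


def linear_lr(step, total_steps, peak_lr, warmup_ratio=0.05):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Phase 6 values from the table: 146,508 steps, peak LR 5e-5.
total = 146_508
for step in (0, 7_326, 73_254, total):
    print(step, linear_lr(step, total, 5e-5))
```

Under these assumptions the peak rate is reached after 7,326 steps and the rate returns to zero at the final step.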
Training Details
- Hardware: 8x NVIDIA H100 80GB GPUs
- Effective batch size: 64 (8 GPUs x 8 per-device)
- Optimizer: AdamW (weight decay 0.0)
- Gradient norm clipping: 1.0
- Throughput: ~127 samples/second
- Phase 6 duration: ~21.5 hours
- Framework: Hugging Face Transformers 4.57.6 with Seq2SeqTrainer
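As a rough consistency check, the figures in this card can be combined. The sketch below uses only values stated here; it ignores evaluation, checkpointing, and restarts, which is an assumption and may be why the estimate comes out somewhat below the reported duration.

```python
gpus = 8
per_device_batch = 8
effective_batch = gpus * per_device_batch            # 64

phase6_steps = 146_508                                # from the training phases table
samples_processed = phase6_steps * effective_batch    # samples seen in phase 6

throughput = 127                                      # samples/second (approximate)
hours = samples_processed / throughput / 3600

print(effective_batch, samples_processed, round(hours, 1))
```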
Metrics
- Best validation loss: 0.7498 (step 143,797)
- Final training loss: 0.34
- Total samples seen (all phases): ~19M+
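If the validation loss is the mean per-token cross-entropy in nats (the usual convention for Transformers sequence-to-sequence losses, but an assumption here), it converts to a per-token perplexity of about 2.1:

```python
import math

best_val_loss = 0.7498
perplexity = math.exp(best_val_loss)  # exp of mean cross-entropy in nats
print(f"{perplexity:.2f}")
```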
Known Limitations
- Environmental sounds: For non-speech audio (environmental sounds, sound effects), the model can hallucinate content or produce incorrect captions. Speech and music captioning tends to be more reliable.
- Not converged: This is an intermediate checkpoint. Further training with more data and careful data balancing is expected to improve quality significantly.
- Whisper encoder bias: The Whisper encoder was pre-trained on speech, so it has a natural bias toward speech-like features.
Usage
Quick Inference
```python
import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "laion/whisper-large-v3-turbo-audio-captioner"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Load your audio (resampled to 16 kHz mono)
audio, sr = librosa.load("your_audio.wav", sr=16000)

# Process: convert the waveform to model input features
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(model.device, dtype=torch.float16)

# Generate caption
with torch.no_grad():
    predicted_ids = model.generate(
        input_features,
        max_new_tokens=448,
        language="en",
        task="transcribe",
    )

caption = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(caption)
```
Continue Training
The repository includes the full training code used to produce this checkpoint:
- train.py: Main training script (Seq2SeqTrainer-based, supports streaming and local data)
- prefetcher.py: Resilient tar streaming with prefetching for webdataset-style data
- monitor.py: HTTP dashboard for real-time training monitoring (loss curves, eval samples, throughput)
- watchdog.sh: Auto-restart wrapper for training robustness
To continue training:
```bash
# Install dependencies
pip install transformers datasets torch librosa soundfile safetensors accelerate

# Set environment variables and launch
WHISPER_STREAM=1 \
WHISPER_RESUME_FROM=./best \
WHISPER_LR=5e-5 \
WHISPER_LR_SCHEDULE=linear \
WHISPER_WARMUP_RATIO=0.05 \
WHISPER_WEIGHTS_ONLY_RESUME=1 \
torchrun --nproc_per_node=8 train.py
```
Key environment variables:
- WHISPER_RESUME_FROM: Path to checkpoint directory to resume from
- WHISPER_LR: Peak learning rate (default: 5e-4)
- WHISPER_LR_SCHEDULE: linear or cosine (default: cosine)
- WHISPER_MAX_STEPS: Override total training steps
- WHISPER_EPOCHS: Number of epochs (default: 1)
- WHISPER_WARMUP_RATIO: Warmup fraction (default: 0.05)
- WHISPER_WEIGHTS_ONLY_RESUME: Set to 1 to load only weights (fresh optimizer/scheduler)
- WHISPER_PER_DEVICE_BATCH: Per-GPU batch size (default: 8)
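To illustrate how these variables and their defaults fit together, here is a minimal sketch of parsing them into a typed config. It is not the parsing code from train.py, which may behave differently; `TrainConfig` and `load_config` are hypothetical names, and only the variable names and defaults come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class TrainConfig:
    resume_from: Optional[str]
    lr: float
    lr_schedule: str
    max_steps: Optional[int]
    epochs: int
    warmup_ratio: float
    weights_only_resume: bool
    per_device_batch: int


def load_config(env: Mapping[str, str] = os.environ) -> TrainConfig:
    """Read WHISPER_* variables, falling back to the documented defaults."""
    schedule = env.get("WHISPER_LR_SCHEDULE", "cosine")
    if schedule not in ("linear", "cosine"):
        raise ValueError(f"unsupported LR schedule: {schedule!r}")
    max_steps = env.get("WHISPER_MAX_STEPS")
    return TrainConfig(
        resume_from=env.get("WHISPER_RESUME_FROM"),
        lr=float(env.get("WHISPER_LR", "5e-4")),
        lr_schedule=schedule,
        max_steps=int(max_steps) if max_steps else None,
        epochs=int(env.get("WHISPER_EPOCHS", "1")),
        warmup_ratio=float(env.get("WHISPER_WARMUP_RATIO", "0.05")),
        weights_only_resume=env.get("WHISPER_WEIGHTS_ONLY_RESUME") == "1",
        per_device_batch=int(env.get("WHISPER_PER_DEVICE_BATCH", "8")),
    )


# Settings from the launch command, passed as a plain dict for illustration.
config = load_config({
    "WHISPER_RESUME_FROM": "./best",
    "WHISPER_LR": "5e-5",
    "WHISPER_LR_SCHEDULE": "linear",
    "WHISPER_WARMUP_RATIO": "0.05",
    "WHISPER_WEIGHTS_ONLY_RESUME": "1",
})
print(config)
```

Note that the documented default learning rate (5e-4) is ten times the value used for this checkpoint (5e-5), so setting WHISPER_LR explicitly when resuming is probably worthwhile.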
Citation
If you use this model, please cite:
```bibtex
@misc{laion-whisper-audio-captioner-2026,
  title={Whisper Large v3 Turbo Audio Captioner},
  author={LAION},
  year={2026},
  url={https://huggingface.co/laion/whisper-large-v3-turbo-audio-captioner}
}
```
License
Apache 2.0