--- license: apache-2.0 language: - en - ar - de - el - es - fr - it - ja - ko - nl - pl - pt - vi - zh pipeline_tag: automatic-speech-recognition tags: - audio - speech-recognition - transcription - diarization - speaker-diarization - timestamps library_name: transformers base_model: CohereLabs/cohere-transcribe-03-2026 --- # Cohere Transcribe — Diarize + Timestamps (English) **Built by [syv.ai](https://syv.ai)** — a Danish AI company focused on shipping practical speech and language models. We release open-weights speech models so teams can build on top without leaving their own infrastructure. This model is `CohereLabs/cohere-transcribe-03-2026` fine-tuned to also emit **speaker labels** and **word-aligned timestamps** in a single decoder pass, while preserving the base model's transcription quality. It's a drop-in replacement when you need to know *who said what and when* on short-form audio (≤ 30 s), and pairs with our [`diarize_long_vllm`](#long-form-audio-30-s) helper for arbitrary-length recordings. > **Recommended deployment: vLLM** — see [Serving with vLLM](#serving-with-vllm-recommended). We measured **44× real-time end-to-end** on a 10-min clip with one RTX 3090 (decode 113× RTF, embed 16 seg/s), and **249× peak throughput** under concurrent load. Transformers works too and is shown first for a minimal example, but the vLLM path is what we run in production. **WE ARE LOOKING FOR COMPUTE PARTNERS TO FURTHER IMPROVE OUR MODELS - REACH OUT IF YOU CAN HELP**
| Name | cohere-transcribe-diarize |
|---|---|
| Base model | CohereLabs/cohere-transcribe-03-2026 (Apache 2.0, 2 B params) |
| Architecture | conformer-based encoder–decoder, full fine-tune (no LoRA) |
| Input | audio waveform (16 kHz mono, resampled automatically). Maximum supported clip length: 30 s — longer audio should be processed with sliding windows (see below) |
| Output | special-token stream interleaving speaker IDs, timestamps, and transcribed text, e.g. <|spltoken0|><|t:0.0|> Welcome back to the show.<|t:2.4|><|spltoken1|><|t:2.4|> Thanks for having me.<|t:3.8|> |
| Vocabulary extensions | 8 speaker tokens (<|spltoken0|>…<|spltoken7|>) + 300 timestamp tokens at 100 ms resolution (<|t:0.0|>…<|t:29.9|>) |
| Languages |
Primary: English (the diarization + timestamp fine-tune was done exclusively on English supervision). Likely usable (untested by us): the other 13 languages the Cohere Transcribe base supports — Arabic, German, Greek, Spanish, French, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Vietnamese, Chinese (Mandarin). The base model's multilingual transcription weights are preserved, and the diarization head conditions on language-agnostic speaker acoustics, so segmentation and speaker IDs should transfer; word-level timestamp accuracy will be best on English. Pass the matching language code in the prompt ( <|de|>, <|fr|>, …) to switch.
|
| License | Apache 2.0 (inherited from base) |