Model is great but without timestamps at word level?

#19

by repjhonblow - opened Mar 30

Mar 30

I'm working with a team on a product for meeting transcription and summarization, with a word-level split per speaker.
I tried the model and it works great, but without any timing information, let alone word-level timestamps, we cannot use it.
Is there any way for us to extrapolate that kind of information?

grimavatar

Mar 30

< https://huggingface.co/CohereLabs/cohere-transcribe-03-2026/blob/main/modeling_cohere_asr.py

    def build_prompt(self, language: str, punctuation: bool = True) -> str:
        """Build the decoder prompt prefix for language and punctuation settings."""
        pnc_token = "<|pnc|>" if punctuation else "<|nopnc|>"
        task_token = "<|noitn|>"
        return (
            "<|startofcontext|><|startoftranscript|><|emo:undefined|>"
            f"<|{language}|><|{language}|>{pnc_token}{task_token}<|notimestamp|><|nodiarize|>"
        )

< https://huggingface.co/CohereLabs/cohere-transcribe-03-2026/blob/main/special_tokens_map.json

...
    "<|timestamp|>",
    "<|notimestamp|>",
    "<|diarize|>",
    "<|nodiarize|>",
...

This hints that either the model already supports it but it has not been implemented yet, or that it is something they plan to add in the next iteration.

One thing I'm also looking forward to is hotwords like in Vibevoice ASR. No matter how good ASR models get, they will always struggle with specific terms such as names, product titles, or niche vocabulary. Hotwords are essential for making these systems truly reliable in real world use.

Congratulations to the team for such amazing work.

stri8ted

Mar 30

I can confirm the current model does not support timestamps. I replaced <|notimestamp|> with <|timestamp|> in the prompt, but the resulting output was the same.
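
For anyone repeating the experiment, the swap amounts to changing the last two control tokens of the prompt quoted earlier in the thread. Below is a minimal sketch modeled on that build_prompt snippet. The function name build_prompt_variant and the timestamps/diarize parameters are hypothetical and not part of the released code, and, per the result above, switching the token does not make the current model emit timestamps.

```python
def build_prompt_variant(language: str, punctuation: bool = True,
                         timestamps: bool = False, diarize: bool = False) -> str:
    """Hypothetical variant of build_prompt with switchable control tokens."""
    pnc_token = "<|pnc|>" if punctuation else "<|nopnc|>"
    # These two tokens exist in special_tokens_map.json; whether the model
    # reacts to them is exactly what the experiment checks.
    ts_token = "<|timestamp|>" if timestamps else "<|notimestamp|>"
    dia_token = "<|diarize|>" if diarize else "<|nodiarize|>"
    return (
        "<|startofcontext|><|startoftranscript|><|emo:undefined|>"
        f"<|{language}|><|{language}|>{pnc_token}<|noitn|>{ts_token}{dia_token}"
    )


print(build_prompt_variant("en", timestamps=True))
```

With the defaults this reproduces the original prompt string, so the only difference between the two runs is the token under test.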

grimavatar

Mar 31

Can you tell me how you ran this experiment? Where exactly did you apply the changes?

I ask because the current "modeling_cohere_asr.py" in the Transformers GitHub repository is not the same as the one in this repo. So downloading the repo, making the changes, and then running the example code would still use the version from Transformers, which does not mention the "build_prompt" method.

stri8ted

Mar 31

I modified the installed Transformers file directly after pip install. The file is

site-packages/transformers/models/cohere_asr/processing_cohere_asr.py

the method is get_decoder_prompt_ids
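
Since the question was which copy of the code actually runs, a quick way to check is to ask Python where it would load a module from. This is a small sketch using only the standard library; the helper name module_source_path is made up for illustration, and the dotted module name in the comment is simply derived from the file path above.

```python
import importlib.util
from typing import Optional


def module_source_path(module_name: str) -> Optional[str]:
    """Return the file Python would load for module_name, or None if not found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ImportError:
        # Raised when a parent package of a dotted name is missing.
        return None
    return None if spec is None else spec.origin


# Demo with a standard-library module so the snippet runs anywhere.
print(module_source_path("json"))
# For the case discussed here one would pass (assuming Transformers is installed):
#   "transformers.models.cohere_asr.processing_cohere_asr"
```

If the printed path points into site-packages, edits made to a downloaded copy of the model repo will not be picked up by that import.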

grimavatar

Mar 31

Thanks a lot. I haven't had much time to play with it yet, except for brainstorming ideas in my head lol

Did you try changing <|nodiarize|> to <|diarize|> as well?

Side question, what did you think of the model? I don't currently have a use for local ASR in my workflow. When I need it, I usually just drop the audio into the ElevenLabs free STT homepage, but this one really surprised me. Parakeet v3 seems more accurate with the right terminology, but it misses a lot of speech, while this model captures almost everything.

repjhonblow

Mar 31

I ran a test using VAD + this model + word-level timestamps on 2 real Zoom meetings, one 20 minutes long and one 1 hour 30 minutes long, and it works surprisingly well. If anybody wants to test it, feel free.
https://github.com/repka3/cohere-transcribe-vad-word-timestamps

julianmack

Cohere Labs org Mar 31

Nice @repjhonblow 🙌
Yes, we don't have timestamps in this version, but it is something we are actively working on for a future version.

cstr

Apr 7

Here is an attempt at this in C++: CrispStrobe/cohere-whisper.cpp.

Since the model itself can't be coaxed into producing precise timestamps, I added a separate path that uses an external CTC model purely for alignment. This is the same forced-alignment idea behind Montreal Forced Aligner, ctc-segmentation, and ctc-forced-aligner. It brings word MAE down to the 30–50 ms range.
Pipeline: Cohere transcribes → words are extracted from the per-token result → the same audio is encoded by a Wav2Vec2ForCTC model running on ggml → a constrained Viterbi DP forces an alignment of the transcript onto the per-frame CTC logits → each word gets its actual [t0 → t1]. The wav2vec2 ggml inference is adapted from nabil6391/wav2vec2.cpp (MIT), split so that the logits are exposed as a [T × V] matrix and then consumed by an in-tree Viterbi aligner (src/align.cpp, ~250 lines, log-space DP with CTC blank-padded label sequences).
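
To make the alignment step concrete, here is a rough Python sketch of a log-space Viterbi DP over a blank-padded CTC label sequence. It is not the code in src/align.cpp, only an illustration of the general technique: the function name ctc_forced_align, the toy log-probabilities, and the choice of blank id 0 are all assumptions made for the example.

```python
import math

NEG_INF = -math.inf


def ctc_forced_align(log_probs, labels, blank=0):
    """Return (first_frame, last_frame) per label from the best CTC path.

    log_probs: T rows of V per-frame log-probabilities.
    labels:    token ids of the known transcript, in order.
    """
    T = len(log_probs)
    if T == 0:
        raise ValueError("no frames to align")
    # Blank-padded state sequence: blank, l1, blank, l2, ..., lN, blank.
    ext = [blank]
    for lab in labels:
        ext += [lab, blank]
    S = len(ext)
    dp = [[NEG_INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    dp[0][0] = log_probs[0][ext[0]]
    if S > 1:
        dp[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [(dp[t - 1][s], s)]                  # stay in the same state
            if s >= 1:
                cands.append((dp[t - 1][s - 1], s - 1))  # advance by one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((dp[t - 1][s - 2], s - 2))  # skip a blank
            best, prev = max(cands)
            if best > NEG_INF:
                dp[t][s] = best + log_probs[t][ext[s]]
                back[t][s] = prev
    # A valid path ends in the final blank or the final label.
    s = S - 1
    if S > 1 and dp[T - 1][S - 2] > dp[T - 1][S - 1]:
        s = S - 2
    if dp[T - 1][s] == NEG_INF:
        raise ValueError("too few frames for this transcript")
    path = [0] * T
    for t in range(T - 1, -1, -1):
        path[t] = s
        s = back[t][s]
    spans = []
    for i in range(len(labels)):
        frames = [t for t, st in enumerate(path) if st == 2 * i + 1]
        spans.append((frames[0], frames[-1]))
    return spans


# Toy input: 6 frames, vocabulary {0: blank, 1: "a", 2: "b"}.
peaks = [0, 1, 1, 0, 2, 0]
toy = [[math.log(0.9 if v == p else 0.05) for v in range(3)] for p in peaks]
print(ctc_forced_align(toy, [1, 2]))
```

Frame indices become times by multiplying with the CTC model's frame stride, and word boundaries follow from the first and last character span of each word.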

Build:

cmake --build build -j$(nproc) --target cohere-align

One-time CTC model setup (the script lives in models/):

pip install gguf transformers torch huggingface_hub
python -c "from huggingface_hub import snapshot_download; \
  print(snapshot_download('jonatasgrosman/wav2vec2-large-xlsr-53-english'))"
python models/convert-wav2vec2-to-gguf.py \
    --model-dir <snapshot-path> --output wav2vec2-xlsr-en.gguf

The jonatasgrosman/wav2vec2-large-xlsr-53-* series covers all 14 Cohere-supported languages (-english, -french, -german, -spanish, …). Per-language fine-tunes give the best alignment.

Run:

./build/bin/cohere-align \
    -m  cohere-transcribe-q4_k.gguf \
    -cw wav2vec2-xlsr-en.gguf \
    -f  samples/jfk.wav -osrt

Without CTC, this cannot be done reliably, but I also toyed with three basic standalone paths:

Segment-level — cohere-main -ts. One timestamp per ~30 s audio chunk. Accurate, because the boundaries come from the audio chunking, not the model. Might be OK for "where in the file is the speech" but useless e.g. for subtitles.
VAD-level — cohere-main -ts -vad-model ggml-silero-vad.bin. One timestamp per Silero VAD speech segment. Accurate, because the boundaries are real silence gaps. May suffice for SRT/VTT at sentence granularity.
Token-level via cross-attention DTW — cohere-main -ts -ml 1: One timestamp per word, derived from the decoder's cross-attention. Not accurate — usable only as a vague approximation. At each autoregressive decode step we capture softmax(CQ @ CK^T) from the last decoder layer (the attention vector over encoder frames) and store it per generated token. After decoding, a prefix-max DTW over the [n_tokens × T_enc] matrix finds the globally optimal monotone path mapping each token to a frame. Frame × 8 cs = timestamp. Two corrections on top (subtract the per-frame mean across all tokens before the DP, and one-step offset).

Also quickly checked the 8 last-layer heads — each has 0.52–0.76 s MAE individually; averaging is the best I could quickly do (0.36 s MAE, with 22% of words >0.5 s off). BUT:

Content words with strong phonetic identity ("Americans", "country", "not") land within ~200 ms.
Function words and repeats ("and", "my", "fellow") drift — the attention just doesn't encode a clear temporal position for them.
Per-head selection à la Whisper doesn't help as the model seemingly wasn't trained for alignment, so no head has a clean time signal.
