Add test set results (WER/CER) and improve metadata

6c413e4 verified about 2 months ago

4.17 kB

language:
  - nl
license: cc-by-4.0
library_name: nemo
tags:
  - automatic-speech-recognition
  - speech
  - nemo
  - parakeet
  - fastconformer
  - tdt
  - dutch
  - nvidia
  - common-voice
  - synthetic-speech
  - fine-tuned
datasets:
  - fixie-ai/common_voice_17_0
  - yuriyvnv/synthetic_transcript_nl
base_model: nvidia/parakeet-tdt-0.6b-v3
pipeline_tag: automatic-speech-recognition
model-index:
  - name: parakeet-tdt-0.6b-dutch
    results:
      - task:
          type: automatic-speech-recognition
          name: Speech Recognition
        dataset:
          name: Common Voice 17.0 (nl) - Validation
          type: fixie-ai/common_voice_17_0
          config: nl
          split: validation
        metrics:
          - type: wer
            value: 3.73
            name: Val WER
          - type: cer
            value: 1.02
            name: Val CER
      - task:
          type: automatic-speech-recognition
          name: Speech Recognition
        dataset:
          name: Common Voice 17.0 (nl) - Test
          type: fixie-ai/common_voice_17_0
          config: nl
          split: test
        metrics:
          - type: wer
            value: 5.33
            name: Test WER
          - type: cer
            value: 1.46
            name: Test CER

Parakeet-TDT-0.6B Dutch

A Dutch automatic speech recognition (ASR) model fine-tuned from nvidia/parakeet-tdt-0.6b-v3.

Model Details

Property	Value
Base model	nvidia/parakeet-tdt-0.6b-v3
Architecture	FastConformer-TDT (600M params)
Language	Dutch (nl)
Input	16 kHz mono audio
Output	Dutch text with punctuation and capitalization
License	CC-BY-4.0

Evaluation Results

Evaluated on Common Voice 17.0 Dutch splits (raw text, no normalization):

Split	WER	CER	Samples
Validation	3.73%	1.02%	9,062
Test	5.33%	1.46%	11,266

Training

Fine-tuned on a combination of:

Common Voice 17.0 (nl) -- human-recorded Dutch speech
Synthetic Transcript NL -- 34,898 synthetic Dutch speech samples generated with OpenAI TTS

Training Configuration

Parameter	Value
Optimizer	AdamW
Learning rate	5e-5 (cosine annealing)
Warmup	10% of total steps
Batch size	64
Precision	bf16-mixed
Gradient clipping	1.0
Early stopping	10 epochs patience on val WER
Best epoch	21

Usage

Installation

pip install nemo_toolkit[asr]

Transcribe Audio

import nemo.collections.asr as nemo_asr

# Load model
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="yuriyvnv/parakeet-tdt-0.6b-dutch"
)

# Transcribe
output = asr_model.transcribe(["audio.wav"])
print(output[0].text)

Transcribe with Timestamps

output = asr_model.transcribe(["audio.wav"], timestamps=True)

for stamp in output[0].timestamp["segment"]:
    print(f"{stamp['start']:.1f}s - {stamp['end']:.1f}s : {stamp['segment']}")

Long-Form Audio

For audio longer than 24 minutes, enable local attention:

asr_model.change_attention_model(
    self_attention_model="rel_pos_local_attn",
    att_context_size=[256, 256],
)
output = asr_model.transcribe(["long_audio.wav"])

Intended Use

This model is designed for transcribing Dutch speech to text. It works best on:

Read speech and conversational Dutch
Audio recorded at 16 kHz or higher
Segments up to 24 minutes (or longer with local attention enabled)

Limitations

Trained primarily on European Portuguese-accented Dutch from Common Voice; performance may vary on regional dialects or heavily accented speech
Synthetic training data was generated with OpenAI TTS voices, which may not fully represent natural speech variability
Not suitable for real-time streaming without additional configuration