---
language:
- nl
license: cc-by-4.0
library_name: nemo
tags:
- automatic-speech-recognition
- speech
- nemo
- parakeet
- fastconformer
- tdt
- dutch
- nvidia
- common-voice
- synthetic-speech
- fine-tuned
datasets:
- fixie-ai/common_voice_17_0
- yuriyvnv/synthetic_transcript_nl
base_model: nvidia/parakeet-tdt-0.6b-v3
pipeline_tag: automatic-speech-recognition
model-index:
- name: parakeet-tdt-0.6b-dutch
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: Common Voice 17.0 (nl) - Validation
      type: fixie-ai/common_voice_17_0
      config: nl
      split: validation
    metrics:
    - type: wer
      value: 3.73
      name: Val WER
    - type: cer
      value: 1.02
      name: Val CER
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: Common Voice 17.0 (nl) - Test
      type: fixie-ai/common_voice_17_0
      config: nl
      split: test
    metrics:
    - type: wer
      value: 5.33
      name: Test WER
    - type: cer
      value: 1.46
      name: Test CER
---
Parakeet-TDT-0.6B Dutch
A Dutch automatic speech recognition (ASR) model fine-tuned from nvidia/parakeet-tdt-0.6b-v3.
Model Details
| Property | Value |
|---|---|
| Base model | nvidia/parakeet-tdt-0.6b-v3 |
| Architecture | FastConformer-TDT (600M params) |
| Language | Dutch (nl) |
| Input | 16 kHz mono audio |
| Output | Dutch text with punctuation and capitalization |
| License | CC-BY-4.0 |
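
Since the model expects 16 kHz mono input, it can help to inspect a file's header before transcribing it. The following is a minimal sketch using only Python's standard `wave` module; the helper name `check_wav_format` is hypothetical and not part of NeMo, and it only handles PCM WAV files.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return (ok, details) for a PCM WAV file, reading only its header.

    Hypothetical helper for illustration; not part of NeMo.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration_s = wav.getnframes() / float(rate)
    ok = (rate == expected_rate) and (channels == expected_channels)
    details = {"sample_rate": rate, "channels": channels, "duration_s": duration_s}
    return ok, details
```

A file that fails the check would need resampling or downmixing with an audio tool of your choice before it matches the input format in the table above.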
Evaluation Results
Evaluated on Common Voice 17.0 Dutch splits (raw text, no normalization):
| Split | WER | CER | Samples |
|---|---|---|---|
| Validation | 3.73% | 1.02% | 9,062 |
| Test | 5.33% | 1.46% | 11,266 |
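
To make the metrics concrete, here is a minimal sketch of how WER and CER can be computed on raw text, so that casing and punctuation differences count as errors. It assumes corpus-level scoring (total edit distance divided by total reference length); the card does not state whether scores were aggregated this way or averaged per utterance, and the function names are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def corpus_error_rate(references, hypotheses, unit="word"):
    """Error rate in percent; unit="word" gives WER, unit="char" gives CER.

    No text normalization is applied, mirroring the "raw text" setting.
    """
    total_errors = 0
    total_units = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        total_errors += edit_distance(ref_units, hyp_units)
        total_units += len(ref_units)
    return 100.0 * total_errors / total_units
```

Because nothing is normalized, a hypothesis that differs from the reference only in capitalization or a missing full stop still contributes a word error.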
Training
Fine-tuned on a combination of:
- Common Voice 17.0 (nl) -- human-recorded Dutch speech
- Synthetic Transcript NL -- 34,898 synthetic Dutch speech samples generated with OpenAI TTS
Training Configuration
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 5e-5 (cosine annealing) |
| Warmup | 10% of total steps |
| Batch size | 64 |
| Precision | bf16-mixed |
| Gradient clipping | 1.0 |
| Early stopping | 10 epochs patience on val WER |
| Best epoch | 21 |
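
The learning-rate row combines a warmup phase with cosine annealing. The sketch below shows one common shape for such a schedule, assuming linear warmup over the first 10% of steps and decay to zero; the warmup shape, the minimum learning rate, and the total step count are assumptions, since the card does not specify them, and this is not the scheduler code used in training.

```python
import math


def lr_at_step(step, total_steps, peak_lr=5e-5, warmup_ratio=0.10, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine annealing.

    Assumptions (not from the model card): linear warmup and min_lr=0.0.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp linearly from 0 up to the peak learning rate
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With a hypothetical run of 1,000 steps, the rate would rise during the first 100 steps, reach 5e-5, and then fall along a half cosine until the final step.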
Usage
Installation
pip install "nemo_toolkit[asr]"
Transcribe Audio
import nemo.collections.asr as nemo_asr

# Load model
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="yuriyvnv/parakeet-tdt-0.6b-dutch"
)

# Transcribe
output = asr_model.transcribe(["audio.wav"])
print(output[0].text)
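
For more than one file, the same `transcribe` call accepts a list of paths. The sketch below wraps that in a hypothetical helper, `transcribe_folder`, which takes an already loaded model as an argument. It assumes that each result exposes `.text` as in the snippet above and falls back to the raw value otherwise, since the return type may differ between NeMo versions.

```python
from pathlib import Path


def transcribe_folder(asr_model, folder, pattern="*.wav", batch_size=4):
    """Transcribe every file matching `pattern` in `folder`.

    Hypothetical helper: `asr_model` is a model loaded as shown above.
    Returns a dict mapping file path to transcript text.
    """
    paths = sorted(str(p) for p in Path(folder).glob(pattern))
    if not paths:
        return {}
    outputs = asr_model.transcribe(paths, batch_size=batch_size)
    results = {}
    for path, hyp in zip(paths, outputs):
        # Assumption: results expose `.text`; otherwise keep the value as-is
        results[path] = getattr(hyp, "text", hyp)
    return results
```

Calling `transcribe_folder(asr_model, "recordings")` would then return one transcript per WAV file in a `recordings` directory, in sorted path order.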
Transcribe with Timestamps
# Reuses the asr_model loaded above
output = asr_model.transcribe(["audio.wav"], timestamps=True)
# Print each segment with its start and end time in seconds
for stamp in output[0].timestamp["segment"]:
    print(f"{stamp['start']:.1f}s - {stamp['end']:.1f}s : {stamp['segment']}")
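
One way to use the segment timestamps is to export them as subtitles. The following sketch converts a list of segment dictionaries with the `start`, `end`, and `segment` keys shown above into SRT text; the function names are illustrative and the code does not depend on NeMo.

```python
def to_srt_time(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build SRT text from dicts with 'start', 'end' and 'segment' keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        time_line = f"{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}"
        blocks.append(f"{index}\n{time_line}\n{seg['segment']}\n")
    # Subtitle blocks are separated by a blank line
    return "\n".join(blocks)
```

Passing `output[0].timestamp["segment"]` to `segments_to_srt` and writing the result to a `.srt` file would give a subtitle track aligned to the audio.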
Long-Form Audio
For audio longer than 24 minutes, enable local attention:
asr_model.change_attention_model(
    self_attention_model="rel_pos_local_attn",
    att_context_size=[256, 256],
)
output = asr_model.transcribe(["long_audio.wav"])
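
The switch to local attention can also be made conditional on file length. The sketch below reads the duration from a PCM WAV header with the standard `wave` module and only changes the attention model when the file exceeds the 24-minute threshold mentioned above. The helper names are hypothetical, and the model is passed in as an argument.

```python
import wave

FULL_ATTENTION_LIMIT_S = 24 * 60  # 24-minute threshold from the text above


def wav_duration_seconds(path):
    """Duration of a PCM WAV file in seconds, read from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / float(wav.getframerate())


def transcribe_with_auto_attention(asr_model, path, limit_s=FULL_ATTENTION_LIMIT_S):
    """Enable local attention only for files longer than `limit_s` seconds.

    Hypothetical helper: `asr_model` is a model loaded as shown earlier.
    """
    if wav_duration_seconds(path) > limit_s:
        asr_model.change_attention_model(
            self_attention_model="rel_pos_local_attn",
            att_context_size=[256, 256],
        )
    return asr_model.transcribe([path])
```

Note that in this sketch the attention change stays in effect on the model object after the call, so later short files would also be transcribed with local attention.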
Intended Use
This model is designed for transcribing Dutch speech to text. It works best on:
- Read speech and conversational Dutch
- Audio recorded at 16 kHz or higher
- Segments up to 24 minutes (or longer with local attention enabled)
Limitations
- Trained primarily on Dutch speech from Common Voice; performance may vary on regional dialects or heavily accented speech
- Synthetic training data was generated with OpenAI TTS voices, which may not fully represent natural speech variability
- Not suitable for real-time streaming without additional configuration