Automatic Speech Recognition
NeMo
Dutch
speech
parakeet
fastconformer
tdt
dutch
nvidia
common-voice
synthetic-speech
fine-tuned
Eval Results (legacy)
How to use yuriyvnv/parakeet-tdt-0.6b-dutch with NeMo:
    import nemo.collections.asr as nemo_asr
    asr_model = nemo_asr.models.ASRModel.from_pretrained("yuriyvnv/parakeet-tdt-0.6b-dutch")
    transcriptions = asr_model.transcribe(["file.wav"])
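The model card changed in this commit lists 16 kHz mono audio as the expected input. The following is a minimal sketch, using only Python's standard `wave` module, for checking a PCM WAV file against that format before passing it to `transcribe`. The helper names `describe_wav` and `matches_model_input` are hypothetical and not part of NeMo, and whether NeMo converts other formats internally is not stated here.

```python
import wave

TARGET_RATE = 16_000  # sample rate listed under "Input" in the model card


def describe_wav(path):
    """Return (sample_rate, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate(), wav_file.getnchannels()


def matches_model_input(path):
    """Return True if the file is already 16 kHz mono (hypothetical helper)."""
    sample_rate, channels = describe_wav(path)
    return sample_rate == TARGET_RATE and channels == 1
```

A file that fails the check could be converted with any audio tool before transcription; the choice of tool is left open here.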
Add test set results (WER/CER) and improve metadata
README.md
CHANGED
@@ -12,6 +12,9 @@ tags:
 - tdt
 - dutch
 - nvidia
+- common-voice
+- synthetic-speech
+- fine-tuned
 datasets:
 - fixie-ai/common_voice_17_0
 - yuriyvnv/synthetic_transcript_nl
@@ -24,7 +27,7 @@ model-index:
       type: automatic-speech-recognition
       name: Speech Recognition
     dataset:
-      name: Common Voice 17.0 (nl)
+      name: Common Voice 17.0 (nl) - Validation
       type: fixie-ai/common_voice_17_0
       config: nl
       split: validation
@@ -32,6 +35,24 @@ model-index:
     - type: wer
       value: 3.73
       name: Val WER
+    - type: cer
+      value: 1.02
+      name: Val CER
+  - task:
+      type: automatic-speech-recognition
+      name: Speech Recognition
+    dataset:
+      name: Common Voice 17.0 (nl) - Test
+      type: fixie-ai/common_voice_17_0
+      config: nl
+      split: test
+    metrics:
+    - type: wer
+      value: 5.33
+      name: Test WER
+    - type: cer
+      value: 1.46
+      name: Test CER
 ---

 # Parakeet-TDT-0.6B Dutch
@@ -45,11 +66,19 @@ A Dutch automatic speech recognition (ASR) model fine-tuned from [nvidia/parakee
 | Base model | [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) |
 | Architecture | FastConformer-TDT (600M params) |
 | Language | Dutch (nl) |
-| Val WER | **3.73%** |
 | Input | 16 kHz mono audio |
 | Output | Dutch text with punctuation and capitalization |
 | License | CC-BY-4.0 |

+## Evaluation Results
+
+Evaluated on [Common Voice 17.0](https://huggingface.co/datasets/fixie-ai/common_voice_17_0) Dutch splits (raw text, no normalization):
+
+| Split | WER | CER | Samples |
+|---|---|---|---|
+| Validation | **3.73%** | 1.02% | 9,062 |
+| Test | **5.33%** | 1.46% | 11,266 |
+
 ## Training

 Fine-tuned on a combination of:
@@ -66,6 +95,7 @@ Fine-tuned on a combination of:
 | Warmup | 10% of total steps |
 | Batch size | 64 |
 | Precision | bf16-mixed |
+| Gradient clipping | 1.0 |
 | Early stopping | 10 epochs patience on val WER |
 | Best epoch | 21 |

@@ -113,15 +143,15 @@ asr_model.change_attention_model(
 output = asr_model.transcribe(["long_audio.wav"])
 ```

-##
+## Intended Use

-
+This model is designed for transcribing Dutch speech to text. It works best on:
+- Read speech and conversational Dutch
+- Audio recorded at 16 kHz or higher
+- Segments up to 24 minutes (or longer with local attention enabled)

-
-
-
-
-
-url={https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
-}
-```
+## Limitations
+
+- Trained primarily on European Portuguese-accented Dutch from Common Voice; performance may vary on regional dialects or heavily accented speech
+- Synthetic training data was generated with OpenAI TTS voices, which may not fully represent natural speech variability
+- Not suitable for real-time streaming without additional configuration
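
The Evaluation Results section added in this commit reports WER and CER on raw text with no normalization, so case and punctuation differences count as errors. The commit does not say which tool produced the numbers. The sketch below is one plain-Python way to compute such scores, assuming corpus-level aggregation (total edits divided by total reference units); the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def corpus_error_rate(references, hypotheses, unit="word"):
    """Total edits over total reference units, in percent.

    unit="word" gives WER, unit="char" gives CER. No text normalization
    is applied, matching the "raw text" note in the README (assumption:
    scores are aggregated over the whole split, not averaged per sample).
    """
    total_edits = 0
    total_units = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        total_edits += edit_distance(ref_units, hyp_units)
        total_units += len(ref_units)
    return 100.0 * total_edits / total_units


references = ["Dit is een test."]
hypotheses = ["dit is een test"]
print(corpus_error_rate(references, hypotheses, unit="word"))  # 50.0
print(corpus_error_rate(references, hypotheses, unit="char"))  # 12.5
```

In the toy pair, the only differences are the capital letter and the final period, yet two of four words are scored as wrong. This is why unnormalized WER tends to be higher than normalized WER for the same transcripts.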
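
The training table in the README lists early stopping with a patience of 10 epochs on validation WER. The sketch below shows that stopping rule in plain Python, assuming one validation pass per epoch and that only a strict decrease in WER counts as an improvement. The class `EarlyStopping` is a hypothetical illustration, not the callback the author used.

```python
class EarlyStopping:
    """Stop when validation WER has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_wer = float("inf")
        self.best_epoch = None
        self.epochs_without_improvement = 0

    def update(self, epoch, val_wer):
        """Record one validation result; return True when training should stop."""
        if val_wer < self.best_wer:
            # Strict improvement: remember it and reset the counter.
            self.best_wer = val_wer
            self.best_epoch = epoch
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Toy run with a short patience so the rule is easy to follow.
stopper = EarlyStopping(patience=3)
for epoch, wer in enumerate([5.0, 4.0, 4.5, 4.2, 4.1, 3.9]):
    if stopper.update(epoch, wer):
        break
print(stopper.best_epoch, stopper.best_wer)  # 1 4.0
```

In the toy run, training stops after the third non-improving epoch, so the later 3.9 is never reached. A larger patience, such as the 10 epochs listed in the README, trades extra training time for a lower chance of stopping too early.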