yuriyvnv committed · Commit 6c413e4 (verified) · 1 parent: df94cda

Add test set results (WER/CER) and improve metadata

Files changed (1)
  1. README.md +42 -12
README.md CHANGED
@@ -12,6 +12,9 @@ tags:
12
  - tdt
13
  - dutch
14
  - nvidia
15
  datasets:
16
  - fixie-ai/common_voice_17_0
17
  - yuriyvnv/synthetic_transcript_nl
@@ -24,7 +27,7 @@ model-index:
24
  type: automatic-speech-recognition
25
  name: Speech Recognition
26
  dataset:
27
- name: Common Voice 17.0 (nl)
28
  type: fixie-ai/common_voice_17_0
29
  config: nl
30
  split: validation
@@ -32,6 +35,24 @@ model-index:
32
  - type: wer
33
  value: 3.73
34
  name: Val WER
35
  ---
36
 
37
  # Parakeet-TDT-0.6B Dutch
@@ -45,11 +66,19 @@ A Dutch automatic speech recognition (ASR) model fine-tuned from [nvidia/parakee
45
  | Base model | [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) |
46
  | Architecture | FastConformer-TDT (600M params) |
47
  | Language | Dutch (nl) |
48
- | Val WER | **3.73%** |
49
  | Input | 16 kHz mono audio |
50
  | Output | Dutch text with punctuation and capitalization |
51
  | License | CC-BY-4.0 |
52
 
53
  ## Training
54
 
55
  Fine-tuned on a combination of:
@@ -66,6 +95,7 @@ Fine-tuned on a combination of:
66
  | Warmup | 10% of total steps |
67
  | Batch size | 64 |
68
  | Precision | bf16-mixed |
69
  | Early stopping | 10 epochs patience on val WER |
70
  | Best epoch | 21 |
71
 
@@ -113,15 +143,15 @@ asr_model.change_attention_model(
113
  output = asr_model.transcribe(["long_audio.wav"])
114
  ```
115
 
116
- ## Citation
117
 
118
- If you use this model, please cite the base Parakeet model:
119
 
120
- ```bibtex
121
- @misc{parakeet-tdt-0.6b-v3,
122
- title={Parakeet TDT 0.6B v3},
123
- author={NVIDIA},
124
- year={2025},
125
- url={https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
126
- }
127
- ```
 
12
  - tdt
13
  - dutch
14
  - nvidia
15
+ - common-voice
16
+ - synthetic-speech
17
+ - fine-tuned
18
  datasets:
19
  - fixie-ai/common_voice_17_0
20
  - yuriyvnv/synthetic_transcript_nl
 
27
  type: automatic-speech-recognition
28
  name: Speech Recognition
29
  dataset:
30
+ name: Common Voice 17.0 (nl) - Validation
31
  type: fixie-ai/common_voice_17_0
32
  config: nl
33
  split: validation
 
35
  - type: wer
36
  value: 3.73
37
  name: Val WER
38
+ - type: cer
39
+ value: 1.02
40
+ name: Val CER
41
+ - task:
42
+ type: automatic-speech-recognition
43
+ name: Speech Recognition
44
+ dataset:
45
+ name: Common Voice 17.0 (nl) - Test
46
+ type: fixie-ai/common_voice_17_0
47
+ config: nl
48
+ split: test
49
+ metrics:
50
+ - type: wer
51
+ value: 5.33
52
+ name: Test WER
53
+ - type: cer
54
+ value: 1.46
55
+ name: Test CER
56
  ---
57
 
58
  # Parakeet-TDT-0.6B Dutch
 
66
  | Base model | [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) |
67
  | Architecture | FastConformer-TDT (600M params) |
68
  | Language | Dutch (nl) |
 
69
  | Input | 16 kHz mono audio |
70
  | Output | Dutch text with punctuation and capitalization |
71
  | License | CC-BY-4.0 |
72
 
73
+ ## Evaluation Results
74
+
75
+ Evaluated on [Common Voice 17.0](https://huggingface.co/datasets/fixie-ai/common_voice_17_0) Dutch splits (raw text, no normalization):
76
+
77
+ | Split | WER | CER | Samples |
78
+ |---|---|---|---|
79
+ | Validation | **3.73%** | 1.02% | 9,062 |
80
+ | Test | **5.33%** | 1.46% | 11,266 |
81
+
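Reviewer note: for readers who want to sanity-check figures like those in the table, here is a minimal sketch of corpus-level WER and CER on raw text, with no lowercasing or punctuation stripping, in line with the "no normalization" note. It assumes the usual definition (Levenshtein edits divided by reference length, summed over the corpus). The commit does not say which scoring tool produced the reported numbers, so `edit_distance` and `corpus_error_rate` are illustrative names, not the author's code.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_error_rate(refs: List[str], hyps: List[str], unit: str = "word") -> float:
    """Error rate in percent: total edits / total reference tokens.

    unit="word" gives WER (whitespace tokens); unit="char" gives CER.
    Text is compared as-is, so case and punctuation differences count as errors.
    """
    errors, total = 0, 0
    for ref, hyp in zip(refs, hyps):
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        errors += edit_distance(r, h)
        total += len(r)
    if total == 0:
        raise ValueError("references contain no tokens")
    return 100.0 * errors / total


refs = ["Het is mooi weer."]
hyps = ["het is mooi weer."]  # one casing error
wer = corpus_error_rate(refs, hyps, unit="word")
cer = corpus_error_rate(refs, hyps, unit="char")
```

Because scoring is on raw text, a single capitalization slip costs a full word error, which is one reason raw-text WER tends to be higher than normalized WER.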
82
  ## Training
83
 
84
  Fine-tuned on a combination of:
 
95
  | Warmup | 10% of total steps |
96
  | Batch size | 64 |
97
  | Precision | bf16-mixed |
98
+ | Gradient clipping | 1.0 |
99
  | Early stopping | 10 epochs patience on val WER |
100
  | Best epoch | 21 |
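Reviewer note: the "Early stopping" and "Best epoch" rows can be read through the following minimal sketch of patience-based stopping on validation WER. It is an assumption-laden illustration: the training framework's actual callback is not shown in this commit, `early_stop_epoch` is a hypothetical helper, and epochs are counted from 0 here, which may differ from the numbering used in the table.

```python
from typing import List, Tuple


def early_stop_epoch(val_wers: List[float], patience: int = 10) -> Tuple[int, int]:
    """Return (best_epoch, last_epoch_run) for a per-epoch list of validation WERs.

    Training stops once `patience` epochs have passed without a new best WER.
    """
    best_wer = float("inf")
    best_epoch = -1
    for epoch, wer in enumerate(val_wers):
        if wer < best_wer:
            best_wer, best_epoch = wer, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # patience exhausted
    return best_epoch, len(val_wers) - 1  # ran out of epochs first


# Hypothetical WER curve: improves for three epochs, then plateaus.
history = [5.0, 4.0, 3.0] + [3.5] * 10
best, stopped = early_stop_epoch(history, patience=10)
```

Under this scheme the checkpoint that is kept comes from the best epoch, not from the epoch at which training stopped.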
101
 
 
143
  output = asr_model.transcribe(["long_audio.wav"])
144
  ```
145
 
146
+ ## Intended Use
147
 
148
+ This model is designed for transcribing Dutch speech to text. It works best on:
149
+ - Read speech and conversational Dutch
150
+ - Audio recorded at 16 kHz or higher
151
+ - Segments up to 24 minutes (or longer with local attention enabled)
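Reviewer note: the list above can be turned into a quick pre-flight check. The sketch below uses only Python's standard `wave` module to report whether a WAV file already matches the 16 kHz mono input format stated in the model card, and whether it exceeds the 24-minute figure. `describe_wav` and its dictionary keys are hypothetical; actual resampling or downmixing would need an audio library and is not shown.

```python
import wave

TARGET_RATE = 16_000          # Hz, from the "Input" row of the model card
MAX_FULL_ATTENTION_S = 24 * 60  # seconds, from the 24-minute figure


def describe_wav(path: str) -> dict:
    """Inspect a PCM WAV file and flag what preprocessing it would need."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration_s = wav.getnframes() / rate
    return {
        "sample_rate": rate,
        "channels": channels,
        "duration_s": duration_s,
        "needs_resample": rate != TARGET_RATE,
        "needs_downmix": channels != 1,
        "needs_local_attention": duration_s > MAX_FULL_ATTENTION_S,
    }
```

A file flagged with `needs_local_attention` would be a candidate for the local-attention configuration shown earlier in the README.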
152
 
153
+ ## Limitations
154
+
155
+ - Trained primarily on Dutch speech from Common Voice; performance may vary on regional dialects or heavily accented speech
156
+ - Synthetic training data was generated with OpenAI TTS voices, which may not fully represent natural speech variability
157
+ - Not suitable for real-time streaming without additional configuration