| language: | |
| - nl | |
| license: cc-by-4.0 | |
| library_name: nemo | |
| tags: | |
| - automatic-speech-recognition | |
| - speech | |
| - nemo | |
| - parakeet | |
| - fastconformer | |
| - tdt | |
| - dutch | |
| - nvidia | |
| - common-voice | |
| - synthetic-speech | |
| - fine-tuned | |
| datasets: | |
| - fixie-ai/common_voice_17_0 | |
| - yuriyvnv/synthetic_transcript_nl | |
| base_model: nvidia/parakeet-tdt-0.6b-v3 | |
| pipeline_tag: automatic-speech-recognition | |
| model-index: | |
| - name: parakeet-tdt-0.6b-dutch | |
|   results: | |
|   - task: | |
|       type: automatic-speech-recognition | |
|       name: Speech Recognition | |
|     dataset: | |
|       name: Common Voice 17.0 (nl) - Validation | |
|       type: fixie-ai/common_voice_17_0 | |
|       config: nl | |
|       split: validation | |
|     metrics: | |
|     - type: wer | |
|       value: 3.73 | |
|       name: Val WER | |
|     - type: cer | |
|       value: 1.02 | |
|       name: Val CER | |
|   - task: | |
|       type: automatic-speech-recognition | |
|       name: Speech Recognition | |
|     dataset: | |
|       name: Common Voice 17.0 (nl) - Test | |
|       type: fixie-ai/common_voice_17_0 | |
|       config: nl | |
|       split: test | |
|     metrics: | |
|     - type: wer | |
|       value: 5.33 | |
|       name: Test WER | |
|     - type: cer | |
|       value: 1.46 | |
|       name: Test CER | |
| # Parakeet-TDT-0.6B Dutch | |
| A Dutch automatic speech recognition (ASR) model fine-tuned from [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3). | |
| ## Model Details | |
| | Property | Value | | |
| |---|---| | |
| | Base model | [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) | | |
| | Architecture | FastConformer-TDT (600M params) | | |
| | Language | Dutch (nl) | | |
| | Input | 16 kHz mono audio | | |
| | Output | Dutch text with punctuation and capitalization | | |
| | License | CC-BY-4.0 | | |
| ## Evaluation Results | |
| Evaluated on [Common Voice 17.0](https://huggingface.co/datasets/fixie-ai/common_voice_17_0) Dutch splits (raw text, no normalization): | |
| | Split | WER | CER | Samples | | |
| |---|---|---|---| | |
| | Validation | **3.73%** | 1.02% | 9,062 | | |
| | Test | **5.33%** | 1.46% | 11,266 | | |
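The card does not say which tool or aggregation produced these numbers. As a minimal sketch, assuming corpus-level scoring (total edits divided by total reference units) on raw text, WER and CER can be computed with a plain Levenshtein distance. The function names and the two sample sentences below are illustrative, not taken from the evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Word error rate in percent, aggregated over the whole corpus (assumed)."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return 100.0 * errors / words


def corpus_cer(references, hypotheses):
    """Character error rate in percent, aggregated over the whole corpus (assumed)."""
    errors = sum(edit_distance(list(r), list(h))
                 for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return 100.0 * errors / chars


# Illustrative sentences only; no normalization is applied before scoring.
refs = ["Dit is een test.", "Goedemorgen, Amsterdam!"]
hyps = ["Dit is een test", "Goedemorgen, Amsterdam!"]
print(f"WER: {corpus_wer(refs, hyps):.2f}%  CER: {corpus_cer(refs, hyps):.2f}%")
```

Because no normalization is applied, the missing final period in the first hypothesis counts as one word error out of six reference words. This is why raw-text scores tend to be higher than scores on normalized text.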
| ## Training | |
| Fine-tuned on a combination of: | |
| - **[Common Voice 17.0](https://huggingface.co/datasets/fixie-ai/common_voice_17_0)** (nl) -- human-recorded Dutch speech | |
| - **[Synthetic Transcript NL](https://huggingface.co/datasets/yuriyvnv/synthetic_transcript_nl)** -- 34,898 synthetic Dutch speech samples generated with OpenAI TTS | |
| ### Training Configuration | |
| | Parameter | Value | | |
| |---|---| | |
| | Optimizer | AdamW | | |
| | Learning rate | 5e-5 (cosine annealing) | | |
| | Warmup | 10% of total steps | | |
| | Batch size | 64 | | |
| | Precision | bf16-mixed | | |
| | Gradient clipping | 1.0 | | |
| | Early stopping | 10 epochs patience on val WER | | |
| | Best epoch | 21 | | |
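The learning-rate row can be read as a warmup phase followed by cosine decay. The sketch below assumes linear warmup and a final learning rate of zero, neither of which the table states. `lr_at_step` and `total_steps` are hypothetical names for illustration, and the scheduler used in training may differ in detail.

```python
import math


def lr_at_step(step, total_steps, peak_lr=5e-5, warmup_ratio=0.10, min_lr=0.0):
    """Learning rate at a given optimizer step.

    Assumptions (not stated in the card): linear warmup, decay to min_lr=0.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length, chosen only to make the shape easy to read.
total_steps = 1000
for step in (0, 99, 100, 550, 1000):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```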
| ## Usage | |
| ### Installation | |
| ```bash | |
| pip install "nemo_toolkit[asr]" | |
| ``` | |
| ### Transcribe Audio | |
| ```python | |
| import nemo.collections.asr as nemo_asr | |
| # Load model | |
| asr_model = nemo_asr.models.ASRModel.from_pretrained( | |
|     model_name="yuriyvnv/parakeet-tdt-0.6b-dutch" | |
| ) | |
| # Transcribe | |
| output = asr_model.transcribe(["audio.wav"]) | |
| print(output[0].text) | |
| ``` | |
| ### Transcribe with Timestamps | |
| ```python | |
| output = asr_model.transcribe(["audio.wav"], timestamps=True) | |
| for stamp in output[0].timestamp["segment"]: | |
|     print(f"{stamp['start']:.1f}s - {stamp['end']:.1f}s : {stamp['segment']}") | |
| ``` | |
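The segment entries can be turned into subtitles. This is a minimal sketch that assumes each entry is a dict with the `start`, `end`, and `segment` keys used in the snippet above. The helper names `to_srt_time` and `segments_to_srt` are hypothetical, and SRT is one possible output format, not something the model produces itself.

```python
def to_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build SRT text from dicts with 'start', 'end' and 'segment' keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n"
            f"{seg['segment']}\n"
        )
    # SRT cues are separated by a blank line.
    return "\n".join(blocks)


# Hand-written stand-in for output[0].timestamp["segment"].
example_segments = [
    {"start": 0.0, "end": 1.52, "segment": "Goedemorgen."},
    {"start": 1.8, "end": 4.25, "segment": "Hoe gaat het met u?"},
]
print(segments_to_srt(example_segments))
```

With a loaded model, the same helper could be called as `segments_to_srt(output[0].timestamp["segment"])`.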
| ### Long-Form Audio | |
| For audio longer than 24 minutes, enable local attention: | |
| ```python | |
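| # Switch the encoder from full attention to local (windowed) attention | |
| asr_model.change_attention_model( | |
|     self_attention_model="rel_pos_local_attn", | |
|     att_context_size=[256, 256],  # [left, right] context | |
| ) | |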
| output = asr_model.transcribe(["long_audio.wav"]) | |
| ``` | |
| ## Intended Use | |
| This model is designed for transcribing Dutch speech to text. It works best on: | |
| - Read speech and conversational Dutch | |
| - Audio recorded at 16 kHz or higher | |
| - Segments up to 24 minutes (or longer with local attention enabled) | |
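Before transcribing, a file can be checked against the points above. The sketch below uses only the standard-library `wave` module, so it handles uncompressed PCM WAV files only. The helper name `inspect_wav` and the returned keys are hypothetical. The 16 kHz and 24-minute thresholds come from this card; the mono check comes from the Model Details table.

```python
import wave

TARGET_SAMPLE_RATE = 16_000        # Hz, from the Model Details table
FULL_ATTENTION_LIMIT_S = 24 * 60   # 24 minutes, from the Long-Form Audio section


def inspect_wav(path):
    """Read WAV header fields and flag anything that may need attention."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        duration_s = wav.getnframes() / sample_rate
    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "duration_s": duration_s,
        # Recordings below 16 kHz fall outside the recommended range.
        "below_target_rate": sample_rate < TARGET_SAMPLE_RATE,
        "is_mono": channels == 1,
        # Longer files call for the local-attention setup described above.
        "needs_local_attention": duration_s > FULL_ATTENTION_LIMIT_S,
    }
```

A caller might, for example, call `asr_model.change_attention_model(...)` only when `needs_local_attention` is true, and downmix or resample beforehand when the other flags indicate a mismatch.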
| ## Limitations | |
| - Trained primarily on Dutch speech from Common Voice; performance may vary on regional dialects or heavily accented speech | |
| - Synthetic training data was generated with OpenAI TTS voices, which may not fully represent natural speech variability | |
| - Not suitable for real-time streaming without additional configuration | |