---
language:
- nl
license: cc-by-4.0
library_name: nemo
tags:
- automatic-speech-recognition
- speech
- nemo
- parakeet
- fastconformer
- tdt
- dutch
- nvidia
- common-voice
- synthetic-speech
- fine-tuned
datasets:
- fixie-ai/common_voice_17_0
- yuriyvnv/synthetic_transcript_nl
base_model: nvidia/parakeet-tdt-0.6b-v3
pipeline_tag: automatic-speech-recognition
model-index:
- name: parakeet-tdt-0.6b-dutch
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: Common Voice 17.0 (nl) - Validation
      type: fixie-ai/common_voice_17_0
      config: nl
      split: validation
    metrics:
    - type: wer
      value: 3.73
      name: Val WER
    - type: cer
      value: 1.02
      name: Val CER
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: Common Voice 17.0 (nl) - Test
      type: fixie-ai/common_voice_17_0
      config: nl
      split: test
    metrics:
    - type: wer
      value: 5.33
      name: Test WER
    - type: cer
      value: 1.46
      name: Test CER
---

# Parakeet-TDT-0.6B Dutch

A Dutch automatic speech recognition (ASR) model fine-tuned from [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3).

## Model Details

| Property | Value |
|---|---|
| Base model | [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) |
| Architecture | FastConformer-TDT (600M params) |
| Language | Dutch (nl) |
| Input | 16 kHz mono audio |
| Output | Dutch text with punctuation and capitalization |
| License | CC-BY-4.0 |

## Evaluation Results

Evaluated on the Dutch splits of [Common Voice 17.0](https://huggingface.co/datasets/fixie-ai/common_voice_17_0). Scores are computed on raw text with no normalization, so differences in casing and punctuation count as errors:

| Split | WER | CER | Samples |
|---|---|---|---|
| Validation | **3.73%** | 1.02% | 9,062 |
| Test | **5.33%** | 1.46% | 11,266 |

## Training

Fine-tuned on a combination of:

- **[Common Voice 17.0](https://huggingface.co/datasets/fixie-ai/common_voice_17_0)** (nl) -- human-recorded Dutch speech
- **[Synthetic Transcript NL](https://huggingface.co/datasets/yuriyvnv/synthetic_transcript_nl)** -- 34,898 synthetic Dutch speech samples generated with OpenAI TTS

### Training Configuration

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 5e-5 (cosine annealing) |
| Warmup | 10% of total steps |
| Batch size | 64 |
| Precision | bf16-mixed |
| Gradient clipping | 1.0 |
| Early stopping | Patience of 10 epochs on validation WER |
| Best epoch | 21 |

## Usage

### Installation

```bash
# Quote the extra so shells such as zsh do not expand the brackets
pip install "nemo_toolkit[asr]"
```

### Transcribe Audio

```python
import nemo.collections.asr as nemo_asr

# Load the fine-tuned model from the Hugging Face Hub
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="yuriyvnv/parakeet-tdt-0.6b-dutch"
)

# Transcribe a list of audio files (16 kHz mono); one result per file
output = asr_model.transcribe(["audio.wav"])
print(output[0].text)
```

### Transcribe with Timestamps

```python
# Reuses asr_model loaded in the previous example
output = asr_model.transcribe(["audio.wav"], timestamps=True)

# Print the start time, end time, and text of each segment
for stamp in output[0].timestamp["segment"]:
    print(f"{stamp['start']:.1f}s - {stamp['end']:.1f}s : {stamp['segment']}")
```

### Long-Form Audio

For audio longer than 24 minutes, enable local attention:

```python
# Switch the encoder to local (limited-context) attention before transcribing
asr_model.change_attention_model(
    self_attention_model="rel_pos_local_attn",
    att_context_size=[256, 256],
)

output = asr_model.transcribe(["long_audio.wav"])
```

## Intended Use

This model is designed for transcribing Dutch speech to text. It works best on:

- Read speech and conversational Dutch
- Audio recorded at 16 kHz or higher
- Segments up to 24 minutes (or longer with local attention enabled)

## Limitations

- Trained primarily on Dutch speech from Common Voice; performance may vary on regional dialects or heavily accented speech
- Synthetic training data was generated with OpenAI TTS voices, which may not fully represent natural speech variability
- Not suitable for real-time streaming without additional configuration
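
## Computing WER and CER

The evaluation table reports word error rate (WER) and character error rate (CER) on raw text. This card does not say which tool produced those numbers, so the following is only a minimal sketch of the standard definition (total edit distance divided by total reference length), not the author's evaluation script. The helper names `edit_distance` and `error_rate` and the sample sentences are hypothetical and chosen for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, unit="word"):
    """Corpus-level error rate: total edits divided by total reference units."""
    total_edits = 0
    total_units = 0
    for ref, hyp in zip(references, hypotheses):
        # Split into words for WER, into characters (spaces included) for CER
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        total_edits += edit_distance(ref_units, hyp_units)
        total_units += len(ref_units)
    return total_edits / total_units


# Illustrative sentences only; not taken from Common Voice
refs = ["Dit is een test.", "Hoe gaat het?"]
hyps = ["Dit is een test.", "hoe gaat het"]

wer = error_rate(refs, hyps, unit="word")
cer = error_rate(refs, hyps, unit="char")
print(f"WER: {wer:.2%}")
print(f"CER: {cer:.2%}")
```

Because nothing is normalized, the lowercase "hoe" and the missing question mark in the second hypothesis each count as an error, even though every word was recognized. Scores computed after lowercasing or stripping punctuation are therefore not directly comparable with the raw-text figures in the table.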