	---
	language:
	- nl
	license: cc-by-4.0
	library_name: nemo
	tags:
	- automatic-speech-recognition
	- speech
	- nemo
	- parakeet
	- fastconformer
	- tdt
	- dutch
	- nvidia
	- common-voice
	- synthetic-speech
	- fine-tuned
	datasets:
	- fixie-ai/common_voice_17_0
	- yuriyvnv/synthetic_transcript_nl
	base_model: nvidia/parakeet-tdt-0.6b-v3
	pipeline_tag: automatic-speech-recognition
	model-index:
	- name: parakeet-tdt-0.6b-dutch
	  results:
	  - task:
	      type: automatic-speech-recognition
	      name: Speech Recognition
	    dataset:
	      name: Common Voice 17.0 (nl) - Validation
	      type: fixie-ai/common_voice_17_0
	      config: nl
	      split: validation
	    metrics:
	    - type: wer
	      value: 3.73
	      name: Val WER
	    - type: cer
	      value: 1.02
	      name: Val CER
	  - task:
	      type: automatic-speech-recognition
	      name: Speech Recognition
	    dataset:
	      name: Common Voice 17.0 (nl) - Test
	      type: fixie-ai/common_voice_17_0
	      config: nl
	      split: test
	    metrics:
	    - type: wer
	      value: 5.33
	      name: Test WER
	    - type: cer
	      value: 1.46
	      name: Test CER
	---

	# Parakeet-TDT-0.6B Dutch

	A Dutch automatic speech recognition (ASR) model fine-tuned from [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3).

	## Model Details

	| Property | Value |
	|---|---|
	| Base model | [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) |
	| Architecture | FastConformer-TDT (600M params) |
	| Language | Dutch (nl) |
	| Input | 16 kHz mono audio |
	| Output | Dutch text with punctuation and capitalization |
	| License | CC-BY-4.0 |

	## Evaluation Results

	Evaluated on [Common Voice 17.0](https://huggingface.co/datasets/fixie-ai/common_voice_17_0) Dutch splits (raw text, no normalization):

	| Split | WER | CER | Samples |
	|---|---|---|---|
	| Validation | 3.73% | 1.02% | 9,062 |
	| Test | 5.33% | 1.46% | 11,266 |
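
For readers who want to see what "raw text, no normalization" implies for these metrics, here is a minimal sketch of corpus-level WER and CER computed with a plain Levenshtein distance. It is not the evaluation script used for this card. The function names and sample sentences are illustrative assumptions, and counting spaces as characters in CER is one convention among several.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(pairs, unit="word"):
    """Corpus-level error rate in percent: total edits / total reference units."""
    edits = total = 0
    for ref, hyp in pairs:
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        edits += edit_distance(r, h)
        total += len(r)
    return 100.0 * edits / total


# Hypothetical (reference, hypothesis) pairs. Without normalization, the
# missing full stop in the first hypothesis counts as a word error.
pairs = [
    ("Dit is een test.", "Dit is een test"),
    ("Goedemorgen allemaal.", "Goedemorgen allemaal."),
]
print(f"WER: {error_rate(pairs, 'word'):.2f}%")
print(f"CER: {error_rate(pairs, 'char'):.2f}%")
```

Because punctuation and casing are scored as-is, numbers computed this way are typically higher than those reported after text normalization.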

	## Training

	Fine-tuned on a combination of:

	- [Common Voice 17.0](https://huggingface.co/datasets/fixie-ai/common_voice_17_0) (nl) -- human-recorded Dutch speech
	- [Synthetic Transcript NL](https://huggingface.co/datasets/yuriyvnv/synthetic_transcript_nl) -- 34,898 synthetic Dutch speech samples generated with OpenAI TTS
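
The card does not say how the two sources were merged. One common approach with NeMo is a single JSON-lines manifest in which each line carries `audio_filepath`, `duration`, and `text`. The sketch below assumes that approach. The helper name, file paths, and sample texts are hypothetical.

```python
import json
import random


def build_manifest(sources, seed=0):
    """Merge per-dataset sample lists into shuffled JSON-lines manifest entries."""
    entries = []
    for samples in sources:
        for audio_filepath, duration, text in samples:
            entries.append(
                {"audio_filepath": audio_filepath, "duration": duration, "text": text}
            )
    # Shuffle so human and synthetic samples are interleaved.
    random.Random(seed).shuffle(entries)
    return [json.dumps(e, ensure_ascii=False) for e in entries]


# Hypothetical samples: (path, duration in seconds, transcript).
common_voice = [("cv/clip_0001.wav", 4.2, "Dit is een voorbeeld."),
                ("cv/clip_0002.wav", 3.1, "Hoe gaat het met je?")]
synthetic = [("synthetic/sample_0001.wav", 5.0, "Het regent vandaag in Utrecht.")]

manifest_lines = build_manifest([common_voice, synthetic])
print("\n".join(manifest_lines))
```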

	### Training Configuration

	| Parameter | Value |
	|---|---|
	| Optimizer | AdamW |
	| Learning rate | 5e-5 (cosine annealing) |
	| Warmup | 10% of total steps |
	| Batch size | 64 |
	| Precision | bf16-mixed |
	| Gradient clipping | 1.0 |
	| Early stopping | 10 epochs patience on val WER |
	| Best epoch | 21 |
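
To make the learning-rate rows concrete, the following sketch shows the shape such a schedule might take: a warmup over the first 10% of steps, then cosine annealing from the 5e-5 peak. The linear warmup, the minimum learning rate of 0, and the step count are assumptions for illustration. The scheduler used in training may differ in these details.

```python
import math


def lr_at_step(step, total_steps, peak_lr=5e-5, warmup_frac=0.10, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine annealing."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Ramp linearly up to the peak over the warmup phase (assumed shape).
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


total = 1000  # hypothetical number of optimizer steps
for step in (0, 50, 99, 100, 550, 999):
    print(step, f"{lr_at_step(step, total):.2e}")
```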

	## Usage

	### Installation

	```bash
	pip install "nemo_toolkit[asr]"
	```

	### Transcribe Audio

	```python
	import nemo.collections.asr as nemo_asr

	# Load model
	asr_model = nemo_asr.models.ASRModel.from_pretrained(
	    model_name="yuriyvnv/parakeet-tdt-0.6b-dutch"
	)

	# Transcribe
	output = asr_model.transcribe(["audio.wav"])
	print(output[0].text)
	```

	### Transcribe with Timestamps

	```python
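	# Reuses asr_model loaded in the previous example
	output = asr_model.transcribe(["audio.wav"], timestamps=True)

	# Print each segment with its start and end time in seconds
	for stamp in output[0].timestamp["segment"]:
	    print(f"{stamp['start']:.1f}s - {stamp['end']:.1f}s : {stamp['segment']}")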
	```
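
Segment timestamps are often exported as subtitles. The sketch below converts a list of segment dictionaries into SRT text, assuming each one has the `start`, `end`, and `segment` keys used in the loop above. The helper names and the sample segments are illustrative, not part of NeMo.

```python
def srt_time(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render segment dicts with 'start', 'end', and 'segment' keys as SRT."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n"
            f"{seg['segment']}\n"
        )
    return "\n".join(blocks)


# Hypothetical segments in the same shape as output[0].timestamp["segment"].
example_segments = [
    {"start": 0.0, "end": 1.2, "segment": "Hallo wereld."},
    {"start": 1.5, "end": 3.0, "segment": "Dit is een test."},
]
print(segments_to_srt(example_segments))
```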

	### Long-Form Audio

	For audio longer than 24 minutes, enable local attention:

	```python
	asr_model.change_attention_model(
	    self_attention_model="rel_pos_local_attn",
	    att_context_size=[256, 256],
	)
	output = asr_model.transcribe(["long_audio.wav"])
	```

	## Intended Use

	This model is designed for transcribing Dutch speech to text. It works best on:
	- Read speech and conversational Dutch
	- Audio sampled at 16 kHz or higher (the model takes 16 kHz mono input)
	- Segments up to 24 minutes (or longer with local attention enabled)
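
Before transcribing, it can help to check a file against the constraints listed above. The following sketch uses only Python's standard `wave` module, so it handles uncompressed PCM WAV files only. The helper name and the returned keys are assumptions made for illustration.

```python
import wave

MAX_FULL_ATTENTION_SECONDS = 24 * 60  # 24-minute limit quoted in this card


def inspect_wav(path):
    """Report sample rate, channel count, and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        duration = wf.getnframes() / rate
    return {
        "sample_rate": rate,
        "channels": channels,
        "duration_s": duration,
        # True when the file already matches the model's stated input format.
        "is_16k_mono": rate == 16000 and channels == 1,
        # True when the file exceeds the limit for full attention.
        "needs_local_attention": duration > MAX_FULL_ATTENTION_SECONDS,
    }
```

A typical use would be to call `inspect_wav("audio.wav")` and switch to local attention only when `needs_local_attention` is true.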

	## Limitations

	- Trained primarily on Dutch speech from Common Voice; performance may vary on regional dialects or heavily accented speech
	- Synthetic training data was generated with OpenAI TTS voices, which may not fully represent natural speech variability
	- Not suitable for real-time streaming without additional configuration