istomin9192/whisper-small-sr

	---
	license: apache-2.0
	language:
	- sr
	base_model:
	- openai/whisper-small
	datasets:
	- google/fleurs
	- Sagicc/audio-lmb-ds
	- espnet/yodas_owsmv4
	- classla/ParlaSpeech-RS
	metrics:
	- wer
	model-index:
	- name: Whisper Small
	  results:
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Common Voice 24.0
	      type: mozilla-foundation/common_voice_24_0
	      config: sr
	      split: test
	      args: sr
	    metrics:
	    - name: Wer
	      type: wer
	      value: 0.065924219787
	library_name: transformers
	---

	# whisper-small-sr

	A fine-tune of OpenAI Whisper Small (`openai/whisper-small`) for Serbian automatic speech recognition.

	Output script: the model is intended to produce Serbian Latin script only.

	- WER on the Common Voice 24.0 Serbian test split: 6.59%
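
The card does not include a usage snippet, so the following is a minimal sketch of how a checkpoint like this is typically loaded with the `transformers` ASR pipeline. It assumes the repository id `istomin9192/whisper-small-sr`, that `transformers` and a backend such as PyTorch are installed, and that `ffmpeg` is available for decoding audio files. The `language` and `task` generation arguments are an assumption about how the checkpoint should be prompted, not something the card states.

```python
MODEL_ID = "istomin9192/whisper-small-sr"  # assumed repository id


def transcribe(audio_path: str) -> str:
    """Transcribe a single audio file and return the recognized text."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=MODEL_ID)
    # Assumption: pin Whisper to Serbian transcription instead of auto-detection.
    result = asr(
        audio_path,
        generate_kwargs={"language": "serbian", "task": "transcribe"},
    )
    return result["text"]


# Example call (hypothetical local file):
# print(transcribe("sample.wav"))
```

The first call downloads the model weights, so it is worth building the pipeline once and reusing it when transcribing many files.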

	## Model description


	## Training and evaluation data

	This model was fine-tuned on a mixture of publicly available Serbian speech corpora, including:

	- Mozilla Common Voice 24.0 (Serbian); its test split (`sr`) is used for evaluation
	- FLEURS Serbian
	- ParlaSpeech-RS (subset of the full dataset)
	- Additional Serbian corpora used in the training pipeline (the card metadata also lists `Sagicc/audio-lmb-ds` and `espnet/yodas_owsmv4`)


	## Training procedure

	- Epochs: 9
	- Batch size: 32 / 20
	- Optimizer: AdamW
	- LR: 6e-5 with warmup (50 steps) + cosine decay to min_lr = 1e-7
	- Mixed precision: bfloat16 (fp32 in the final epoch)
	- SpecAugment: frequency + time masking
	- Sampling: weighted sampling across datasets
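
The learning-rate settings above can be sketched as a small function. This is an illustration under assumptions, not the training code: the card gives the peak rate, the warmup length, and the floor, but not the warmup shape (linear is assumed here) or the total number of optimizer steps (`total_steps` is a hypothetical parameter).

```python
import math

PEAK_LR = 6e-5     # from the card
MIN_LR = 1e-7      # from the card
WARMUP_STEPS = 50  # from the card


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at a 0-based optimizer step.

    Assumed shape: linear warmup to PEAK_LR, then cosine decay to MIN_LR.
    """
    if total_steps <= WARMUP_STEPS:
        raise ValueError("total_steps must exceed WARMUP_STEPS")
    if step < WARMUP_STEPS:
        # Linear ramp; step + 1 keeps the very first step above zero.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - WARMUP_STEPS) / (total_steps - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine
```

With a hypothetical `total_steps = 1000`, the rate reaches 6e-5 at the end of warmup and falls to 1e-7 at the final step.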

	### Training results

	| Epoch | Train loss | CV WER |
	|------:|-----------:|-------:|
	| 1 | 0.333 | 0.1614 |
	| 2 | 0.344 | 0.1278 |
	| 3 | 0.251 | 0.1112 |
	| 4 | 0.202 | 0.1032 |
	| 5 | 0.167 | 0.0934 |
	| 6 | 0.138 | 0.0790 |
	| 7 | 0.118 | 0.0740 |
	| 8 | 0.103 | 0.0709 |
	| 9 | 0.096 | 0.0659 |

	## Evaluation metrics

	- WER (normalized) on the Common Voice 24.0 Serbian test split: 6.59%
	- Text normalization used for WER:
	  - punctuation removed
	  - lowercased
	  - Cyrillic → Latin conversion
	  - numbers converted to words
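
The normalization steps listed above, followed by a word-level WER, might look roughly like the sketch below. It is not the evaluation script behind the reported numbers. The order of the steps, replacing punctuation with a space, and the `number_to_words` hook are all assumptions; the hook is a hypothetical caller-supplied function because spelling out Serbian numerals needs a dedicated library or lookup table.

```python
import re
from typing import Callable, Optional

# Serbian Cyrillic to Latin, lowercase only (input is lowercased first).
_CYR_TO_LAT = str.maketrans({
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
})


def normalize(text: str,
              number_to_words: Optional[Callable[[int], str]] = None) -> str:
    """Lowercase, transliterate to Latin, spell out numbers, drop punctuation."""
    text = text.lower().translate(_CYR_TO_LAT)
    if number_to_words is not None:
        # Hypothetical hook: maps an integer to its spelled-out Serbian form.
        text = re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation becomes a space
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)
```

For a whole test set, WER is usually computed as total word errors divided by total reference words across all utterances, which is not the same as averaging the per-utterance values.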