---
license: apache-2.0
language:
- sr
base_model:
- openai/whisper-small
datasets:
- google/fleurs
- Sagicc/audio-lmb-ds
- espnet/yodas_owsmv4
- classla/ParlaSpeech-RS
metrics:
- wer
model-index:
- name: Whisper Small
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 24.0
      type: mozilla-foundation/common_voice_24_0
      config: sr
      split: test
      args: sr
    metrics:
    - name: Wer
      type: wer
      value: 0.065924219787
library_name: transformers
---

# whisper-small-sr

A fine-tuned version of **OpenAI Whisper Small** for Serbian automatic speech recognition.

**Output script:** this model is intended to produce **Serbian Latin** only.

- **WER** on Common Voice 24.0 Serbian test: **6.59%**

## Model description

## Training and evaluation data

This model was fine-tuned on a **mixture of publicly available Serbian speech corpora**, including:

- Mozilla Common Voice 24.0 (evaluation uses the **CV test (sr)** split)
- FLEURS Serbian
- ParlaSpeech-RS (subset of the full dataset)
- Additional Serbian corpora used in the training pipeline

## Training procedure

- Epochs: 9
- Batch size: 32 / 20
- Optimizer: AdamW
- LR: 6e-5 with warmup (50 steps) + cosine decay to min_lr = 1e-7
- Mixed precision: bfloat16 (fp32 in the final epoch)
- SpecAugment: frequency + time masking
- Sampling: weighted sampling across datasets

### Training results

| Epoch | Train loss | CV WER |
|------:|-----------:|-------:|
| 1 | 0.333 | 0.1614 |
| 2 | 0.344 | 0.1278 |
| 3 | 0.251 | 0.1112 |
| 4 | 0.202 | 0.1032 |
| 5 | 0.167 | 0.0934 |
| 6 | 0.138 | 0.0790 |
| 7 | 0.118 | 0.0740 |
| 8 | 0.103 | 0.0709 |
| 9 | 0.096 | 0.0659 |

## Evaluation Metrics

- **WER (normalized)** on **Common Voice 24.0 Serbian test**: **6.59%**
- Text normalization used for WER:
  - punctuation removed
  - lowercased
  - Cyrillic → Latin conversion
  - numbers converted to words
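
The normalization steps listed above can be sketched roughly as follows. This is a minimal illustration, not the evaluation script used for this model: the function names (`normalize`, `number_to_words`, `wer`), the order in which the steps are applied, and the number handling (only 0 to 99, larger numbers left as digits) are all assumptions made for the example.

```python
import re

# Serbian Cyrillic -> Latin, lowercase only (text is lowercased first).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}

ONES = [
    "nula", "jedan", "dva", "tri", "četiri", "pet", "šest", "sedam", "osam",
    "devet", "deset", "jedanaest", "dvanaest", "trinaest", "četrnaest",
    "petnaest", "šesnaest", "sedamnaest", "osamnaest", "devetnaest",
]
TENS = {
    2: "dvadeset", 3: "trideset", 4: "četrdeset", 5: "pedeset",
    6: "šezdeset", 7: "sedamdeset", 8: "osamdeset", 9: "devedeset",
}


def number_to_words(match):
    """Hypothetical helper: spell out 0-99; larger numbers stay as digits."""
    n = int(match.group())
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"
    return match.group()


def normalize(text):
    """Lowercase, transliterate, spell out numbers, strip punctuation."""
    text = text.lower()
    text = "".join(CYR_TO_LAT.get(ch, ch) for ch in text)
    text = re.sub(r"\d+", number_to_words, text)
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    return " ".join(text.split())         # collapse whitespace


def wer(reference, hypothesis):
    """Word error rate of one utterance via word-level edit distance."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)
```

With this sketch, a Cyrillic reference and a Latin hypothesis with the same words compare as equal, which is the point of normalizing before scoring. A corpus-level WER would normally sum edit counts and reference lengths over all utterances rather than average per-utterance values.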
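
For reference, the learning-rate schedule listed under Training procedure (peak 6e-5, 50 warmup steps, cosine decay to 1e-7) might look like the sketch below. The linear shape of the warmup, the per-step granularity, and the `total_steps` parameter are assumptions; the card does not state them, and `lr_at_step` is a hypothetical name.

```python
import math


def lr_at_step(step, total_steps, peak_lr=6e-5, min_lr=1e-7, warmup_steps=50):
    """Learning rate at a 0-indexed optimizer step.

    Assumes linear warmup to peak_lr, then cosine decay to min_lr
    over the remaining steps.
    """
    if step < warmup_steps:
        # Linear warmup: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For example, `lr_at_step(0, 1000)` gives a small fraction of the peak rate, the rate reaches about 6e-5 at the end of warmup, and it settles at 1e-7 once `step` reaches `total_steps`.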