How to use istomin9192/whisper-small-sr with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="istomin9192/whisper-small-sr")

# Load model directly
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("istomin9192/whisper-small-sr")
model = AutoModelForSpeechSeq2Seq.from_pretrained("istomin9192/whisper-small-sr")
```
---
license: apache-2.0
language:
- sr
base_model:
- openai/whisper-small
datasets:
- google/fleurs
- Sagicc/audio-lmb-ds
- espnet/yodas_owsmv4
- classla/ParlaSpeech-RS
metrics:
- wer
model-index:
- name: Whisper Small
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 24.0
      type: mozilla-foundation/common_voice_24_0
      config: sr
      split: test
      args: sr
    metrics:
    - name: Wer
      type: wer
      value: 0.065924219787
library_name: transformers
---
# whisper-small-sr

A fine-tuned version of **OpenAI Whisper Small** (`openai/whisper-small`) for Serbian automatic speech recognition.

**Output script:** this model is intended to produce **Serbian Latin** only.

- **WER** on Common Voice 24.0 Serbian test: **6.59%**
## Model description

## Training and evaluation data

This model was fine-tuned on a **mixture of publicly available Serbian speech corpora**, including:

- Mozilla Common Voice 24.0, evaluated on **CV test (sr)**
- FLEURS Serbian
- ParlaSpeech-RS (subset of the full dataset)
- Additional Serbian corpora used in the training pipeline
## Training procedure

- Epochs: 9
- Batch size: 32 / 20
- Optimizer: AdamW
- LR: 6e-5 with warmup (50 steps) + cosine decay to min_lr = 1e-7
- Mixed precision: bfloat16 (fp32 in the final epoch)
- SpecAugment: frequency + time masking
- Sampling: weighted sampling across datasets
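
The learning-rate schedule listed above (peak 6e-5, 50 warmup steps, cosine decay to 1e-7) can be sketched as a plain function of the step index. This is a minimal illustration, not the training code: the linear shape of the warmup, the function name `lr_at_step`, and the `total_steps` value used in the example are assumptions, since the card does not state them.

```python
import math


def lr_at_step(step, total_steps, peak_lr=6e-5, warmup_steps=50, min_lr=1e-7):
    """Learning rate at a given optimizer step (0-indexed).

    Warmup is assumed to be linear; the card only says "warmup (50 steps)".
    """
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # Cosine factor goes from 1 (start of decay) to 0 (end of training).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for s in (0, 49, 50, 500, 1000):
        print(s, lr_at_step(s, total))
```

In a PyTorch setup the same function could be wrapped in `torch.optim.lr_scheduler.LambdaLR` by dividing its result by the peak rate, but that wiring is likewise an assumption about the pipeline.
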
### Training results

| Epoch | Train loss | CV WER |
|------:|-----------:|-------:|
| 1 | 0.333 | 0.1614 |
| 2 | 0.344 | 0.1278 |
| 3 | 0.251 | 0.1112 |
| 4 | 0.202 | 0.1032 |
| 5 | 0.167 | 0.0934 |
| 6 | 0.138 | 0.0790 |
| 7 | 0.118 | 0.0740 |
| 8 | 0.103 | 0.0709 |
| 9 | 0.096 | 0.0659 |
## Evaluation Metrics

- **WER (normalized)** on **Common Voice 24.0 Serbian test**: **6.59%**
- Text normalization used for WER:
  - punctuation removed
  - lowercased
  - Cyrillic → Latin conversion
  - numbers converted to words
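
The normalization steps above can be sketched in pure Python together with a word-level WER. This is an illustrative approximation, not the evaluation script used for the reported number: the helper names `normalize` and `wer` are hypothetical, punctuation is replaced by a space (an assumption), and the number-to-words step is left out because it needs a Serbian number verbalizer.

```python
import re

# Serbian Cyrillic -> Latin, lowercase only (text is lowercased first).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}
_TABLE = str.maketrans(CYR_TO_LAT)


def normalize(text):
    """Lowercase, transliterate to Latin, drop punctuation, squeeze spaces.

    Number-to-words conversion is intentionally omitted in this sketch.
    """
    text = text.lower().translate(_TABLE)
    text = re.sub(r"[^\w\s]", " ", text)  # assumption: punctuation -> space
    return " ".join(text.split())


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / max(1, len(ref))


if __name__ == "__main__":
    print(normalize("Добар дан, Београде!"))
    print(wer("Добар дан, Београде!", "dobar dan"))
```

Because both reference and hypothesis pass through the same `normalize`, a Cyrillic reference and a Latin hypothesis with identical wording score a WER of zero, which matches the Latin-only output the model is intended to produce.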