---
license: apache-2.0
language:
- sr
base_model:
- openai/whisper-small
datasets:
- google/fleurs
- Sagicc/audio-lmb-ds
- espnet/yodas_owsmv4
- classla/ParlaSpeech-RS
metrics:
- wer
model-index:
- name: Whisper Small
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 24.0
      type: mozilla-foundation/common_voice_24_0
      config: sr
      split: test
      args: sr
    metrics:
    - name: Wer
      type: wer
      value: 0.065924219787
library_name: transformers
---

# whisper-small-sr

A fine-tuned version of **OpenAI Whisper Small** (`openai/whisper-small`) for Serbian automatic speech recognition.

**Output script:** this model is intended to produce **Serbian Latin** only.

- **WER** on Common Voice 24.0 Serbian test: **6.59%**

## Model description


## Training and evaluation data

This model was fine-tuned on a **mixture of publicly available Serbian speech corpora**, including:

- Mozilla Common Voice 24.0 Serbian (`sr`); evaluation uses its **test** split
- FLEURS Serbian
- ParlaSpeech-RS (subset of the full dataset)
- Additional Serbian corpora used in the training pipeline (the card metadata also lists `Sagicc/audio-lmb-ds` and `espnet/yodas_owsmv4`)


## Training procedure

- Epochs: 9
- Batch size: 32 / 20
- Optimizer: AdamW
- LR: 6e-5 with warmup (50 steps) + cosine decay to min_lr = 1e-7
- Mixed precision: bfloat16 (fp32 in the final epoch)
- SpecAugment: frequency + time masking
- Sampling: weighted sampling across datasets
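
The learning-rate schedule listed above can be sketched as a plain function of the optimizer step. This is a minimal illustration, not the training code used for this model: the linear shape of the warmup and the `total_steps` argument are assumptions, while the peak LR, warmup length, and `min_lr` come from the list.

```python
import math


def lr_at_step(step, total_steps, peak_lr=6e-5, warmup_steps=50, min_lr=1e-7):
    """Warmup followed by cosine decay to min_lr.

    Assumptions (not stated in the card): the warmup is linear and
    total_steps is the total number of optimizer steps in the run.
    """
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


if __name__ == "__main__":
    total = 1000  # hypothetical run length, for illustration only
    for s in (0, 49, 50, 500, 1000):
        print(s, f"{lr_at_step(s, total):.3e}")
```

In a PyTorch setup, a function like this could be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning the ratio `lr_at_step(step, total_steps) / peak_lr`.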

### Training results

| Epoch | Train loss | CV test WER |
|------:|-----------:|------------:|
| 1 | 0.333 | 0.1614 |
| 2 | 0.344 | 0.1278 |
| 3 | 0.251 | 0.1112 |
| 4 | 0.202 | 0.1032 |
| 5 | 0.167 | 0.0934 |
| 6 | 0.138 | 0.0790 |
| 7 | 0.118 | 0.0740 |
| 8 | 0.103 | 0.0709 |
| 9 | 0.096 | 0.0659 |

## Evaluation metrics

- **WER (normalized)** on **Common Voice 24.0 Serbian test**: **6.59%**
- Text normalization used for WER:
  - punctuation removed
  - lowercased
  - Cyrillic → Latin conversion
  - numbers converted to words
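
The normalization steps above can be approximated with the standard library alone. The sketch below is an assumption about how such a pipeline might look, not the evaluation script used for this model: the function names are hypothetical, the number-to-words step is omitted because it needs a language-specific library, and the WER is a plain word-level edit distance.

```python
import unicodedata

# Serbian Cyrillic -> Latin, lowercase only (input is lowercased first).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def normalize(text):
    """Lowercase, transliterate to Latin, drop punctuation, squeeze spaces.

    Number-to-words conversion is intentionally left out of this sketch.
    """
    text = text.lower()
    text = "".join(CYR_TO_LAT.get(ch, ch) for ch in text)
    # Unicode categories starting with "P" cover all punctuation marks.
    text = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    return " ".join(text.split())


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / max(1, len(ref))


if __name__ == "__main__":
    print(normalize("Добар дан, Љубљана!"))
    print(wer("Добар дан свима", "dobar dan"))
```

Transliterating both reference and hypothesis before scoring means a Cyrillic reference and a Latin prediction of the same words count as a match, which fits a model intended to output Latin script only.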