---
language:
- dv
- en
- ar
license: apache-2.0
tags:
- whisper
- dhivehi
- code-switching
- automatic-speech-recognition
base_model: openai/whisper-small
pipeline_tag: automatic-speech-recognition
---

# Whisper Dhivehi Code-Switching ASR

A fine-tune of `openai/whisper-small` for Dhivehi speech that code-switches into English and Arabic.
The tokenizer is extended with a custom `<|dv|>` language token.

## Usage

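~~~python
import torch
from transformers import pipeline

# Use the first GPU when available, otherwise fall back to CPU.
device = 0 if torch.cuda.is_available() else -1

asr = pipeline(
    task="automatic-speech-recognition",
    model="Serialtechlab/whisper-dhivehi-code-switch-v2",
    device=device,
    chunk_length_s=10,        # split long audio into 10 s windows
    stride_length_s=(1, 1),   # 1 s of overlap on each side of a window
    generate_kwargs={"num_beams": 3, "repetition_penalty": 1.05},
)

# Accepts a path to an audio file; the transcript is under the "text" key.
result = asr("audio.wav")
print(result["text"])
~~~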
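
To make the `chunk_length_s=10` and `stride_length_s=(1, 1)` settings more concrete, here is a rough sketch of how long audio gets cut into overlapping windows. It is an approximation for illustration, not the pipeline's actual code, and `chunk_windows` is a hypothetical helper that does not exist in `transformers`.

~~~python
def chunk_windows(duration_s, chunk_length_s=10.0,
                  stride_left_s=1.0, stride_right_s=1.0):
    """Return (start, end) times in seconds for each overlapping window.

    Illustrative assumption: consecutive windows advance by
    chunk_length_s - stride_left_s - stride_right_s seconds.
    """
    step = chunk_length_s - stride_left_s - stride_right_s
    if step <= 0:
        raise ValueError("chunk length must exceed the combined strides")

    windows = []
    start = 0.0
    while start < duration_s:
        # Clip the final window to the end of the audio.
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if start + chunk_length_s >= duration_s:
            break  # this window already reaches the end of the audio
        start += step
    return windows


# Example: a 25 s recording with the settings used above.
for start, end in chunk_windows(25.0):
    print(f"{start:.0f} s to {end:.0f} s")
~~~

With these settings a 25 s recording yields three windows (0 to 10 s, 8 to 18 s, 16 to 25 s), so neighbouring windows share 2 s of audio. That overlap gives the pipeline context for stitching the per-window transcripts together; a larger stride means more redundancy and more compute per second of audio.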

## Training data

Fine-tuned on a synthetic code-switched dataset combining:
- Dhivehi: `Serialtechlab/dhivehi-mms-v5-combined`, `dhivehi-tts-preprocessed`, `dv-syn-female2-for-tts`
- English/Arabic loan words: `google/fleurs` (`en_us`, `ar_eg`)

Training ran for 20,000 steps starting from the `openai/whisper-small` checkpoint, with the custom `<|dv|>` language token added to the tokenizer.