---
library_name: transformers
license: apache-2.0
language:
- en
- hi
- bn
- mr
- ta
- te
base_model: distil-whisper/distil-large-v3
tags:
- whisper
- speech-recognition
- multilingual
- automatic-speech-recognition
- hindi
- bengali
- marathi
- tamil
- telugu
- english
- distil-whisper
- indian-languages
datasets:
- custom-multilingual-dataset
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
model-index:
- name: whisper-multilang-finetuned
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: custom-multilingual-dataset
      name: Custom Multilingual Dataset
    metrics:
    - type: wer
      value: 27.08
      name: Word Error Rate
    - type: wer
      value: 26.73
      name: Best WER
widget:
- example_title: "Hindi Speech Recognition"
  text: "मैं आज बाजार जा रहा हूं"
- example_title: "Bengali Speech Recognition"
  text: "আমি আজ বাজারে যাচ্ছি"
- example_title: "English Speech Recognition"
  text: "I am going to the market today"
---
# Whisper Multilingual Fine-tuned Model
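This is a fine-tuned version of Distil-Whisper Large V3 (`distil-whisper/distil-large-v3`), a distilled variant of OpenAI's Whisper, for multilingual speech recognition in English and five Indian languages.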
## Supported Languages
- English (en)
- Hindi (hi)
- Bengali (bn)
- Marathi (mr)
- Tamil (ta)
- Telugu (te)
## Model Details
- **Base Model**: Distil-Whisper Large V3 (`distil-whisper/distil-large-v3`)
- **Fine-tuned on**: Custom multilingual dataset
- **Training Framework**: Transformers
- **Model Type**: Speech-to-Text
## Usage
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa
# Load model and processor
processor = WhisperProcessor.from_pretrained("TheKingMonarch/whisper-multilang-finetuned")
model = WhisperForConditionalGeneration.from_pretrained("TheKingMonarch/whisper-multilang-finetuned")
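# Clear forced decoder IDs stored in the generation config so they do not
# conflict with the language/task chosen at generation time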
model.generation_config.forced_decoder_ids = None
# Load audio
audio, _ = librosa.load("audio.wav", sr=16000)
# Transcribe
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
predicted_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
## Language-specific Usage
```python
# For a specific language (e.g., Hindi), pass language and task to generate().
# Reuses `model` and `inputs` from the Usage example above; recent transformers
# releases deprecate passing forced_decoder_ids to generate() directly.
predicted_ids = model.generate(inputs.input_features, language="hi", task="transcribe")
```
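When the language code comes from user input, it can help to check it against the languages listed on this card before calling `generate()`. The following is a minimal sketch; `SUPPORTED_LANGUAGES` and `transcribe_kwargs` are hypothetical helper names, not part of this model or of transformers.

```python
# Hypothetical helper: language codes taken from the "Supported Languages" list above.
SUPPORTED_LANGUAGES = {
    "en": "English",
    "hi": "Hindi",
    "bn": "Bengali",
    "mr": "Marathi",
    "ta": "Tamil",
    "te": "Telugu",
}


def transcribe_kwargs(language: str) -> dict:
    """Return keyword arguments for generate(), rejecting unlisted languages."""
    code = language.strip().lower()
    if code not in SUPPORTED_LANGUAGES:
        supported = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(
            f"Unsupported language {language!r}; expected one of: {supported}"
        )
    return {"language": code, "task": "transcribe"}
```

With the objects from the Usage example, this would be called as `model.generate(inputs.input_features, **transcribe_kwargs("hi"))`.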
## Training Details
- Fine-tuned on a custom multilingual speech dataset
- Optimized for Indian languages and English
- **Training Steps**: 600
- **Final WER**: 27.08% (step 600)
- **Best WER achieved**: 26.73% (step 550)
### Training Metrics
| Step | Training Loss | Validation Loss | WER (%) |
|------|---------------|-----------------|---------|
| 50 | 2.075000 | 1.930286 | 133.45 |
| 100 | 1.206600 | 1.275027 | 89.54 |
| 150 | 0.793800 | 0.712475 | 93.42 |
| 200 | 0.528700 | 0.562679 | 88.92 |
| 250 | 0.379900 | 0.473467 | 89.27 |
| 300 | 0.289400 | 0.369892 | 69.88 |
| 350 | 0.244300 | 0.291235 | 49.58 |
| 400 | 0.268800 | 0.249055 | 42.80 |
| 450 | 0.122200 | 0.209867 | 36.29 |
| 500 | 0.084700 | 0.173593 | 31.44 |
| 550 | 0.073400 | 0.155249 | **26.73** |
| 600 | 0.044300 | 0.148559 | 27.08 |
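
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the hypothesis (substitutions + deletions + insertions) divided by the number of reference words. Because insertions are counted, it can exceed 100%, as in the step 50 row. The sketch below is a plain-Python illustration of that definition; this card does not state which tool or text normalization was used for the reported numbers, so `word_error_rate` is an illustrative name and its whitespace tokenization is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level edit distance / number of reference words."""
    ref = reference.split()  # assumption: simple whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of seven reference words.
    print(word_error_rate("I am going to the market today",
                          "I am going to market today"))
    # Three inserted words against a one-word reference gives a WER above 100%.
    print(word_error_rate("hello", "hello hello there now"))
```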
### Training Configuration
- **Base Model**: `distil-whisper/distil-large-v3`
- **Learning Rate**: Optimized during training
- **Batch Size**: Configured for optimal performance
- **Training Duration**: 600 steps
- **Evaluation Strategy**: Every 50 steps
- **Early Stopping**: Based on WER improvement
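
The evaluation schedule and WER-based stopping rule can be illustrated with the WER values from the Training Metrics table. This is a sketch only: the actual stopping criterion and its patience are not documented here, so `best_checkpoint`, `should_stop`, and the `patience` parameter are assumptions made for illustration.

```python
# (step, WER %) pairs copied from the Training Metrics table.
WER_HISTORY = [
    (50, 133.45), (100, 89.54), (150, 93.42), (200, 88.92),
    (250, 89.27), (300, 69.88), (350, 49.58), (400, 42.80),
    (450, 36.29), (500, 31.44), (550, 26.73), (600, 27.08),
]


def best_checkpoint(history):
    """Return the (step, wer) pair with the lowest WER."""
    return min(history, key=lambda item: item[1])


def should_stop(history, patience):
    """True if none of the last `patience` evaluations improved on the best
    WER seen before them. `patience` is a hypothetical setting."""
    if patience < 1:
        raise ValueError("patience must be at least 1")
    if len(history) <= patience:
        return False
    wers = [wer for _, wer in history]
    best_before = min(wers[:-patience])
    return min(wers[-patience:]) >= best_before


if __name__ == "__main__":
    print(best_checkpoint(WER_HISTORY))
    print(should_stop(WER_HISTORY, patience=1))
```

Under these assumptions, the lowest-WER evaluation is step 550, and a patience of one evaluation would trigger a stop at step 600, since 27.08% does not improve on 26.73%.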
## Limitations
- Performance may vary across different accents and dialects
- Best results on clear audio with minimal background noise
- Optimized for the specific languages listed above
## Citation
If you use this model, please cite:
```
@misc{whisper-multilang-finetuned,
  author = {Your Name},
  title = {Whisper Multilingual Fine-tuned Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/TheKingMonarch/whisper-multilang-finetuned}
}
```