---
library_name: peft
license: apache-2.0
datasets:
- PogusTheWhisper/tingwry-asr-th_th-noise-augmented
language:
- th
- en
metrics:
- cer
base_model:
- nectec/Pathumma-whisper-th-large-v3
pipeline_tag: automatic-speech-recognition
---

# Pathumma Whisper Large V3 (TH) — Natural Noise-Robust Finetuned (v4, LoRA)

## Model Description

This model is a Thai Automatic Speech Recognition (ASR) system based on [`nectec/Pathumma-whisper-th-large-v3`](https://huggingface.co/nectec/Pathumma-whisper-th-large-v3), enhanced with LoRA (Low-Rank Adaptation) fine-tuning to improve robustness in noisy environments.

It is a `WhisperForConditionalGeneration` model fine-tuned with SpecAugment to improve performance on real-world noisy and spontaneous Thai speech; gradient checkpointing was enabled to reduce memory use during training. The training data is a custom dataset that simulates voice messages, ambient sound, and conversational noise.

---

## Dataset

- **Name**: [`tingwry/asr-augmented`](https://huggingface.co/datasets/tingwry/asr-augmented)  
- **Description**: Thai ASR dataset augmented with realistic background noise (e.g., voice messages, ambient environments) to simulate common recording conditions.

---

## Quickstart

```python
import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

lang = "th"
task = "transcribe"

pipe = pipeline(
    task="automatic-speech-recognition",
    model="PogusTheWhisper/Pathumma-whisper-th-large-v3-natural-noise-finetuned",
    device=device,
    torch_dtype=torch_dtype,
    chunk_length_s=30,
    return_timestamps=False
)

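# Select Thai transcription at generation time. Recent transformers releases
# deprecate assigning config.forced_decoder_ids in favor of these arguments.
generate_kwargs = {"language": lang, "task": task}

audio_path = "<Your wav file>"
result = pipe(audio_path, generate_kwargs=generate_kwargs)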

print("Full Transcription:\n", result["text"])
```

---

## Model Architecture

- Base Model: `nectec/Pathumma-whisper-th-large-v3`
- Adapter Type: LoRA
- Target Modules: `q_proj`, `k_proj`, `v_proj`
- LoRA Config:
  - `r=8`
  - `lora_alpha=32`
  - `lora_dropout=0.1`

### SpecAugment

- `mask_time_prob = 0.2`
- `mask_feature_prob = 0.2`
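
These two names match fields on the Transformers `WhisperConfig`. The sketch below shows how they could be set; it assumes SpecAugment is switched on through `apply_spec_augment`, which this card does not state. It builds a default-sized config only to show the fields, so no weights are downloaded.

```python
from transformers import WhisperConfig

# A default-sized config, used here only to illustrate the SpecAugment fields.
config = WhisperConfig(
    apply_spec_augment=True,  # assumption: SpecAugment is enabled explicitly
    mask_time_prob=0.2,       # masking probability along the time axis
    mask_feature_prob=0.2,    # masking probability along the mel-feature axis
)
print(config.apply_spec_augment, config.mask_time_prob, config.mask_feature_prob)
```

On an already loaded model, the same attributes can be set on `model.config` before training. In the Transformers implementation the masks are applied only in training mode, so inference is unaffected.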

---

## Training Arguments

- Epochs: 8
- Learning Rate: `2e-5`
- Scheduler: Cosine
- Warmup Ratio: `0.05`
- Batch Size: 4 (per device)
- Precision: bf16
- Optimizer: AdamW (fused)
- Gradient Checkpointing: Enabled
- Metric: CER
- Generation Max Length: 256
- Generation Beams: 5
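
For reference, the settings above can be written with `Seq2SeqTrainingArguments` field names. The sketch keeps them in a plain dictionary so that it runs on machines without a bf16-capable GPU. The output path, `predict_with_generate`, and the two model-selection entries are assumptions and do not appear in the list above.

```python
# Keyword arguments mirroring the list above, using Seq2SeqTrainingArguments field names.
training_kwargs = {
    "output_dir": "./whisper-th-lora",  # hypothetical path
    "num_train_epochs": 8,
    "learning_rate": 2e-5,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.05,
    "per_device_train_batch_size": 4,
    "bf16": True,
    "optim": "adamw_torch_fused",
    "gradient_checkpointing": True,
    "predict_with_generate": True,      # assumption: CER is scored on generated text
    "generation_max_length": 256,
    "generation_num_beams": 5,
    "metric_for_best_model": "cer",     # assumption: "Metric: CER" means model selection
    "greater_is_better": False,         # lower CER is better
}

# On a bf16-capable GPU these would be passed as:
#   from transformers import Seq2SeqTrainingArguments
#   args = Seq2SeqTrainingArguments(**training_kwargs)
for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```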

---

## Training Results

| Epoch | Training Loss | Validation Loss | CER      | WER      |
|-------|---------------|-----------------|----------|----------|
| 1     | 0.049300       | 0.022428         | 0.052511| 0.124607 |
| 2     | 0.017500       | 0.015223         | 0.051452| 0.100236 |
| 3     | 0.012900       | 0.012217         | 0.049419| 0.092767 |
| 4     | 0.009900       | 0.010561         | 0.049024| 0.091588 |
| 5     | 0.007500       | 0.010173         | 0.048868| 0.087657 |
| 6     | 0.007200       | 0.009647         | 0.050930| 0.086478 |
| 7     | 0.006700       | 0.009532         | 0.051565| 0.087264 |
| 8     | 0.006400       | 0.009492         | 0.047598| 0.086478 |

---

## Evaluation Performance (in Percentage)

### CER

| model                                                |   samples |   SEACrowd/gowajee |   SEACrowd/thai_elderly_speech |   fsicoli/common_voice_18_0 |   google/fleurs |   tingwry/asr-augmented |
|:-----------------------------------------------------|----------:|-------------------:|-------------------------------:|----------------------------:|----------------:|------------------------:|
| whisper-large-v3                                     |       388 |              37.82 |                           5.24 |                        9.25 |           10.95 |                    4.58 |
| pathumma-whisper-th-large-v3-natural-noise-finetuned |       388 |               2.18 |                           0.84 |                        4.73 |            7.21 |                    1.3  |
| airesearch-wav2vec2-large-xlsr-53-th                 |       388 |              30.31 |                           3.83 |                        6.49 |           12.84 |                    8.19 |
| pathumma-whisper-th-large-v3                         |       388 |               1.27 |                           0.5  |                        4.75 |            7.39 |                    4.57 |
| monsoon-whisper-medium-gigaspeech2                   |       388 |              30.31 |                           3.83 |                        6.49 |           12.84 |                    8.19 |
| thonburian-whisper-th-large-v3-combined              |       388 |               8.61 |                           0.81 |                        5.8  |            7.45 |                    2.71 |

### WER

| model                                                |   samples |   SEACrowd/gowajee |   SEACrowd/thai_elderly_speech |   fsicoli/common_voice_18_0 |   google/fleurs |   tingwry/asr-augmented |
|:-----------------------------------------------------|----------:|-------------------:|-------------------------------:|----------------------------:|----------------:|------------------------:|
| whisper-large-v3                                     |       388 |              94.1  |                          96.91 |                       78.84 |           87.97 |                   74.12 |
| pathumma-whisper-th-large-v3-natural-noise-finetuned |       388 |               8.23 |                          19.33 |                       69.1  |           69.39 |                    7.15 |
| airesearch-wav2vec2-large-xlsr-53-th                 |       388 |              99.58 |                          38.92 |                       67.79 |           99.63 |                  100    |
| pathumma-whisper-th-large-v3                         |       388 |               4.37 |                           5.41 |                       80.34 |           71.13 |                   90.02 |
| monsoon-whisper-medium-gigaspeech2                   |       388 |              99.58 |                          38.92 |                       67.79 |           99.63 |                  100    |
| thonburian-whisper-th-large-v3-combined              |       388 |              39.84 |                          11.08 |                      110.67 |           66.33 |                   49.85 |
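
For readers unfamiliar with the two metrics, the sketch below shows the usual definition: edit distance between reference and hypothesis divided by the reference length (multiply by 100 for the percentages in the tables). It is an illustration, not the evaluation script used for this card. Thai text has no spaces between words, so WER depends on a word segmenter; this card does not say which one was used, so `wer` here takes lists that are already tokenized.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def cer(ref_text, hyp_text):
    """Character error rate over Unicode code points."""
    return error_rate(list(ref_text), list(hyp_text))


def wer(ref_words, hyp_words):
    """Word error rate over pre-tokenized word lists."""
    return error_rate(ref_words, hyp_words)


print(cer("kitten", "sitting"))     # 0.5 (3 edits / 6 reference characters)
print(wer(["a"], ["a", "b", "c"]))  # 2.0 (2 insertions / 1 reference word)
```

Because insertions count as errors, the rate is not capped at 1.0, which is how a WER above 100% (such as the 110.67 entry above) can arise.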

---

## Limitations and Future Work

- Trained on Thai speech only; the model is not multilingual
- Evaluated using CER and WER; Thai word segmentation metrics will be explored in future versions
- May not generalize well to regional dialects or highly degraded audio
- Future improvements may include domain adaptation (e.g., medical, legal) and dialect-specific tuning

---

## Acknowledgements

- NECTEC for the original base model  
- OpenAI for Whisper architecture  
- SuperAI Engineer Program for mentor support  
- ThaiSC (NSTDA Supercomputer Center) for GPU compute on the LANTA cluster
- Special thanks to P'Tik, P'Joe, P'Sam, P'nut and P'Earth
- And The Scamper SS5 House

---

*Built with `peft==0.15.2` and `transformers==4.x`*