---
library_name: peft
license: apache-2.0
datasets:
- PogusTheWhisper/tingwry-asr-th_th-noise-augmented
language:
- th
- en
metrics:
- cer
base_model:
- nectec/Pathumma-whisper-th-large-v3
pipeline_tag: automatic-speech-recognition
---

# Pathumma Whisper Large V3 (TH) - Natural Noise-Robust Finetuned (v4, LoRA)

## Model Description

This model is a Thai Automatic Speech Recognition (ASR) system based on [`nectec/Pathumma-whisper-th-large-v3`](https://huggingface.co/nectec/Pathumma-whisper-th-large-v3), fine-tuned with LoRA (Low-Rank Adaptation) to improve robustness in noisy environments.

It uses `WhisperForConditionalGeneration` with SpecAugment to improve performance on real-world noisy and spontaneous Thai speech, and gradient checkpointing to reduce memory use during training. Training was done on a custom dataset simulating voice messages, ambient sound, and conversational noise.

---

## Dataset

- **Name**: [`tingwry/asr-augmented`](https://huggingface.co/datasets/tingwry/asr-augmented)
- **Description**: Thai ASR dataset augmented with realistic background noise (e.g., voice messages, ambient environments) to simulate common recording conditions.
---

## Quickstart

```python
import torch
from transformers import pipeline

# Use the GPU with bf16 when available, otherwise fall back to CPU with fp32.
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

lang = "th"
task = "transcribe"

pipe = pipeline(
    task="automatic-speech-recognition",
    model="PogusTheWhisper/Pathumma-whisper-th-large-v3-natural-noise-finetuned",
    device=device,
    torch_dtype=torch_dtype,
    chunk_length_s=30,
    return_timestamps=False,
)
# Force Thai transcription instead of relying on language auto-detection.
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task=task)

audio_path = ""  # set this to the path of your audio file
result = pipe(audio_path)
print("Full Transcription:\n", result["text"])
```

---

## Model Architecture

- Base Model: `nectec/Pathumma-whisper-th-large-v3`
- Adapter Type: LoRA
- Target Modules: `q_proj`, `k_proj`, `v_proj`
- LoRA Config:
  - `r=8`
  - `lora_alpha=32`
  - `lora_dropout=0.1`

### SpecAugment

- `mask_time_prob = 0.2`
- `mask_feature_prob = 0.2`

---

## Training Arguments

- Epochs: 8
- Learning Rate: `2e-5`
- Scheduler: Cosine
- Warmup Ratio: `0.05`
- Batch Size: 4 (per device)
- Precision: bf16
- Optimizer: AdamW (fused)
- Gradient Checkpointing: Enabled
- Metric: CER
- Generation Max Length: 256
- Generation Beams: 5

---

## Training Results

| Epoch | Training Loss | Validation Loss | CER      | WER      |
|-------|---------------|-----------------|----------|----------|
| 1     | 0.049300      | 0.022428        | 0.052511 | 0.124607 |
| 2     | 0.017500      | 0.015223        | 0.051452 | 0.100236 |
| 3     | 0.012900      | 0.012217        | 0.049419 | 0.092767 |
| 4     | 0.009900      | 0.010561        | 0.049024 | 0.091588 |
| 5     | 0.007500      | 0.010173        | 0.048868 | 0.087657 |
| 6     | 0.007200      | 0.009647        | 0.050930 | 0.086478 |
| 7     | 0.006700      | 0.009532        | 0.051565 | 0.087264 |
| 8     | 0.006400      | 0.009492        | 0.047598 | 0.086478 |

---

## Evaluation Performance (in Percentage)

### CER

| model | samples | SEACrowd/gowajee | SEACrowd/thai_elderly_speech | fsicoli/common_voice_18_0 | google/fleurs | tingwry/asr-augmented |
|:-----------------------------------------------------|----------:|-------------------:|-------------------------------:|----------------------------:|----------------:|------------------------:|
| whisper-large-v3                                     |       388 |              37.82 |                           5.24 |                        9.25 |           10.95 |                    4.58 |
| pathumma-whisper-th-large-v3-natural-noise-finetuned |       388 |               2.18 |                           0.84 |                        4.73 |            7.21 |                    1.3  |
| airesearch-wav2vec2-large-xlsr-53-th                 |       388 |              30.31 |                           3.83 |                        6.49 |           12.84 |                    8.19 |
| pathumma-whisper-th-large-v3                         |       388 |               1.27 |                           0.5  |                        4.75 |            7.39 |                    4.57 |
| monsoon-whisper-medium-gigaspeech2                   |       388 |              30.31 |                           3.83 |                        6.49 |           12.84 |                    8.19 |
| thonburian-whisper-th-large-v3-combined              |       388 |               8.61 |                           0.81 |                        5.8  |            7.45 |                    2.71 |

### WER

| model                                                |   samples |   SEACrowd/gowajee |   SEACrowd/thai_elderly_speech |   fsicoli/common_voice_18_0 |   google/fleurs |   tingwry/asr-augmented |
|:-----------------------------------------------------|----------:|-------------------:|-------------------------------:|----------------------------:|----------------:|------------------------:|
| whisper-large-v3                                     |       388 |              94.1  |                          96.91 |                       78.84 |           87.97 |                   74.12 |
| pathumma-whisper-th-large-v3-natural-noise-finetuned |       388 |               8.23 |                          19.33 |                       69.1  |           69.39 |                    7.15 |
| airesearch-wav2vec2-large-xlsr-53-th                 |       388 |              99.58 |                          38.92 |                       67.79 |           99.63 |                  100    |
| pathumma-whisper-th-large-v3                         |       388 |               4.37 |                           5.41 |                       80.34 |           71.13 |                   90.02 |
| monsoon-whisper-medium-gigaspeech2                   |       388 |              99.58 |                          38.92 |                       67.79 |           99.63 |                  100    |
| thonburian-whisper-th-large-v3-combined              |       388 |              39.84 |                          11.08 |                      110.67 |           66.33 |                   49.85 |

---

## Limitations and Future Work

- Trained on Thai-only speech, not multilingual
- Evaluated with CER and WER; Thai word segmentation metrics will be explored in future versions
- May not generalize well to regional dialects or highly degraded audio
- Future improvements may include domain adaptation (e.g., medical, legal) and dialect-specific tuning

---

## Acknowledgements

- NECTEC for the original base model
- OpenAI for the Whisper architecture
- SuperAI Engineer Program for mentor support
- ThaiSC (NSTDA Supercomputer Center) for GPU compute on the LANTA cluster
- Special thanks to P'Tik, P'Joe, P'Sam, P'nut and P'Earth
- And The Scamper SS5 House

---

*Built with `peft==0.15.2` and `transformers==4.x`*
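
For readers comparing the CER and WER tables, the sketch below shows how the two metrics are commonly defined: edit distance between reference and hypothesis, divided by the reference length, counted over characters for CER and over tokens for WER. It is a plain-Python illustration, not the evaluation code used for this card. The whitespace tokenization in `wer` and the choice to keep spaces in `cer` are assumptions, and the function names are introduced here for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edits per reference character (spaces kept, by assumption)."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def wer(reference, hypothesis):
    """Word error rate with whitespace tokenization (an assumption, see note below)."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Insertions count against the hypothesis, so the rate can exceed 1.0 (100%).
print(wer("a b", "a x y z"))  # 1.5

# Thai is written without spaces between words, so whitespace tokenization
# treats this whole phrase as one "word": one dropped character makes WER 1.0.
print(wer("สวัสดีครับ", "สวัสดีคับ"), cer("สวัสดีครับ", "สวัสดีคับ"))
```

Because the error count is divided by the reference length only, a hypothesis with many insertions can score above 100%, which is consistent with values such as 110.67 in the WER table. The Thai example also suggests why WER depends heavily on how text is segmented, which is the motivation for the word segmentation metrics mentioned under Limitations and Future Work.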