🎙️ Arabic Phoneme ASR — wav2vec2-large-xlsr-53
Fine-tuned facebook/wav2vec2-large-xlsr-53 for phoneme-level Arabic speech recognition using CTC, trained on the tunis-ai/arabic_speech_corpus dataset.
Model Description
This model transcribes Arabic speech directly into a sequence of phoneme tokens rather than graphemes or words. It is intended for linguistic analysis, pronunciation research, and downstream tasks that benefit from sub-word acoustic representations.
The base model's convolutional feature encoder is frozen; only the transformer layers and a freshly initialized CTC head (61 phoneme tokens) are fine-tuned.
Training Details
| Parameter | Value |
|---|---|
| Base model | facebook/wav2vec2-large-xlsr-53 |
| Dataset | tunis-ai/arabic_speech_corpus |
| Target label | phonetic column (phoneme sequences) |
| Sample rate | 16 kHz |
| Epochs | 40 (early stopping, patience = 10) |
| Effective batch size | 16 (2 × 8 gradient accumulation) |
| Learning rate | 1e-4 |
| Warmup steps | 500 |
| Precision | fp16 |
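Two of the settings above are easier to see in code: the effective batch size and the early-stopping rule. The sketch below is an illustration only, not the training script. It assumes a single device with a per-device batch of 2 and 8 accumulation steps, and it assumes that a lower monitored metric is better; the card does not say which metric was monitored. `EarlyStopper` is a hypothetical helper name.

```python
# Effective batch size: samples seen per optimizer step
# (assumes one device; 2 samples per forward pass, 8 accumulation steps).
per_device_batch_size = 2
gradient_accumulation_steps = 8
effective_batch_size = per_device_batch_size * gradient_accumulation_steps


class EarlyStopper:
    """Hypothetical helper: stop after `patience` evaluations without improvement."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, metric):
        # A strictly lower metric counts as an improvement and resets the counter.
        if metric < self.best:
            self.best = metric
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


print(effective_batch_size)  # 16
```

With `patience=10`, training would stop only after ten consecutive evaluations that fail to beat the best value seen so far.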
Data Splits
The original train/test splits were pooled and re-split deterministically (seed 42):
| Split | Samples | % |
|---|---|---|
| Train | 1,530 | 80% |
| Validation | 191 | 10% |
| Test | 192 | 10% |
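A deterministic 80/10/10 split of the 1,913 pooled utterances can be sketched as follows. This is a minimal stand-in using the standard library; the card does not state which splitting routine was used, so `split_indices` and its rounding behaviour (floor for train and validation, remainder to test) are assumptions chosen to reproduce the sizes in the table.

```python
import random


def split_indices(n_samples, seed=42, train_frac=0.8, val_frac=0.1):
    """Shuffle indices with a fixed seed and cut them into train/val/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_frac)
    n_val = int(n_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1913)
print(len(train_idx), len(val_idx), len(test_idx))  # 1530 191 192
```

Because the shuffle uses its own seeded `random.Random` instance, repeated calls return the same partition regardless of other random state in the program.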
Preprocessing
- Audio resampled to 16 kHz
- Phoneme sequences cleaned: punctuation removed, lowercased, whitespace normalised
- Gemination (doubled tokens like `bb`, `dd`) mapped to their single form to reduce vocabulary size
- Noise/distortion marker `dist` mapped to `sil`
- Phonemes joined with `|` (word delimiter token) before tokenisation to prevent character-level splitting
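The text-side steps above might look roughly like this. It is a sketch under stated assumptions, not the original preprocessing code: `GEMINATE_MAP` contains only the two examples named in the list, and the punctuation set is a guess, chosen so that symbols used as phoneme labels (`$`, `*`, `-`, `^`, `'`) are left untouched.

```python
import re

# Illustrative subset: only the examples named in the card.
GEMINATE_MAP = {"bb": "b", "dd": "d"}
# Assumed punctuation set; phoneme symbols such as $ * - ^ ' must survive.
PUNCTUATION = re.compile(r"[.,;:!?]")


def clean_phonemes(raw):
    """Normalise one phoneme string and join its tokens with '|'."""
    text = PUNCTUATION.sub(" ", raw).lower()
    tokens = text.split()  # split() also collapses repeated whitespace
    tokens = [GEMINATE_MAP.get(tok, tok) for tok in tokens]
    tokens = ["sil" if tok == "dist" else tok for tok in tokens]
    return "|".join(tokens)


print(clean_phonemes("sil  W a bb a' dist dd ii0 . sil"))
# sil|w|a|b|a'|sil|d|ii0|sil
```

Joining with `|` keeps multi-character labels such as `ii0` or `sil` intact, since a tokenizer can then split on the delimiter instead of on individual characters.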
Vocabulary
The model uses a 59-phoneme vocabulary plus three special tokens (|, [UNK], [PAD]), for a total of 61 tokens.
Phoneme inventory after normalisation:
$ $$ * ** - ^ ^^ a a' aa aa' ah b d e ee f g h i i0 i0' i1 i1'
ii0 ii0' ii1 ii1' j k l m n p pp q r s sh sil t th u u0 u0' u1
u1' uu0 uu0' uu1 uu1' v w x y z
Evaluation Results
Evaluated on the held-out test set (192 samples):
| Metric | Score |
|---|---|
| PER (Phoneme Error Rate) | 3.95% |
| CER (Character Error Rate) | 2.93% |
| Test Loss | 0.0505 |
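PER is conventionally the token-level edit distance between reference and prediction, divided by the number of reference tokens. The sketch below shows that definition on space-separated phoneme strings; it is not the evaluation script behind the numbers above, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(references, hypotheses):
    """Total edit distance divided by total reference length."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = ref.split()
        errors += edit_distance(ref_tokens, hyp.split())
        total += len(ref_tokens)
    return errors / total


refs = ["sil w a r a' sil"]
hyps = ["sil w aa r a"]
print(phoneme_error_rate(refs, hyps))  # 0.5 (two substitutions, one deletion)
```

Errors are pooled over the whole set before dividing, so long utterances weigh more than short ones.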
Training Curve Summary
The model converged rapidly:
- Epochs 1–4: high loss, near-random output
- Epoch 5: PER drops from ~95% to ~38%
- Epoch 6: PER drops to 7.5%
- Epochs 7–40: steady improvement to final PER of 3.95%
Error Analysis
In an analysis of 100 test samples, the most common errors are vowel length and stress confusions: the model occasionally substitutes short vowels for long ones, and vice versa. No consonant confusions appear among the top 15 substitutions.
Top substitutions (reference → prediction):
| Reference | Predicted | Count |
|---|---|---|
| a | aa | 9× |
| a | a' | 7× |
| a' | a | 5× |
| aa' | aa | 5× |
| i0 | ii0 | 5× |
Most common deletion: sil boundary tokens (96× in 100 samples). These are silence/pause markers at utterance boundaries; their deletion has minimal impact on phonemic content.
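Substitution and deletion counts like these can be obtained by aligning each reference with its prediction and tallying the edit operations. The following is one possible way to do it, not the analysis code used here. `align_ops` is a hypothetical helper, and when several minimal alignments exist its tie-breaking (diagonal first, then deletion) decides which one is reported.

```python
from collections import Counter


def align_ops(ref, hyp):
    """Return (op, ref_token, hyp_token) tuples from a Levenshtein alignment."""
    n, m = len(ref), len(hyp)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)
    # Walk back from the bottom-right corner to recover the operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
            ops.append(("match" if same else "sub", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    ops.reverse()
    return ops


substitutions = Counter()
deletions = Counter()
pairs = [("sil w a r a' sil", "w aa r a'")]  # toy (reference, prediction) pair
for ref, hyp in pairs:
    for op, r, h in align_ops(ref.split(), hyp.split()):
        if op == "sub":
            substitutions[(r, h)] += 1
        elif op == "del":
            deletions[r] += 1

print(substitutions.most_common(5))  # [(('a', 'aa'), 1)]
print(deletions.most_common(1))      # [('sil', 2)]
```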
Usage
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch
model_path = "path/to/wav2vec2-arabic-phoneme-final"
processor = Wav2Vec2Processor.from_pretrained(model_path)
model = Wav2Vec2ForCTC.from_pretrained(model_path)
model.eval()
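# audio_array: 1-D NumPy float array of mono audio sampled at 16 kHz
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt", padding=True)
# Inference only, so gradient tracking is disabled
with torch.no_grad():
    logits = model(**inputs).logits
# Greedy CTC decoding: pick the most likely token for each frame
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.decode(predicted_ids[0])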
# Convert delimiter back to spaces
phonemes = transcription.replace("|", " ").strip()
print(phonemes)
# e.g. → "sil w a r a' jj a h a tt a q r ii0' r u0 ..."
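For readers unfamiliar with CTC, the decode step conceptually merges consecutive repeated frame predictions and then drops the blank token. The toy sketch below illustrates that rule in plain Python. It is a simplification, not the library's implementation; the token ids are hypothetical, and it assumes `[PAD]` plays the role of the CTC blank.

```python
def ctc_greedy_collapse(frame_ids, blank_id):
    """Merge consecutive repeats, then drop blank frames."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


# Hypothetical toy vocabulary; id 0 is assumed to be the blank.
id_to_token = {0: "[PAD]", 1: "|", 2: "sil", 3: "w", 4: "a"}
frames = [2, 2, 0, 1, 3, 3, 0, 1, 1, 4, 4, 0]  # per-frame argmax ids

ids = ctc_greedy_collapse(frames, blank_id=0)
text = "".join(id_to_token[i] for i in ids)
print(text.replace("|", " ").strip())  # sil w a
```

A blank between two identical ids keeps them as separate tokens, which is how CTC can emit the same phoneme twice in a row.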
Limitations
- Trained on a single corpus (1,913 utterances); performance on out-of-domain speakers or dialects may degrade.
- The model over-generates `sil` and `$` tokens at utterance boundaries; strip trailing occurrences in post-processing if needed.
- Vowel length and stress diacritics (`'`, `0`, `1`) represent the most frequent error class. Tasks that do not distinguish length/stress can benefit from collapsing these distinctions in post-processing.
- The dataset's `text` column contains Buckwalter transliteration, not Arabic Unicode; this model is trained solely on the `phonetic` column and produces phoneme sequences, not orthographic Arabic script.
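The two post-processing suggestions above could be sketched as follows. Both helpers are hypothetical, and the collapsing rule rests on an assumed reading of the notation: `'`, `0`, and `1` are treated as removable marks, and a doubled vowel letter (`aa`, `ii`, `uu`, `ee`) is treated as the long form of the single vowel. Check this against the corpus documentation before relying on it.

```python
BOUNDARY_TOKENS = {"sil", "$"}
MARKS = "'01"        # assumed length/stress marks
VOWELS = "aiue"      # assumed vowel letters


def strip_trailing_boundary_tokens(tokens):
    """Drop sil/$ tokens from the end of the sequence only."""
    end = len(tokens)
    while end > 0 and tokens[end - 1] in BOUNDARY_TOKENS:
        end -= 1
    return tokens[:end]


def collapse_length_and_stress(token):
    """Map e.g. a', aa, ii0' to a bare short vowel; leave other tokens alone."""
    base = "".join(ch for ch in token if ch not in MARKS)
    if not base:
        return token  # token consisted only of marks; keep it unchanged
    if len(base) == 2 and base[0] == base[1] and base[0] in VOWELS:
        return base[0]
    return base


tokens = "w a' r aa ii0' u1 pp sil $ sil".split()
tokens = strip_trailing_boundary_tokens(tokens)
print(" ".join(collapse_length_and_stress(t) for t in tokens))
# w a r a i u pp
```

Note that `pp` is left untouched because only doubled vowel letters are collapsed.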
Citation
If you use this model, please cite the base model and dataset:
@misc{conneau2020unsupervised,
title={Unsupervised Cross-lingual Representation Learning for Speech Recognition},
author={Conneau, Alexis and others},
year={2020},
eprint={2006.13979},
archivePrefix={arXiv}
}
License
Inherits the license of facebook/wav2vec2-large-xlsr-53. Please review the original model card before use in commercial applications.