🎙️ Arabic Phoneme ASR — wav2vec2-large-xlsr-53

Fine-tuned facebook/wav2vec2-large-xlsr-53 for phoneme-level Arabic speech recognition using CTC, trained on the tunis-ai/arabic_speech_corpus dataset.

Model Description

This model transcribes Arabic speech directly into a sequence of phoneme tokens rather than graphemes or words. It is intended for linguistic analysis, pronunciation research, and downstream tasks that benefit from sub-word acoustic representations.

The base model's convolutional feature encoder is frozen; only the transformer layers and a freshly initialized CTC head (61 output tokens, including the special tokens) are fine-tuned.

Training Details

Parameter	Value
Base model	`facebook/wav2vec2-large-xlsr-53`
Dataset	`tunis-ai/arabic_speech_corpus`
Target label	`phonetic` column (phoneme sequences)
Sample rate	16 kHz
Epochs	40 (early stopping, patience = 10)
Effective batch size	16 (2 × 8 gradient accumulation)
Learning rate	1e-4
Warmup steps	500
Precision	fp16
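
The training script is not part of this card. As a rough sketch, the table values can be written as keyword arguments whose names follow transformers.TrainingArguments; the per-device/accumulation breakdown assumes a single GPU, and the variable names are illustrative. The arithmetic at the end derives the effective batch size and an approximate number of optimizer steps per epoch from the 1,530 training samples.

```python
import math

# Keyword names follow transformers.TrainingArguments; values come from the
# table above. A single training device is assumed.
training_kwargs = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "learning_rate": 1e-4,
    "warmup_steps": 500,
    "num_train_epochs": 40,
    "fp16": True,
}
# Could be passed as EarlyStoppingCallback(early_stopping_patience=...).
early_stopping_patience = 10

effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)
train_samples = 1530  # training split size reported in this card
steps_per_epoch = math.ceil(train_samples / effective_batch_size)
warmup_epochs = training_kwargs["warmup_steps"] / steps_per_epoch

print(effective_batch_size, steps_per_epoch, round(warmup_epochs, 1))  # 16 96 5.2
```

Under these assumptions the 500 warmup steps span roughly the first five epochs.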

Data Splits

The original train/test splits were pooled and re-split deterministically (seed 42):

Split	Samples	%
Train	1,530	80%
Validation	191	10%
Test	192	10%
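
The splitting code itself is not included here. A minimal sketch of a seeded 80/10/10 split that reproduces the sizes in the table, assuming the 1,913 pooled utterances are shuffled once and then cut (the function name split_indices is hypothetical):

```python
import random

def split_indices(n_items: int, seed: int = 42):
    """Shuffle indices with a fixed seed, then cut into 80/10/10."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)

    n_train = int(0.8 * n_items)      # 1,530 of 1,913
    n_val = (n_items - n_train) // 2  # 191; the odd remainder goes to test
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

train, val, test = split_indices(1913)
print(len(train), len(val), len(test))  # 1530 191 192
```

Because the shuffle uses its own seeded generator, repeated calls return the same partition.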

Preprocessing

Audio resampled to 16 kHz
Phoneme sequences cleaned: punctuation removed, lowercased, whitespace normalised
Gemination (doubled tokens like bb, dd) mapped to their single form to reduce vocabulary size
Noise/distortion marker dist mapped to sil
Phonemes joined with | (word delimiter token) before tokenisation to prevent character-level splitting
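
The text-side steps above can be sketched as follows. This is an illustration under assumptions, not the original preprocessing code: the gemination map contains only the two examples named above (bb, dd), the punctuation-removal step is omitted because the exact character set is not stated, and the function name is hypothetical.

```python
# Illustrative subset only: "bb" and "dd" are the examples given in this card;
# the full gemination mapping used in training is not listed.
GEMINATE_TO_SINGLE = {"bb": "b", "dd": "d"}

def normalise_phonemes(raw: str) -> str:
    """Lowercase, normalise whitespace, remap tokens, and join with '|'."""
    tokens = raw.lower().split()  # split() also collapses repeated whitespace
    cleaned = []
    for token in tokens:
        if token == "dist":       # noise/distortion marker becomes silence
            token = "sil"
        token = GEMINATE_TO_SINGLE.get(token, token)
        cleaned.append(token)
    return "|".join(cleaned)

print(normalise_phonemes("  sil K a bb   a dist "))  # sil|k|a|b|a|sil
```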

Vocabulary

The model uses a 59-phoneme vocabulary plus three special tokens (|, [UNK], [PAD]), for a total of 61 tokens.

Phoneme inventory after normalisation:

$ $$ * ** - ^ ^^ a a' aa aa' ah b d e ee f g h i i0 i0' i1 i1'
ii0 ii0' ii1 ii1' j k l m n p pp q r s sh sil t th u u0 u0' u1
u1' uu0 uu0' uu1 uu1' v w x y z

Evaluation Results

Evaluated on the held-out test set (192 samples):

Metric	Score
PER (Phoneme Error Rate)	3.95%
CER (Character Error Rate)	2.93%
Test Loss	0.0505
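
For reference, PER is the token-level edit distance between prediction and reference divided by the number of reference phonemes; CER applies the same formula to characters. The evaluation code is not shown in this card, so the sketch below is a plain-Python illustration that assumes space-separated phoneme strings and corpus-level aggregation (the function names are hypothetical).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def phoneme_error_rate(references, hypotheses):
    """Total edits divided by total reference phonemes."""
    edits = sum(edit_distance(r.split(), h.split())
                for r, h in zip(references, hypotheses))
    total = sum(len(r.split()) for r in references)
    return edits / total

refs = ["sil b a d sil", "k aa n"]
hyps = ["b aa d sil", "k aa n"]
print(phoneme_error_rate(refs, hyps))  # 2 edits / 8 reference phonemes = 0.25
```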

Training Curve Summary

The model converged rapidly:

Epochs 1–4: high loss, near-random output
Epoch 5: PER drops from ~95% to ~38%
Epoch 6: PER drops to 7.5%
Epochs 7–40: steady improvement to final PER of 3.95%

Error Analysis

The most common error types (measured on 100 test samples) are vowel length and stress confusions: the model occasionally substitutes short vowels for long ones and vice versa. No consonant confusions appear in the top-15 substitutions.

Top substitutions (reference → prediction):

Reference	Predicted	Count
`a`	`aa`	9×
`a`	`a'`	7×
`a'`	`a`	5×
`aa'`	`aa`	5×
`i0`	`ii0`	5×

Most common deletion: sil boundary tokens (96× in 100 samples). These are silence/pause markers at utterance boundaries; their deletion has minimal impact on phonemic content.
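
A tally like the one above can be produced by aligning each reference with its prediction and counting the mismatched spans. The sketch below uses difflib.SequenceMatcher from the standard library, which gives a reasonable but not guaranteed-minimal alignment; it is an assumption about how such counts could be obtained, not the analysis script used for this card.

```python
from collections import Counter
from difflib import SequenceMatcher

def tally_errors(references, hypotheses):
    """Count substitutions, deletions and insertions over aligned pairs."""
    substitutions, deletions, insertions = Counter(), Counter(), Counter()
    for ref, hyp in zip(references, hypotheses):
        r, h = ref.split(), hyp.split()
        matcher = SequenceMatcher(a=r, b=h, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace":
                # Pair tokens position by position; leftovers count as
                # deletions or insertions.
                n = min(i2 - i1, j2 - j1)
                substitutions.update(zip(r[i1:i1 + n], h[j1:j1 + n]))
                deletions.update(r[i1 + n:i2])
                insertions.update(h[j1 + n:j2])
            elif tag == "delete":
                deletions.update(r[i1:i2])
            elif tag == "insert":
                insertions.update(h[j1:j2])
    return substitutions, deletions, insertions

subs, dels, ins = tally_errors(["sil b a d sil"], ["b aa d sil"])
print(subs.most_common(), dels.most_common(), ins.most_common())
```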

Usage

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch

model_path = "path/to/wav2vec2-arabic-phoneme-final"

processor = Wav2Vec2Processor.from_pretrained(model_path)
model = Wav2Vec2ForCTC.from_pretrained(model_path)
model.eval()

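# audio_array: 1-D float numpy array sampled at 16 kHz.
# One way to obtain it (assumes librosa is installed; the path is a placeholder):
#   import librosa
#   audio_array, _ = librosa.load("path/to/audio.wav", sr=16000)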
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.decode(predicted_ids[0])

# Convert delimiter back to spaces
phonemes = transcription.replace("|", " ").strip()
print(phonemes)
# e.g. → "sil w a r a' jj a h a tt a q r ii0' r u0 ..."
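
The argmax-then-decode step above is greedy CTC decoding: consecutive repeated frame predictions are merged, then the blank token is dropped (Wav2Vec2 CTC models use the [PAD] token as the blank). A minimal stand-alone sketch of that rule, with made-up token ids and a hypothetical function name:

```python
from itertools import groupby

def ctc_greedy_collapse(frame_ids, blank_id):
    """Merge consecutive repeats, then drop the blank token."""
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return [token_id for token_id in merged if token_id != blank_id]

# Hypothetical per-frame argmax ids; 0 plays the role of the blank here.
frames = [0, 0, 5, 5, 0, 5, 7, 7, 0]
print(ctc_greedy_collapse(frames, blank_id=0))  # [5, 5, 7]
```

The blank between the two runs of 5 is what allows the same phoneme to be emitted twice in a row.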

Limitations

Trained on a single corpus (1,913 utterances); performance on out-of-domain speakers or dialects may degrade.
The model over-generates sil and $ tokens at utterance boundaries; strip trailing occurrences in post-processing if needed.
Vowel length and stress diacritics (', 0, 1) represent the most frequent error class; tasks that do not distinguish length or stress can benefit from collapsing these distinctions in post-processing.
The dataset's text column contains Buckwalter transliteration, not Arabic Unicode; this model is trained solely on the phonetic column and produces phoneme sequences, not orthographic Arabic script.
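
The two post-processing suggestions above could look roughly like this. It is a sketch under assumptions: the functions are hypothetical, trailing sil and $ are stripped unconditionally (which would also remove a genuine final $ phoneme), and length is collapsed by folding doubled vowel letters after removing the ', 0 and 1 marks.

```python
BOUNDARY_TOKENS = {"sil", "$"}
# Assumed long-to-short folding for vowels once diacritics are removed.
LONG_TO_SHORT = {"aa": "a", "ee": "e", "ii": "i", "uu": "u"}

def strip_trailing_boundaries(tokens):
    """Drop sil / $ tokens from the end of the sequence."""
    end = len(tokens)
    while end > 0 and tokens[end - 1] in BOUNDARY_TOKENS:
        end -= 1
    return tokens[:end]

def collapse_vowel(token):
    """Remove stress/length marks from vowel tokens; leave others as-is."""
    if token[0] not in "aeiu":
        return token
    base = token.rstrip("'01")
    return LONG_TO_SHORT.get(base, base)

tokens = "sil w a r a' jj aa h ii0' sil $".split()
tokens = strip_trailing_boundaries(tokens)
print(" ".join(collapse_vowel(t) for t in tokens))  # sil w a r a jj a h i
```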

Citation

If you use this model, please cite the base model and dataset:

@misc{conneau2020unsupervised,
  title={Unsupervised Cross-lingual Representation Learning for Speech Recognition},
  author={Conneau, Alexis and others},
  year={2020},
  eprint={2006.13979},
  archivePrefix={arXiv}
}

License

Inherits the license of facebook/wav2vec2-large-xlsr-53. Please review the original model card before use in commercial applications.
