🎙️ Arabic Phoneme ASR — wav2vec2-large-xlsr-53

Fine-tuned facebook/wav2vec2-large-xlsr-53 for phoneme-level Arabic speech recognition using CTC, trained on the tunis-ai/arabic_speech_corpus dataset.


Model Description

This model transcribes Arabic speech directly into a sequence of phoneme tokens rather than graphemes or words. It is intended for linguistic analysis, pronunciation research, and downstream tasks that benefit from sub-word acoustic representations.

The base model's convolutional feature encoder is frozen; only the transformer layers and a freshly initialized CTC head (61 phoneme tokens) are fine-tuned.


Training Details

| Parameter | Value |
|---|---|
| Base model | facebook/wav2vec2-large-xlsr-53 |
| Dataset | tunis-ai/arabic_speech_corpus |
| Target label | phonetic column (phoneme sequences) |
| Sample rate | 16 kHz |
| Epochs | 40 (early stopping, patience = 10) |
| Effective batch size | 16 (2 × 8 gradient accumulation) |
| Learning rate | 1e-4 |
| Warmup steps | 500 |
| Precision | fp16 |
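
The batch-size and warmup figures above imply a rough step budget, which the sketch below works through. It assumes a single device, takes the 1,530-sample training split from the Data Splits section, and reads "2 × 8" as a per-device batch of 2 with 8 gradient-accumulation steps. The variable names are illustrative, and the exact step count depends on how the training loop handles the last partial batch.

```python
import math

# Values taken from the table above; the 2 / 8 reading of "2 x 8" is an assumption.
per_device_batch = 2
grad_accum_steps = 8
train_samples = 1530  # training split size from the Data Splits section
warmup_steps = 500

effective_batch = per_device_batch * grad_accum_steps
# Approximate number of optimizer updates per epoch on one device.
steps_per_epoch = math.ceil(train_samples / effective_batch)
warmup_epochs = warmup_steps / steps_per_epoch

print(effective_batch)          # 16
print(steps_per_epoch)          # 96
print(round(warmup_epochs, 1))  # 5.2
```

Under these assumptions the learning-rate warmup covers roughly the first five epochs, which may be related to the slow start reported in the Training Curve Summary.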

Data Splits

The original train/test splits were pooled and re-split deterministically (seed 42):

| Split | Samples | % |
|---|---|---|
| Train | 1,530 | 80% |
| Validation | 191 | 10% |
| Test | 192 | 10% |
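
A seeded 80/10/10 split with these sizes can be sketched in plain Python as below. The tool actually used for the re-split is not stated, so this reproduces only the split sizes for 1,913 pooled samples, not the assignment of individual utterances; `split_indices` is a hypothetical helper.

```python
import random

def split_indices(n_samples, seed=42, train_frac=0.8):
    """Shuffle indices with a fixed seed and cut them into train/val/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG, so the split is reproducible
    n_train = int(n_samples * train_frac)
    n_val = (n_samples - n_train) // 2    # remainder is halved; test gets any odd sample
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

train, val, test = split_indices(1913)
print(len(train), len(val), len(test))  # 1530 191 192
```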

Preprocessing

  • Audio resampled to 16 kHz
  • Phoneme sequences cleaned: punctuation removed, lowercased, whitespace normalised
  • Gemination (doubled tokens like bb, dd) mapped to their single form to reduce vocabulary size
  • Noise/distortion marker dist mapped to sil
  • Phonemes joined with | (word delimiter token) before tokenisation to prevent character-level splitting
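
The text-side steps above could look roughly like the following sketch. The punctuation set is an assumption (several phoneme symbols such as $, *, ^ and ' are themselves punctuation characters and must survive), and the geminate map contains only the two examples named above because the full list is not given. `normalise_phonemes` is a hypothetical helper, not the author's code.

```python
PUNCTUATION = ".,;:!?"                   # assumed set; must not include phoneme symbols
GEMINATE_MAP = {"bb": "b", "dd": "d"}    # only the examples named in the text
TOKEN_MAP = {"dist": "sil"}              # noise/distortion marker -> silence

def normalise_phonemes(raw):
    """Clean a space-separated phoneme string and join tokens with '|'."""
    text = raw.lower()
    text = text.translate(str.maketrans("", "", PUNCTUATION))
    tokens = text.split()                # also collapses repeated whitespace
    tokens = [GEMINATE_MAP.get(tok, tok) for tok in tokens]
    tokens = [TOKEN_MAP.get(tok, tok) for tok in tokens]
    return "|".join(tokens)

print(normalise_phonemes("sil  B a dd . dist"))  # sil|b|a|d|sil
```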

Vocabulary

The model uses a 59-phoneme vocabulary plus three special tokens (|, [UNK], [PAD]), for a total of 61 tokens.

Phoneme inventory after normalisation:

$ $$ * ** - ^ ^^ a a' aa aa' ah b d e ee f g h i i0 i0' i1 i1'
ii0 ii0' ii1 ii1' j k l m n p pp q r s sh sil t th u u0 u0' u1
u1' uu0 uu0' uu1 uu1' v w x y z
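
A token-to-id mapping of this shape can be assembled from the normalised transcripts roughly as follows. This is a sketch under assumptions: the ordering of ids and the position of the special tokens are illustrative, and `build_vocab` is a hypothetical helper.

```python
def build_vocab(transcripts):
    """Map each phoneme token, then the special tokens, to an integer id."""
    phonemes = sorted({tok for line in transcripts for tok in line.split("|") if tok})
    vocab = {tok: idx for idx, tok in enumerate(phonemes)}
    for special in ("|", "[UNK]", "[PAD]"):  # special tokens listed above
        vocab[special] = len(vocab)
    return vocab

sample = ["sil|b|a|d|sil", "sil|k|aa|t|a|b|sil"]  # toy transcripts, not corpus data
vocab = build_vocab(sample)
print(vocab)
```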

Evaluation Results

Evaluated on the held-out test set (192 samples):

| Metric | Score |
|---|---|
| PER (Phoneme Error Rate) | 3.95% |
| CER (Character Error Rate) | 2.93% |
| Test Loss | 0.0505 |
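
For reference, PER and CER are commonly defined as edit distance divided by reference length, computed over phoneme tokens and characters respectively. The sketch below shows one corpus-level formulation; the evaluation code behind the scores above is not given, so details such as per-utterance versus corpus-level averaging are assumptions, and the example strings are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def error_rate(refs, hyps):
    """Corpus-level rate: total edits divided by total reference length."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return edits / total

refs = ["sil b a r sil", "k aa t"]
hyps = ["sil b aa r", "k aa t"]

per = error_rate([r.split() for r in refs], [h.split() for h in hyps])  # phoneme tokens
cer = error_rate(refs, hyps)                                            # characters
print(f"PER={per:.3f} CER={cer:.3f}")
```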

Training Curve Summary

The model converged rapidly:

  • Epochs 1–4: high loss, near-random output
  • Epoch 5: PER drops from ~95% to ~38%
  • Epoch 6: PER drops to 7.5%
  • Epochs 7–40: steady improvement to final PER of 3.95%

Error Analysis

The most common errors (measured on 100 test samples) are vowel length and stress confusions: the model occasionally substitutes short vowels for long ones and vice versa. No consonant confusions appear in the top-15 substitutions.

Top substitutions (reference → prediction):

| Reference | Predicted | Count |
|---|---|---|
| a | aa | |
| a | a' | |
| a' | a | |
| aa' | aa | |
| i0 | ii0 | |

Most common deletion: sil boundary tokens (96× in 100 samples). These are silence/pause markers at utterance boundaries; their deletion has minimal impact on phonemic content.
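
Substitution and deletion counts of this kind can be gathered by aligning each reference with its prediction and tallying the edit operations. The sketch below uses a minimum-edit alignment with a backtrace. The alignment method used for the analysis above is not stated, and where several alignments are equally cheap the tie-breaking affects individual counts. `align` is a hypothetical helper and the example pair is made up.

```python
from collections import Counter

def align(ref, hyp):
    """Return (op, ref_tok, hyp_tok) tuples from a minimum-edit alignment."""
    n, m = len(ref), len(hyp)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = ref[i - 1] != hyp[j - 1]
            dist[i][j] = min(dist[i - 1][j - 1] + cost,
                             dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1)
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # walk back from the bottom-right cell
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            op = "ok" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return ops[::-1]

substitutions, deletions = Counter(), Counter()
for op, r, h in align("sil b a r sil".split(), "b aa r sil".split()):
    if op == "sub":
        substitutions[(r, h)] += 1
    elif op == "del":
        deletions[r] += 1

print(substitutions.most_common())  # [(('a', 'aa'), 1)]
print(deletions.most_common())      # [('sil', 1)]
```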


Usage

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch

model_path = "path/to/wav2vec2-arabic-phoneme-final"

processor = Wav2Vec2Processor.from_pretrained(model_path)
model = Wav2Vec2ForCTC.from_pretrained(model_path)
model.eval()

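# audio_array: 1-D float numpy array of mono audio sampled at 16 kHz.
# librosa is one way to load it (an assumption, not part of the original
# recipe); the file path is a placeholder.
import librosa
audio_array, _ = librosa.load("path/to/audio.wav", sr=16000)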
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.decode(predicted_ids[0])

# Convert delimiter back to spaces
phonemes = transcription.replace("|", " ").strip()
print(phonemes)
# e.g. → "sil w a r a' jj a h a tt a q r ii0' r u0 ..."
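
To make explicit what the argmax and decode steps above amount to, here is a simplified sketch of greedy CTC decoding: consecutive repeats are merged, blank tokens are dropped, and the | delimiter separates phonemes. It is an illustration rather than the library's implementation. It assumes [PAD] serves as the CTC blank, and the toy vocabulary and function name are hypothetical.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank="[PAD]", delimiter="|"):
    """Collapse repeated frame predictions, then drop blanks and delimiters."""
    tokens = []
    previous = None
    for idx in frame_ids:
        if idx != previous:              # merge runs of the same id
            tokens.append(id_to_token[idx])
        previous = idx
    return [tok for tok in tokens if tok not in (blank, delimiter)]

id_to_token = {0: "[PAD]", 1: "|", 2: "sil", 3: "b", 4: "a"}  # toy vocabulary
frames = [0, 2, 2, 0, 1, 3, 3, 3, 0, 1, 4, 0, 4, 1, 1, 2]     # per-frame argmax ids
print(" ".join(ctc_greedy_decode(frames, id_to_token)))       # sil b a a sil
```

Note that a blank between two identical ids keeps them as separate tokens, which is how CTC can emit the same phoneme twice in a row.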

Limitations

  • Trained on a single corpus (1,913 utterances); performance on out-of-domain speakers or dialects may degrade.
  • The model over-generates sil and $ tokens at utterance boundaries; strip trailing occurrences in post-processing if needed.
  • Confusions in vowel length and stress markers (', 0, 1) are the most frequent error class; tasks that do not distinguish length or stress can collapse these distinctions in post-processing.
  • The dataset's text column contains Buckwalter transliteration, not Arabic Unicode; this model is trained solely on the phonetic column and produces phoneme sequences, not orthographic Arabic script.
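
The two post-processing suggestions above (stripping trailing sil and $ tokens, and collapsing length/stress distinctions) could be sketched as follows. The long-to-short vowel map is an assumption about how the inventory pairs up, and both helper names are hypothetical.

```python
BOUNDARY_TOKENS = {"sil", "$"}
MARKERS = "'01"  # markers named in the limitations above
LONG_TO_SHORT = {"aa": "a", "ii": "i", "uu": "u", "ee": "e"}  # assumed long/short pairs

def strip_trailing_boundaries(tokens):
    """Remove sil and $ tokens from the end of the sequence."""
    end = len(tokens)
    while end > 0 and tokens[end - 1] in BOUNDARY_TOKENS:
        end -= 1
    return tokens[:end]

def collapse_markers(tokens, collapse_length=False):
    """Drop trailing ', 0, 1 markers; optionally map long vowels to short ones."""
    out = []
    for tok in tokens:
        base = tok.rstrip(MARKERS) or tok  # keep the token if it is all markers
        if collapse_length:
            base = LONG_TO_SHORT.get(base, base)
        out.append(base)
    return out

tokens = "sil b aa' r ii0 sil $ sil".split()
trimmed = strip_trailing_boundaries(tokens)
print(trimmed)                                          # ['sil', 'b', "aa'", 'r', 'ii0']
print(collapse_markers(trimmed))                        # ['sil', 'b', 'aa', 'r', 'ii']
print(collapse_markers(trimmed, collapse_length=True))  # ['sil', 'b', 'a', 'r', 'i']
```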

Citation

If you use this model, please cite the base model and dataset:

@misc{conneau2020unsupervised,
  title={Unsupervised Cross-lingual Representation Learning for Speech Recognition},
  author={Conneau, Alexis and others},
  year={2020},
  eprint={2006.13979},
  archivePrefix={arXiv}
}

License

Inherits the license of facebook/wav2vec2-large-xlsr-53. Please review the original model card before use in commercial applications.
