---
language: ar
license: apache-2.0
tags:
- audio-classification
- speaker-identification
- quran
- arabic
- ecapa-tdnn
- speechbrain
- pytorch
datasets:
- iarhamanwaar/quran-reciter-audio
metrics:
- accuracy
pipeline_tag: audio-classification
---

# Quran Reciter Identification: Fine-tuned ECAPA-TDNN

Identifies which of **362 Quran reciters** is reciting in an audio clip, using a fine-tuned [ECAPA-TDNN](https://arxiv.org/abs/2005.07143) speaker encoder and cosine similarity against per-reciter sub-centroids.

## Model Description

- **Architecture**: ECAPA-TDNN (SpeechBrain `spkrec-ecapa-voxceleb`) fine-tuned with AAM-Softmax loss
- **Embedding dimension**: 192
- **Inference method**: Cosine similarity against K=3 sub-centroids per reciter, taking the maximum over the three (the sub-centroids capture different vocal conditions: neutral, emotional, different acoustics)
- **Multi-crop inference**: Averages embeddings from multiple 20-second crops for robustness
- **Training data**: 8,800+ audio files across 362 reciters from MP3Quran.net
- **Validation accuracy**: 92.7% on 20-second clips

## Files

- `encoder.pth`: Fine-tuned ECAPA-TDNN encoder weights
- `centroids.pt`: Sub-centroids tensor, shape `(362, 3, 192)`
- `metadata.json`: Reciter ID to name mapping

## Usage

```python
import json

import torch
import torch.nn.functional as F
from speechbrain.inference.speaker import EncoderClassifier

# Load the base encoder, then overwrite its weights with the fine-tuned ones
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)
state_dict = torch.load("encoder.pth", map_location="cpu")
encoder.mods.load_state_dict(state_dict)
encoder.eval()

# Load centroids and metadata
centroids = torch.load("centroids.pt", map_location="cpu")  # (362, 3, 192)
with open("metadata.json") as f:
    metadata = json.load(f)
id_to_reciter = metadata["id_to_reciter"]

# Identify from audio (16 kHz mono waveform)
waveform = ...  # torch.Tensor, shape (1, samples)
with torch.no_grad():
    embedding = encoder.encode_batch(waveform).squeeze()  # (192,)
embedding = F.normalize(embedding, p=2, dim=0)

# Cosine similarity against sub-centroids (max over K=3).
# The dot product equals cosine similarity only if the stored centroids are unit-norm.
sims = torch.matmul(centroids, embedding)  # (362, 3)
scores = sims.max(dim=1).values            # (362,)
best_id = scores.argmax().item()
print(f"Reciter: {id_to_reciter[str(best_id)]}")
```

## Training Details

- **Base model**: `speechbrain/spkrec-ecapa-voxceleb`
- **Loss**: AAM-Softmax (margin=0.2, scale=30)
- **Optimizer**: AdamW with separate learning rates (encoder: 1e-4, head: 1e-3)
- **Scheduler**: CosineAnnealingWarmRestarts
- **Epochs**: up to 20, with early stopping (patience=8)
- **Batch size**: 8
- **Clip duration**: 20 seconds (random crop during training)
- **Augmentation**: Speed perturbation (0.9x-1.1x)

## Evaluation

Validation accuracy is measured on clean 20-second clips. The YouTube test set consists of client-provided clips of 9 different reciters (53 test cases):

| Metric | Score |
|--------|-------|
| Validation accuracy (clean audio) | 92.7% |
| YouTube test accuracy | 96.2% (51/53) |

## Limitations

- Optimized for clips of 20 seconds or longer; shorter clips may have lower accuracy
- Emotional or crying recitation may reduce accuracy for some reciters
- Trained on studio recordings; very noisy environments may degrade performance

## Citation

If you use this model, please cite:

```
@misc{quran-reciter-id-2026,
  title={Quran Reciter Identification using Fine-tuned ECAPA-TDNN},
  author={Arham Anwaar},
  year={2026},
  url={https://huggingface.co/iarhamanwaar/quran-reciter-id-ecapa}
}
```
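
## Multi-crop inference sketch

The Model Description mentions averaging embeddings from multiple 20-second crops, but the Usage snippet embeds a single waveform. The following is a minimal sketch of how that averaging could be combined with the max-over-sub-centroids scoring. It is not the author's exact pipeline: the helper names (`split_into_crops`, `identify`), the `embed_fn` callable, the number of crops (`max_crops=5`), their even spacing, and the choice to L2-normalize each crop embedding before averaging are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

SAMPLE_RATE = 16_000   # the card specifies 16 kHz mono input
CROP_SECONDS = 20      # the card's clip duration


def split_into_crops(waveform, crop_len, max_crops=5):
    """Return up to `max_crops` evenly spaced crops, each of shape (1, crop_len).

    Hypothetical helper: crop count and spacing are assumptions.
    """
    total = waveform.shape[-1]
    if total <= crop_len:
        return [waveform]  # short clip: use it unchanged
    n = min(max_crops, -(-total // crop_len))  # ceil(total / crop_len), capped
    span = total - crop_len
    starts = [span * i // max(n - 1, 1) for i in range(n)]
    return [waveform[..., s:s + crop_len] for s in starts]


def identify(waveform, embed_fn, centroids, max_crops=5):
    """Average crop embeddings, then score against sub-centroids.

    waveform:  (1, samples) tensor, 16 kHz mono
    embed_fn:  callable mapping a (1, samples) tensor to an embedding tensor
    centroids: (num_reciters, K, dim) tensor
    Returns (best_reciter_index, best_cosine_score).
    """
    crops = split_into_crops(waveform, SAMPLE_RATE * CROP_SECONDS, max_crops)
    with torch.no_grad():
        embs = torch.stack([embed_fn(c).reshape(-1) for c in crops])  # (n, dim)
    # Assumption: normalize each crop embedding, average, then re-normalize.
    embs = F.normalize(embs, dim=1)
    embedding = F.normalize(embs.mean(dim=0), dim=0)  # (dim,)
    # Normalizing the centroids here makes the dot product a true cosine
    # similarity even if the stored centroids are not unit-norm.
    sims = F.normalize(centroids, dim=-1) @ embedding  # (num_reciters, K)
    scores = sims.max(dim=1).values                    # max over K sub-centroids
    best = int(scores.argmax())
    return best, scores[best].item()
```

With the objects from the Usage section, this could be called as `identify(waveform, lambda w: encoder.encode_batch(w), centroids)`, and the returned index mapped to a name through `id_to_reciter[str(best_id)]`.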