---
language: ar
license: apache-2.0
tags:
  - audio-classification
  - speaker-identification
  - quran
  - arabic
  - ecapa-tdnn
  - speechbrain
  - pytorch
datasets:
  - iarhamanwaar/quran-reciter-audio
metrics:
  - accuracy
pipeline_tag: audio-classification
---

# Quran Reciter Identification — Fine-tuned ECAPA-TDNN

Identifies which of **362 Quran reciters** is heard in an audio clip. The model is a fine-tuned [ECAPA-TDNN](https://arxiv.org/abs/2005.07143) speaker encoder; a clip is scored by cosine similarity between its embedding and per-reciter sub-centroids.

## Model Description

- **Architecture**: ECAPA-TDNN (SpeechBrain `spkrec-ecapa-voxceleb`) fine-tuned with AAM-Softmax loss
- **Embedding dimension**: 192
- **Inference method**: Cosine similarity against K=3 sub-centroids per reciter, intended to capture different vocal conditions (neutral, emotional, different acoustics); a reciter's score is the maximum over its sub-centroids
- **Multi-crop inference**: Averages embeddings from multiple 20-second crops for robustness
- **Training data**: 8,800+ audio files across 362 reciters from MP3Quran.net
- **Validation accuracy**: 92.7% on 20-second clips
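
The multi-crop step could be sketched as follows. This is a minimal, standard-library-only illustration of choosing evenly spaced 20-second windows. The helper name `crop_windows` and the `max_crops` cap are assumptions, since this card does not state how many crops are used or how they are placed.

```python
def crop_windows(num_samples, sample_rate=16000, crop_seconds=20.0, max_crops=5):
    """Return (start, end) sample indices of evenly spaced crops.

    max_crops is a hypothetical parameter; the number of crops used by
    the released model is not documented here.
    """
    crop_len = int(crop_seconds * sample_rate)
    if num_samples <= crop_len:
        # Clip is no longer than one crop: use it whole.
        return [(0, num_samples)]
    # Enough crops to cover the clip (ceiling division), capped at max_crops.
    n = min(max_crops, -(-num_samples // crop_len))
    if n <= 1:
        return [(0, crop_len)]
    # Spread the crop starts evenly between the first and last valid start.
    last_start = num_samples - crop_len
    starts = [round(i * last_start / (n - 1)) for i in range(n)]
    return [(s, s + crop_len) for s in starts]


# A 60-second clip at 16 kHz yields three non-overlapping windows.
print(crop_windows(60 * 16000))
# [(0, 320000), (320000, 640000), (640000, 960000)]
```

Each window would then be sliced from the waveform (for example `waveform[:, start:end]`), embedded separately, and the resulting embeddings averaged and L2-normalized again before scoring against the sub-centroids.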

## Files

- `encoder.pth` — Fine-tuned ECAPA-TDNN encoder weights
- `centroids.pt` — Sub-centroids tensor, shape `(362, 3, 192)`
- `metadata.json` — Reciter ID to name mapping

## Usage

```python
import json

import torch
import torch.nn.functional as F
from speechbrain.inference.speaker import EncoderClassifier

# Load the base encoder, then overwrite its weights with the fine-tuned ones
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)
state_dict = torch.load("encoder.pth", map_location="cpu")
encoder.mods.load_state_dict(state_dict)

# Load centroids and metadata
centroids = torch.load("centroids.pt", map_location="cpu")  # (362, 3, 192)
with open("metadata.json") as f:
    metadata = json.load(f)
id_to_reciter = metadata["id_to_reciter"]

# Identify from audio (16 kHz mono waveform)
waveform = ...  # torch.Tensor, shape (1, samples)
with torch.no_grad():
    embedding = encoder.encode_batch(waveform).squeeze()  # (192,)
    embedding = F.normalize(embedding, p=2, dim=0)

# Cosine similarity against sub-centroids (max over K=3).
# Normalizing the centroids changes nothing if they are already unit-norm.
centroids = F.normalize(centroids, p=2, dim=-1)
sims = torch.matmul(centroids, embedding)  # (362, 3)
scores = sims.max(dim=1).values  # (362,)
best_id = scores.argmax().item()
print(f"Reciter: {id_to_reciter[str(best_id)]}")
```
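
When the top match is uncertain, it can help to inspect several candidates rather than only the argmax. The sketch below wraps the scoring step above in a hypothetical helper, `rank_reciters` (not part of this repository), and returns the top-k reciter IDs with their scores. It assumes a 1-D embedding and a centroid tensor of shape `(num_reciters, K, dim)`.

```python
import torch
import torch.nn.functional as F


def rank_reciters(embedding, centroids, k=5):
    """Return the top-k (reciter_id, score) pairs.

    The score is the maximum cosine similarity over a reciter's sub-centroids.
    """
    embedding = F.normalize(embedding, p=2, dim=0)    # (dim,)
    centroids = F.normalize(centroids, p=2, dim=-1)   # (num_reciters, K, dim)
    sims = torch.matmul(centroids, embedding)         # (num_reciters, K)
    scores = sims.max(dim=1).values                   # (num_reciters,)
    top = torch.topk(scores, k=min(k, scores.numel()))
    return [(int(i), float(s)) for i, s in zip(top.indices, top.values)]
```

With the objects loaded in the usage example, `rank_reciters(embedding, centroids, k=3)` would give three `(id, score)` pairs, and each ID could be mapped to a name via `id_to_reciter[str(id)]`.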

## Training Details

- **Base model**: `speechbrain/spkrec-ecapa-voxceleb`
- **Loss**: AAM-Softmax (margin=0.2, scale=30)
- **Optimizer**: AdamW with dual learning rates (encoder: 1e-4, head: 1e-3)
- **Scheduler**: CosineAnnealingWarmRestarts
- **Epochs**: up to 20, with early stopping (patience=8)
- **Batch size**: 8
- **Clip duration**: 20 seconds (random crop during training)
- **Augmentation**: Speed perturbation (0.9x-1.1x)
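
To make the loss setting concrete, here is a small worked sketch of AAM-Softmax for a single sample, using the margin and scale listed above. It is an illustration in plain Python, not the training code: the function name `aam_softmax_loss` is hypothetical, and the easy-margin fallback that many implementations apply when the angle plus margin exceeds π is omitted.

```python
import math


def aam_softmax_loss(cosines, target, margin=0.2, scale=30.0):
    """Additive angular margin softmax loss for one sample.

    cosines: cosine similarity between the embedding and each class weight.
    target:  index of the true class.
    """
    logits = []
    for j, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # guard acos against rounding error
        if j == target:
            # Add the margin to the angle of the true class only.
            c = math.cos(math.acos(c) + margin)
        logits.append(scale * c)
    # Cross-entropy via a numerically stable log-sum-exp.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


# The margin lowers the true-class logit, so the loss is larger than
# plain scaled-cosine softmax for the same cosines.
with_margin = aam_softmax_loss([0.8, 0.1, -0.2], target=0)
without_margin = aam_softmax_loss([0.8, 0.1, -0.2], target=0, margin=0.0)
print(with_margin > without_margin)  # True
```

The margin forces the true-class angle to be smaller by at least 0.2 radians before the sample is considered well classified, which tends to tighten each reciter's cluster in embedding space.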

## Evaluation

The table below reports validation accuracy together with results on client-provided YouTube clips of 9 different reciters (53 test cases):

| Metric | Score |
|--------|-------|
| Validation accuracy (clean audio) | 92.7% |
| YouTube test accuracy | 96.2% (51/53) |

## Limitations

- Optimized for 20+ second clips; shorter clips may have lower accuracy
- Emotional/crying recitation may reduce accuracy for some reciters
- Trained on studio recordings; very noisy environments may degrade performance

## Citation

If you use this model, please cite:

```bibtex
@misc{quran-reciter-id-2026,
  title={Quran Reciter Identification using Fine-tuned ECAPA-TDNN},
  author={Arham Anwaar},
  year={2026},
  url={https://huggingface.co/iarhamanwaar/quran-reciter-id-ecapa}
}
```