---
license: apache-2.0
language:
  - en
tags:
  - speaker-verification
  - speaker-embedding
  - speaker-recognition
  - audio
  - ecapa-tdnn
  - pytorch
pipeline_tag: audio-classification
library_name: pytorch
---

# CASE Speaker Embedding v2 (512 channels)

[CASE Benchmark](https://github.com/gittb/case-benchmark)

**Carrier-Agnostic Speaker Embeddings (CASE)** is a speaker embedding model trained to generalize across acoustic carriers, including phone codecs, webcam microphones, speaker playback chains, and degraded audio conditions.

## Model Description

This model is based on the ECAPA-TDNN architecture with:
- **512 channels** (~6.2M parameters)
- **192-dimensional** L2-normalized embeddings
- **Global context attention** in the pooling layer
- Training on **VoxCeleb2** with the CASE v2 augmentation pipeline

### CASE v2 Augmentation Pipeline

The model was trained with a 6-mode carrier augmentation strategy designed to simulate real-world acoustic degradation:

| Mode | Distribution in Training Set | Description |
|------|-------------|-------------|
| Clean | 15% | No augmentation |
| Single Codec | 10% | GSM, G.711, Opus, MP3, AAC, G.722 |
| Single Mic | 10% | 10 microphone profiles (webcam, laptop, phone, etc.) |
| Codec + Mic | 15% | VoIP simulation |
| Light Chain | 25% | Reverb → Codec (reverberant room audio passed through a codec) |
| Full Chain | 25% | Codec → Speaker → Room → Mic (replay attack) |

## Usage

### Installation

```bash
pip install torch torchaudio numpy
```

### Quick Start

```python
from model import CASESpeakerEncoder

# Load model
encoder = CASESpeakerEncoder.from_pretrained("./")

# Extract embedding from audio file
embedding = encoder.encode("audio.wav")  # Returns (192,) numpy array

# Verify two speakers
same_speaker = encoder.verify("audio1.wav", "audio2.wav", threshold=0.5)
print(f"Same speaker: {same_speaker}")

# Get similarity score
emb1 = encoder.encode("audio1.wav")
emb2 = encoder.encode("audio2.wav")
similarity = encoder.similarity(emb1, emb2)
print(f"Similarity: {similarity:.3f}")
```

### Direct Model Usage

```python
import torch
import torchaudio
from model import ECAPA_TDNN

# Load model
model = ECAPA_TDNN(channels=512, global_context_att=True)
state_dict = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()

# Load audio (must be 16kHz)
wav, sr = torchaudio.load("audio.wav")
if sr != 16000:
    wav = torchaudio.transforms.Resample(sr, 16000)(wav)
wav = wav.mean(dim=0)  # Mono

# Extract embedding
with torch.no_grad():
    embedding = model(wav.unsqueeze(0))  # (1, 192)
```

### Batch Processing

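```python
from model import CASESpeakerEncoder

# Load the encoder as in Quick Start
encoder = CASESpeakerEncoder.from_pretrained("./")

# Process multiple files
audio_files = ["spk1_utt1.wav", "spk1_utt2.wav", "spk2_utt1.wav"]
embeddings = encoder.encode_batch(audio_files)  # (N, 192)

# Pairwise cosine similarities (embeddings are L2-normalized)
similarity_matrix = embeddings @ embeddings.T  # (N, N)
```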

## Input Requirements

| Parameter | Value |
|-----------|-------|
| Sample Rate | 16000 Hz |
| Channels | Mono |
| Format | Float32 in [-1, 1] range |
| Min Duration | ~0.5 seconds recommended |
| Max Duration | Any (uses attention pooling) |
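
A minimal sketch of preparing raw audio to match the table, assuming the input is int16 PCM held in a NumPy array that is already sampled at 16 kHz. `prepare_pcm16` is a hypothetical helper, not part of this repository, and it does not resample.

```python
import warnings

import numpy as np

TARGET_SR = 16000   # sample rate from the table above
MIN_SECONDS = 0.5   # recommended minimum duration from the table above


def prepare_pcm16(samples: np.ndarray, sample_rate: int) -> np.ndarray:
    """Convert int16 PCM, shaped (num_samples,) or (num_samples, channels),
    to mono float32 in [-1, 1]. Hypothetical helper; no resampling."""
    if sample_rate != TARGET_SR:
        raise ValueError(f"expected {TARGET_SR} Hz, got {sample_rate} Hz")
    if samples.dtype != np.int16:
        raise TypeError("this sketch only handles int16 PCM")
    audio = samples.astype(np.float32) / 32768.0  # scale to [-1, 1)
    if audio.ndim == 2:
        audio = audio.mean(axis=1)  # average channels to mono
    if audio.shape[0] < MIN_SECONDS * TARGET_SR:
        warnings.warn("clip is shorter than the recommended 0.5 s")
    return audio


# One second of constant-valued stereo audio as a stand-in for a real recording.
stereo = np.full((TARGET_SR, 2), 16384, dtype=np.int16)
mono = prepare_pcm16(stereo, TARGET_SR)
```

Audio at other sample rates would need resampling first, for instance with `torchaudio.transforms.Resample` as in Direct Model Usage.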

## Output

- **Embedding dimension**: 192
- **Normalization**: L2-normalized (unit norm)
- **Similarity metric**: Cosine similarity (dot product for normalized vectors)
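
The equivalence between cosine similarity and the dot product for unit-norm vectors can be checked numerically. The sketch below uses random 192-dimensional vectors as stand-ins for embeddings; they are not real model output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for two 192-dimensional embeddings (not real model output).
a = rng.standard_normal(192)
b = rng.standard_normal(192)

# L2-normalize, matching the normalization described above.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# Cosine similarity of the raw vectors vs. plain dot product of the unit vectors.
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
dot_of_units = float(a_unit @ b_unit)

print(np.isclose(cosine, dot_of_units))
```

Because the model's embeddings are already unit-norm, a matrix product such as `embeddings @ embeddings.T` gives all pairwise cosine similarities directly.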

## Training Details

| Parameter | Value |
|-----------|-------|
| Architecture | ECAPA-TDNN (512 channels) |
| Dataset | VoxCeleb2 (5,994 speakers) |
| Loss | AAM-Softmax (margin=0.2, scale=30) |
| Optimizer | Adam (lr=0.001) |
| Epochs | 70 |
| Augmentation | CASE v2 + MUSAN noise |

## Benchmark Results (CASE Benchmark)

Evaluated on the [CASE Benchmark](https://github.com/gittb/case-benchmark):

| Metric | Value |
|--------|-------|
| **Clean EER** | 1.22% |
| **Absolute EER** | 3.53% |
| **Degradation** | +2.31% |

| Category | Avg EER |
|----------|---------|
| Clean | 1.22% |
| Codec | 1.69% |
| Mic | 1.23% |
| Noise | 1.35% |
| Reverb | 6.56% |
| Playback | 9.10% |
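
The summary figures appear to be consistent with the category table: the mean of the six category EERs is about 3.53%, and subtracting the Clean EER gives about 2.31 percentage points. The check below is an observation from the published numbers, not the benchmark's documented definition; the Metrics Guide linked under Related Resources is the authoritative source.

```python
# Category averages copied from the table above (values in percent).
category_eer = {
    "Clean": 1.22,
    "Codec": 1.69,
    "Mic": 1.23,
    "Noise": 1.35,
    "Reverb": 6.56,
    "Playback": 9.10,
}

# Assumption: Absolute EER is the unweighted mean of the category averages.
absolute_eer = sum(category_eer.values()) / len(category_eer)

# Assumption: Degradation is Absolute EER minus Clean EER, in percentage points.
degradation = absolute_eer - category_eer["Clean"]

print(f"Absolute EER: {absolute_eer:.3f}%")
print(f"Degradation: {degradation:+.3f} points")
```

Small differences in the last digit are expected because the table values are themselves rounded.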

**Key Finding:** Achieves the **lowest degradation** (+2.31 percentage points, Absolute EER minus Clean EER) among tested models, which supports the carrier-agnostic training approach.

## Intended Use

This model is designed for:
- **Speaker verification**: Determining if two audio samples are from the same speaker
- **Speaker identification**: Matching against a database of enrolled speakers
- **Speaker diarization**: As an embedding extractor for clustering
- **Robustness testing**: Evaluating systems under acoustic degradation
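
As an example of the identification use case, the sketch below matches a query embedding against a small enrolled set by cosine similarity. `identify` is a hypothetical helper, the embeddings are random stand-ins rather than model output, and the 0.5 threshold is only the placeholder value from Quick Start; a real deployment would tune it on held-out data.

```python
from typing import Optional, Sequence, Tuple

import numpy as np


def identify(query: np.ndarray, enrolled: np.ndarray, names: Sequence[str],
             threshold: float = 0.5) -> Tuple[Optional[str], float]:
    """Return the best-matching enrolled speaker, or None below the threshold.

    query:    (192,) L2-normalized embedding
    enrolled: (N, 192) L2-normalized embeddings, one row per speaker
    """
    scores = enrolled @ query        # cosine similarity for unit vectors
    best = int(np.argmax(scores))
    score = float(scores[best])
    return (names[best] if score >= threshold else None), score


# Random unit vectors as stand-ins for enrolled speaker embeddings.
rng = np.random.default_rng(0)
names = ["alice", "bob", "carol"]
enrolled = rng.standard_normal((3, 192))
enrolled /= np.linalg.norm(enrolled, axis=1, keepdims=True)

# Querying with an enrolled vector should return that speaker.
match, score = identify(enrolled[1], enrolled, names)

# An unrelated random vector should fall below the threshold.
stranger = rng.standard_normal(192)
stranger /= np.linalg.norm(stranger)
no_match, low_score = identify(stranger, enrolled, names)
```

With several enrollment utterances per speaker, one common option is to average their embeddings and re-normalize before building the `enrolled` matrix.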

### Robustness Focus

Unlike standard speaker embedding models, CASE is specifically trained to maintain performance when audio is degraded by:
- Telephone codecs (GSM, G.711, AMR)
- VoIP compression (Opus, AAC)
- Microphone variability (webcam, laptop, phone mics)
- Room acoustics and reverberation
- Replay attacks (speaker playback chains)

## Limitations

- Optimized for speech; may not perform well on non-speech audio
- Best performance with audio longer than 1 second
- Not designed for speaker separation or enhancement
- English-centric training data (VoxCeleb)

## Related Resources

| Resource | Description | Link |
|----------|-------------|------|
| **CASE Benchmark** | Evaluation dataset with 24 protocols | [HuggingFace Dataset](https://huggingface.co/datasets/gittb/case-benchmark) |
| **Benchmark Code** | Evaluation scripts and tools | [GitHub](https://github.com/gittb/case-benchmark) |
| **Results** | Full leaderboard and per-protocol breakdowns | [Results](https://github.com/gittb/case-benchmark/tree/master/results) |
| **Metrics Guide** | How to interpret benchmark metrics | [Metrics Documentation](https://github.com/gittb/case-benchmark/blob/master/docs/metrics.md) |

## Citation

If you use this model, please cite:

```bibtex
@misc{case-speaker-embedding,
  title={CASE: Carrier-Agnostic Speaker Embeddings},
  year={2026},
  url={https://github.com/gittb/case-benchmark}
}
```

## License

Apache 2.0

## References

- ECAPA-TDNN: [Desplanques et al., 2020](https://arxiv.org/abs/2005.07143)
- VoxCeleb: [Nagrani et al., 2020](https://arxiv.org/abs/2012.06867)