---
license: apache-2.0
language:
- en
tags:
- speaker-verification
- speaker-embedding
- speaker-recognition
- audio
- ecapa-tdnn
- pytorch
pipeline_tag: audio-classification
library_name: pytorch
---

# CASE Speaker Embedding v2 (512 channels)

[Case Benchmark](https://github.com/gittb/case-benchmark)

**Carrier-Agnostic Speaker Embeddings (CASE)** - A robust speaker embedding model trained to generalize across acoustic carriers, including phone codecs, webcam microphones, speaker playback chains, and degraded audio conditions.

## Model Description

This model is based on the ECAPA-TDNN architecture with:

- **512 channels** (~6.2M parameters)
- **192-dimensional** L2-normalized embeddings
- **Global context attention** in the pooling layer
- Trained on **VoxCeleb2** with the CASE v2 augmentation pipeline

### CASE v2 Augmentation Pipeline

The model was trained with a 6-mode carrier augmentation strategy designed to simulate real-world acoustic degradation:

| Mode | Share of Training Set | Description |
|------|-------------|-------------|
| Clean | 15% | No augmentation |
| Single Codec | 10% | GSM, G.711, Opus, MP3, AAC, G.722 |
| Single Mic | 10% | 10 microphone profiles (webcam, laptop, phone, etc.) |
| Codec + Mic | 15% | VoIP simulation |
| Light Chain | 25% | Reverb → Codec (reverberant room audio sent through a codec) |
| Full Chain | 25% | Codec → Speaker → Room → Mic (replay attack) |

## Usage

### Installation

```bash
pip install torch torchaudio numpy
```

### Quick Start

```python
from model import CASESpeakerEncoder

# Load model
encoder = CASESpeakerEncoder.from_pretrained("./")

# Extract embedding from audio file
embedding = encoder.encode("audio.wav")  # Returns (192,) numpy array

# Verify two speakers
same_speaker = encoder.verify("audio1.wav", "audio2.wav", threshold=0.5)
print(f"Same speaker: {same_speaker}")

# Get similarity score
emb1 = encoder.encode("audio1.wav")
emb2 = encoder.encode("audio2.wav")
similarity = encoder.similarity(emb1, emb2)
print(f"Similarity: {similarity:.3f}")
```

### Direct Model Usage

```python
import torch
import torchaudio
from model import ECAPA_TDNN

# Load model
model = ECAPA_TDNN(channels=512, global_context_att=True)
state_dict = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()

# Load audio and resample to the required 16 kHz
wav, sr = torchaudio.load("audio.wav")
if sr != 16000:
    wav = torchaudio.transforms.Resample(sr, 16000)(wav)
wav = wav.mean(dim=0)  # Downmix to mono: (channels, T) -> (T,)

# Extract embedding
with torch.no_grad():
    embedding = model(wav.unsqueeze(0))  # (1, 192)
```

### Batch Processing

```python
# `encoder` is the CASESpeakerEncoder instance created in Quick Start

# Process multiple files
audio_files = ["spk1_utt1.wav", "spk1_utt2.wav", "spk2_utt1.wav"]
embeddings = encoder.encode_batch(audio_files)  # (N, 192)

# Compute pairwise similarities (dot product of unit-norm embeddings)
similarity_matrix = embeddings @ embeddings.T  # (N, N)
```

## Input Requirements

| Parameter | Value |
|-----------|-------|
| Sample Rate | 16000 Hz |
| Channels | Mono |
| Format | Float32 in [-1, 1] range |
| Min Duration | ~0.5 seconds recommended |
| Max Duration | Any (uses attention pooling) |

## Output

- **Embedding dimension**: 192
- **Normalization**: L2-normalized (unit norm)
- **Similarity metric**: Cosine similarity (equal to the dot product for normalized vectors)

## Training Details

| Parameter | Value |
|-----------|-------|
| Architecture | ECAPA-TDNN (512 channels) |
| Dataset | VoxCeleb2 (5,994 speakers) |
| Loss | AAM-Softmax (margin=0.2, scale=30) |
| Optimizer | Adam (lr=0.001) |
| Epochs | 70 |
| Augmentation | CASE v2 + MUSAN noise |

## Benchmark Results (CASE Benchmark)

Evaluated on the [CASE Benchmark](https://github.com/gittb/case-benchmark):

| Metric | Value |
|--------|-------|
| **Clean EER** | 1.22% |
| **Absolute EER** | 3.53% |
| **Degradation** | +2.31% |

| Category | Avg EER |
|----------|---------|
| Clean | 1.22% |
| Codec | 1.69% |
| Mic | 1.23% |
| Noise | 1.35% |
| Reverb | 6.56% |
| Playback | 9.10% |

**Key Finding:** Achieves the **lowest degradation factor** (+2.31%) among tested models, validating the carrier-agnostic training approach.

## Intended Use

This model is designed for:

- **Speaker verification**: Determining whether two audio samples are from the same speaker
- **Speaker identification**: Matching against a database of enrolled speakers
- **Speaker diarization**: As an embedding extractor for clustering
- **Robustness testing**: Evaluating systems under acoustic degradation

### Robustness Focus

Unlike standard speaker embedding models, CASE is specifically trained to maintain performance when audio is degraded by:

- Telephone codecs (GSM, G.711, AMR)
- VoIP compression (Opus, AAC)
- Microphone variability (webcam, laptop, phone mics)
- Room acoustics and reverberation
- Replay attacks (speaker playback chains)

## Limitations

- Optimized for speech; may not perform well on non-speech audio
- Best performance with audio longer than 1 second
- Not designed for speaker separation or enhancement
- English-centric training data (VoxCeleb)

## Related Resources

| Resource | Description | Link |
|----------|-------------|------|
| **CASE Benchmark** | Evaluation dataset with 24 protocols | [HuggingFace Dataset](https://huggingface.co/datasets/gittb/case-benchmark) |
| **Benchmark Code** | Evaluation scripts and tools | [GitHub](https://github.com/gittb/case-benchmark) |
| **Results** | Full leaderboard and per-protocol breakdowns | [Results](https://github.com/gittb/case-benchmark/tree/master/results) |
| **Metrics Guide** | How to interpret benchmark metrics | [Metrics Documentation](https://github.com/gittb/case-benchmark/blob/master/docs/metrics.md) |

## Citation

If you use this model, please cite:

```bibtex
@misc{case-speaker-embedding,
  title={CASE: Carrier-Agnostic Speaker Embeddings},
  year={2026},
  url={https://github.com/gittb/case-benchmark}
}
```

## License

Apache 2.0

## References

- ECAPA-TDNN: [Desplanques et al., 2020](https://arxiv.org/abs/2005.07143)
- VoxCeleb: [Nagrani et al., 2020](https://arxiv.org/abs/2012.06867)
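
## Appendix: Similarity Scoring Sketch

The Output section states that embeddings are L2-normalized, so cosine similarity reduces to a plain dot product. The sketch below illustrates that property with NumPy only. It uses random vectors as stand-ins for real embeddings (an assumption, so it runs without the model weights), and the helper names `l2_normalize` and `cosine_similarity` are hypothetical, not part of this repository's `model` module. The 0.5 threshold simply mirrors the Quick Start example and would likely need tuning on your own data.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row of x to unit L2 norm (hypothetical helper)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Explicit cosine similarity, valid for unnormalized vectors too."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Random stand-ins for three 192-dimensional embeddings (not real model output).
rng = np.random.default_rng(0)
embeddings = l2_normalize(rng.standard_normal((3, 192)))

# For unit-norm rows, the Gram matrix holds all pairwise cosine similarities.
similarity_matrix = embeddings @ embeddings.T  # shape (3, 3)

# Threshold borrowed from the Quick Start example; tune it for your data.
THRESHOLD = 0.5
decisions = similarity_matrix >= THRESHOLD  # True where a pair is accepted

# The dot product and the explicit cosine formula agree for unit-norm vectors.
explicit = cosine_similarity(embeddings[0], embeddings[1])
print(np.isclose(similarity_matrix[0, 1], explicit))
```

Because the matrix is symmetric with ones on the diagonal, only the upper triangle needs to be inspected when scoring distinct pairs.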