---
license: apache-2.0
tags:
- audio
- anti-spoofing
- audio-deepfake-detection
- speech
- asvspoof
- wav2vec2
- aasist
- kan
---

# Spectra-AASIST3

[![EER% 0.00 on CD-ADD](https://img.shields.io/badge/EER%25%20on%20CD--ADD-0.00%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)
[![EER% 0.44 on SONAR](https://img.shields.io/badge/EER%25%20on%20SONAR-0.44%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)
[![EER% 0.51 on CVoiceFake_small](https://img.shields.io/badge/EER%25%20on%20CVoiceFake__small-0.51%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)
[![EER% 0.71 on CFAD](https://img.shields.io/badge/EER%25%20on%20CFAD-0.71%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)
[![EER% 0.83 on LibriSeVoc](https://img.shields.io/badge/EER%25%20on%20LibriSeVoc-0.83%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)
[![EER% 0.97 on ASVspoof2019_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2019__LA-0.97%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)
[![EER% 1.20 on InTheWild](https://img.shields.io/badge/EER%25%20on%20InTheWild-1.20%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)
[![EER% 4.30 on ASVspoof2021_DF](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__DF-4.30%25-green)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)
[![EER% 4.38 on ASVspoof2021_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__LA-4.38%25-green)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)
[![EER% 15.09 on ASVspoof5](https://img.shields.io/badge/EER%25%20on%20ASVspoof5-15.09%25-lightgrey)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)
[![arena tier](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/spectra-aasist3/tier.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)
[![arena rank](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/spectra-aasist3/rank.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)

**Spectra-AASIST3** is a speech anti-spoofing model that pairs a **wav2vec 2.0 (XLS-R-300m)** self-supervised front-end with a **KAN-enhanced AASIST** (KAN-AASIST) back-end. The model takes a raw speech waveform and returns a score where **higher = more bona fide**.

- **Code / checkpoint:** [`lab260/Spectra-AASIST3`](https://huggingface.co/lab260/Spectra-AASIST3) (`model.safetensors`, self-contained: it bundles the SSL encoder weights)
- **Paper:** none. This is a **pre-release / unpublished** model, so it appears in the Arena's 🔓 *Unpublished / Proprietary* tier (listed but **unranked**, regardless of score).
- **Parameters:** ~318.95 M

The exact wrapper that produced the Arena scores is in [`spectra_aasist3.py`](./spectra_aasist3.py); the vendored network is [`spectra_aasist3_net.py`](./spectra_aasist3_net.py) (copied from the source `model.py`).

## Architecture

1. **wav2vec 2.0 XLS-R-300m front-end**: HF `transformers` `Wav2Vec2Model` (`facebook/wav2vec2-xls-r-300m`), producing 1024-d frame features (the base architecture is fetched at init, then every weight is overwritten by the checkpoint).
2. **MLP bridge**: a single-layer `Linear(1024 → 128)` projection (SELU, dropout 0.1).
3. **KAN-AASIST back-end**: max-pool, a RawNet2-style residual encoder, spectral (GAT-S) and temporal (GAT-T) graph-attention layers with graph pooling, four parallel inference branches with learnable master tokens, and a Kolmogorov-Arnold (KAN) output layer.
4. The 2-logit output is read at **index 1 = bona fide**.

## How scores are produced

- **Input:** raw audio at 16 kHz mono. **Preemphasis (0.97)** is applied to the full waveform (matching the source README eval pipeline), then a **deterministic window of the first 64,600 samples** is taken (~4.04 s; tile-repeated if shorter, no random crop).
- **No resampling** in the wrapper (audio arrives at `expected_sample_rate = 16000`).
- **Output:** 2-class logits; the bona-fide logit (index 1) is the score.
- `batch_size = 24` (throughput plateaus at ~50 utt/s for bs ≥ 16 on an RTX 4070 Ti SUPER).

## Benchmark result (Speech Anti-Spoofing Arena)

Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3). Each result is sha-pinned and reproducible from the score file via `speech-spoof-bench reproduce --scoring`.

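
The EER figures in this section are equal error rates: the operating point where the false-acceptance rate on spoofed trials equals the false-rejection rate on bona fide trials. The following is a minimal sketch of that computation, assuming the score convention stated earlier (higher = more bona fide). `equal_error_rate` is an illustrative helper, not the Arena's scoring code, and `speech-spoof-bench` may treat ties or interpolate between thresholds differently.

```python
import numpy as np


def equal_error_rate(bonafide_scores, spoof_scores):
    """Return the EER as a fraction, for scores where higher = more bona fide."""
    bona = np.sort(np.asarray(bonafide_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    if bona.size == 0 or spoof.size == 0:
        raise ValueError("need at least one bona fide and one spoof score")

    # Candidate thresholds: every observed score, in ascending order.
    thresholds = np.sort(np.concatenate([bona, spoof]))

    # False rejection rate: share of bona fide trials scored below the threshold.
    frr = np.searchsorted(bona, thresholds, side="left") / bona.size
    # False acceptance rate: share of spoof trials scored at or above the threshold.
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size

    # Pick the threshold where the two error rates are closest and average them.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)


if __name__ == "__main__":
    # Toy scores only; these are not Arena results.
    print(equal_error_rate([0.1, 0.9, 0.8, 0.7], [0.2, 0.3, 0.4, 0.85]))
```

Multiplying the returned fraction by 100 gives a percentage comparable in form to the EER % column.
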
| Dataset | Split | EER % | Trials | Skipped | Notes |
|---|---|---|---|---|---|
| CD-ADD | test | **0.00** | 20,786 | 0 | modern neural-TTS deepfake |
| SONAR | test | **0.44** | 3,948 | 0 | multilingual real-world deepfake |
| CVoiceFake_small | test | **0.51** | 138,136 | 0 | multilingual TTS/vocoder deepfake |
| CFAD | test | **0.71** | 62,999 | 0 | Chinese fake-audio detection |
| LibriSeVoc | test | **0.83** | 18,487 | 0 | vocoder-based deepfake |
| ASVspoof2019_LA | test | **0.97** | 71,237 | 0 | in-domain family |
| InTheWild | test | **1.20** | 31,779 | 0 | out-of-domain (real-world) |
| ASVspoof2021_DF | test | **4.30** | 611,829 | 0 | cross-dataset (deepfake) |
| ASVspoof2021_LA | test | **4.38** | 181,566 | 0 | cross-dataset (logical access) |
| ASVspoof5 | test | **15.09** | 680,774 | 0 | adversarial / hardest set |

## Usage

The wrapper loads weights from the Hub via `PyTorchModelHubMixin`:

```python
import numpy as np
from spectra_aasist3 import SpectraAASIST3   # spectra_aasist3.py + spectra_aasist3_net.py

m = SpectraAASIST3()
m.load()                                            # from_pretrained("lab260/Spectra-AASIST3")
audio = np.random.randn(48000).astype(np.float32)   # float32 mono 16 kHz
print(m.score_batch([audio], [16000])[0])           # higher = more bona fide
m.unload()
```

Internally the wrapper applies preemphasis, windows to 64,600 samples, runs the network, and returns `logits[:, 1]` (class 1 = bona fide). [`spectra_aasist3.py`](./spectra_aasist3.py) is the exact `speech_spoof_bench` model that produced the Arena `scores.txt`.

## License

Apache-2.0; see the [source repository](https://huggingface.co/lab260/Spectra-AASIST3).
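
## Preprocessing sketch

The input handling described under "How scores are produced" (preemphasis with coefficient 0.97 on the full waveform, then a deterministic first-64,600-sample window with tile-repeat for short clips) can be approximated as below. This is a sketch under stated assumptions, not the code in `spectra_aasist3.py`: the `preprocess` name is hypothetical, and leaving the first sample unchanged during preemphasis is an assumed convention that the wrapper may not share.

```python
import numpy as np

TARGET_LEN = 64_600   # window length in samples (~4.04 s at 16 kHz), per the card
PREEMPH = 0.97        # preemphasis coefficient, per the card


def preprocess(audio: np.ndarray) -> np.ndarray:
    """Preemphasize a mono 16 kHz waveform, then cut or tile it to TARGET_LEN."""
    x = np.asarray(audio, dtype=np.float32)
    if x.ndim != 1 or x.size == 0:
        raise ValueError("expected a non-empty 1-D mono waveform")

    # Preemphasis over the full waveform: y[n] = x[n] - 0.97 * x[n - 1].
    # Assumption: y[0] = x[0] (no previous sample is available).
    y = np.empty_like(x)
    y[0] = x[0]
    y[1:] = x[1:] - PREEMPH * x[:-1]

    # Short clips are tile-repeated until they cover the window.
    if y.size < TARGET_LEN:
        reps = -(-TARGET_LEN // y.size)   # ceiling division
        y = np.tile(y, reps)

    # Deterministic window: always the first TARGET_LEN samples, no random crop.
    return y[:TARGET_LEN]


if __name__ == "__main__":
    clip = np.random.randn(48_000).astype(np.float32)   # 3 s of noise at 16 kHz
    print(preprocess(clip).shape)
```

Because the window always starts at sample 0, repeated runs on the same file give the same network input, which is consistent with the reproducible scoring described in the benchmark section.
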