---
license: apache-2.0
tags:
- audio
- anti-spoofing
- audio-deepfake-detection
- speech
- asvspoof
- wav2vec2
- aasist
- kan
---
# Spectra-AASIST3
[](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)
**Spectra-AASIST3** is a speech anti-spoofing model that pairs a **wav2vec 2.0 (XLS-R-300m)**
self-supervised front-end with a **KAN-enhanced AASIST** (KAN-AASIST) back-end. The model
takes a raw speech waveform and returns a score where **higher = more bona fide**.
- **Code / checkpoint:** [`lab260/Spectra-AASIST3`](https://huggingface.co/lab260/Spectra-AASIST3)
  (`model.safetensors`, self-contained: it bundles the SSL encoder weights)
- **Paper:** none. This is a **pre-release / unpublished** model, so it appears in the Arena's
  *Unpublished / Proprietary* tier (listed but **unranked**, regardless of score).
- **Parameters:** ~318.95 M
The exact wrapper that produced the Arena scores is in
[`spectra_aasist3.py`](./spectra_aasist3.py); the vendored network is
[`spectra_aasist3_net.py`](./spectra_aasist3_net.py) (copied from the source `model.py`).
## Architecture
1. **wav2vec 2.0 XLS-R-300m front-end:** HF `transformers` `Wav2Vec2Model`
   (`facebook/wav2vec2-xls-r-300m`), producing 1024-d frame features (the base arch is
   fetched at init, then every weight is overwritten by the checkpoint).
2. **MLP bridge:** a single-layer `Linear(1024 → 128)` projection (SELU, dropout 0.1).
3. **KAN-AASIST back-end:** max-pool, a RawNet2-style residual encoder, spectral (GAT-S)
   and temporal (GAT-T) graph-attention layers with graph pooling, four parallel
   inference branches with learnable master tokens, and a Kolmogorov-Arnold (KAN)
   output layer.
4. The 2-logit output is read at **index 1 = bona fide**.
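
To make the shapes in step 2 concrete, here is a minimal NumPy sketch of the bridge projection at inference time. It assumes the order is linear layer then SELU, and that dropout is inactive at inference; the weights are random placeholders, not the checkpoint's, and `bridge`, `selu`, and the frame count are hypothetical names and values chosen for illustration.

```python
import numpy as np

# Standard SELU constants (Klambauer et al.).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x: np.ndarray) -> np.ndarray:
    """Scaled exponential linear unit, numerically safe for large inputs."""
    negative_part = SELU_ALPHA * np.expm1(np.minimum(x, 0.0))
    return SELU_SCALE * np.where(x > 0, x, negative_part)


def bridge(features: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Project (frames, 1024) SSL features to (frames, 128).

    Assumption: Linear then SELU; dropout (p=0.1) is the identity at inference.
    """
    return selu(features @ weight.T + bias)


rng = np.random.default_rng(0)
frames = 201  # illustrative frame count for one 64,600-sample window
features = rng.standard_normal((frames, 1024)).astype(np.float32)

# Placeholder parameters with the shapes of Linear(1024 -> 128).
weight = (rng.standard_normal((128, 1024)) * 0.02).astype(np.float32)
bias = np.zeros(128, dtype=np.float32)

projected = bridge(features, weight, bias)
print(projected.shape)  # (201, 128)
```

The actual layer lives in [`spectra_aasist3_net.py`](./spectra_aasist3_net.py); this sketch only mirrors the dimensions listed above.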
## How scores are produced
- **Input:** raw audio at 16 kHz mono. **Preemphasis (0.97)** is applied to the full
  waveform (matching the source README eval pipeline), then a **deterministic window of
  the first 64,600 samples** is taken (~4.04 s; shorter clips are tile-repeated, with no random crop).
- **No resampling** in the wrapper (audio arrives at `expected_sample_rate = 16000`).
- **Output:** 2-class logits; the bona-fide logit (index 1) is the score.
- `batch_size = 24` (throughput plateaus at ~50 utt/s for bs ≥ 16 on an RTX 4070 Ti SUPER).
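
The input steps above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the wrapper's code: the function names are hypothetical, and the handling of the first sample in the preemphasis filter (left unchanged here) is an assumption.

```python
import numpy as np

TARGET_LEN = 64_600  # samples, about 4.04 s at 16 kHz
PREEMPH = 0.97


def preemphasis(wave: np.ndarray, coef: float = PREEMPH) -> np.ndarray:
    """First-order high-pass: y[t] = x[t] - coef * x[t-1].

    Assumption: the first sample is passed through unchanged (y[0] = x[0]).
    """
    wave = np.asarray(wave, dtype=np.float32)
    if wave.size == 0:
        return wave
    filtered = np.concatenate([wave[:1], wave[1:] - coef * wave[:-1]])
    return filtered.astype(np.float32)


def fixed_window(wave: np.ndarray, target_len: int = TARGET_LEN) -> np.ndarray:
    """Keep the first target_len samples; tile-repeat shorter clips (no random crop)."""
    if wave.size == 0:
        raise ValueError("cannot window an empty waveform")
    if wave.size >= target_len:
        return wave[:target_len]
    repeats = -(-target_len // wave.size)  # ceiling division
    return np.tile(wave, repeats)[:target_len]


def preprocess(wave: np.ndarray) -> np.ndarray:
    """Preemphasis on the full waveform first, then the deterministic window."""
    return fixed_window(preemphasis(wave))


clip = np.random.randn(48_000).astype(np.float32)  # 3 s of noise at 16 kHz
print(preprocess(clip).shape)  # (64600,)
```

Because the window is deterministic, the same file always yields the same network input, which is what makes the score files reproducible.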
## Benchmark result (Speech Anti-Spoofing Arena)
Evaluated through the reproducible
[Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3).
Each result is sha-pinned and reproducible from the score file via
`speech-spoof-bench reproduce --scoring`.
| Dataset | Split | EER % | Trials | Skipped | Notes |
|---|---|---|---|---|---|
| CD-ADD | test | **0.00** | 20,786 | 0 | modern neural-TTS deepfake |
| SONAR | test | **0.44** | 3,948 | 0 | multilingual real-world deepfake |
| CVoiceFake_small | test | **0.51** | 138,136 | 0 | multilingual TTS/vocoder deepfake |
| CFAD | test | **0.71** | 62,999 | 0 | Chinese fake-audio detection |
| LibriSeVoc | test | **0.83** | 18,487 | 0 | vocoder-based deepfake |
| ASVspoof2019_LA | test | **0.97** | 71,237 | 0 | in-domain family |
| InTheWild | test | **1.20** | 31,779 | 0 | out-of-domain (real-world) |
| ASVspoof2021_DF | test | **4.30** | 611,829 | 0 | cross-dataset (deepfake) |
| ASVspoof2021_LA | test | **4.38** | 181,566 | 0 | cross-dataset (logical access) |
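
For readers unfamiliar with the metric, the equal error rate (EER) in the table is the operating point where the false-acceptance rate on spoofed trials equals the false-rejection rate on bona fide trials. The sketch below shows one simple way to compute it from scores where higher means more bona fide. It is not the Arena's implementation (that is what `speech-spoof-bench reproduce --scoring` runs); `compute_eer` is a hypothetical helper and the scores are synthetic.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for scores where higher = more bona fide.

    A trial is accepted as bona fide when score >= threshold.
    """
    bona = np.sort(np.asarray(bonafide_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    if bona.size == 0 or spoof.size == 0:
        raise ValueError("need at least one bona fide and one spoof score")

    # Candidate thresholds: every observed score, plus +inf (reject everything).
    thresholds = np.unique(np.concatenate([bona, spoof, [np.inf]]))

    # False rejection: bona fide trials scoring below the threshold.
    frr = np.searchsorted(bona, thresholds, side="left") / bona.size
    # False acceptance: spoof trials scoring at or above the threshold.
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size

    # Pick the threshold where the two error rates are closest.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0), float(thresholds[idx])


# Synthetic example: overlapping score distributions, not real model output.
rng = np.random.default_rng(0)
bona = rng.normal(loc=2.0, scale=1.0, size=1000)
spoof = rng.normal(loc=-2.0, scale=1.0, size=1000)
eer, threshold = compute_eer(bona, spoof)
print(f"EER = {100 * eer:.2f} %, threshold = {threshold:.3f}")
```

With finite trial lists the two error curves rarely cross exactly, so this sketch averages them at the closest point; other tools may interpolate instead and report slightly different values.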
## Usage
The wrapper loads weights from the Hub via `PyTorchModelHubMixin`:
```python
import numpy as np
from spectra_aasist3 import SpectraAASIST3 # spectra_aasist3.py + spectra_aasist3_net.py
m = SpectraAASIST3()
m.load() # from_pretrained("lab260/Spectra-AASIST3")
audio = np.random.randn(48000).astype(np.float32) # float32 mono 16 kHz
print(m.score_batch([audio], [16000])[0]) # higher = more bona fide
m.unload()
```
Internally the wrapper applies preemphasis, windows to 64,600 samples, runs the
network, and returns `logits[:, 1]` (class 1 = bona fide). [`spectra_aasist3.py`](./spectra_aasist3.py)
is the exact `speech_spoof_bench` model that produced the Arena `scores.txt`.
## License
Apache-2.0; see the [source repository](https://huggingface.co/lab260/Spectra-AASIST3).