---
license: apache-2.0
tags:
  - audio
  - anti-spoofing
  - audio-deepfake-detection
  - speech
  - asvspoof
  - wav2vec2
  - aasist
  - kan
---

# Spectra-AASIST3

[![EER% 0.00 on CD-ADD](https://img.shields.io/badge/EER%25%20on%20CD--ADD-0.00%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)
[![EER% 0.44 on SONAR](https://img.shields.io/badge/EER%25%20on%20SONAR-0.44%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)
[![EER% 0.51 on CVoiceFake_small](https://img.shields.io/badge/EER%25%20on%20CVoiceFake__small-0.51%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)
[![EER% 0.71 on CFAD](https://img.shields.io/badge/EER%25%20on%20CFAD-0.71%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)
[![EER% 0.83 on LibriSeVoc](https://img.shields.io/badge/EER%25%20on%20LibriSeVoc-0.83%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)
[![EER% 0.97 on ASVspoof2019_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2019__LA-0.97%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)
[![EER% 1.20 on InTheWild](https://img.shields.io/badge/EER%25%20on%20InTheWild-1.20%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)
[![EER% 4.30 on ASVspoof2021_DF](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__DF-4.30%25-green)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)
[![EER% 4.38 on ASVspoof2021_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__LA-4.38%25-green)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)
[![arena tier](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/spectra-aasist3/tier.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)
[![arena rank](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/spectra-aasist3/rank.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3)

**Spectra-AASIST3** is a speech anti-spoofing model that pairs a **wav2vec 2.0 (XLS-R-300m)**
self-supervised front-end with a **KAN-enhanced AASIST** (KAN-AASIST) back-end. The model
takes a raw speech waveform and returns a score where **higher = more bona fide**.

- **Code / checkpoint:** [`lab260/Spectra-AASIST3`](https://huggingface.co/lab260/Spectra-AASIST3)
  (`model.safetensors`, self-contained; bundles the SSL encoder weights)
- **Paper:** none. This is a **pre-release / unpublished** model, so it appears in the Arena's
  🔓 *Unpublished / Proprietary* tier (listed but **unranked**, regardless of score).
- **Parameters:** ~318.95 M

The exact wrapper that produced the Arena scores is in
[`spectra_aasist3.py`](./spectra_aasist3.py); the vendored network is
[`spectra_aasist3_net.py`](./spectra_aasist3_net.py) (copied from the source `model.py`).

## Architecture

1. **wav2vec 2.0 XLS-R-300m front-end:** HF `transformers` `Wav2Vec2Model`
   (`facebook/wav2vec2-xls-r-300m`), producing 1024-d frame features (the base arch is
   fetched at init, then every weight is overwritten by the checkpoint).
2. **MLP bridge:** a single-layer `Linear(1024 → 128)` projection (SELU, dropout 0.1).
3. **KAN-AASIST back-end:** max-pool, a RawNet2-style residual encoder, spectral (GAT-S)
   and temporal (GAT-T) graph-attention layers with graph pooling, four parallel
   inference branches with learnable master tokens, and a Kolmogorov-Arnold (KAN)
   output layer.
4. The 2-logit output is read at **index 1 = bona fide**.
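
The readout in step 4 can be illustrated with a small NumPy sketch. The function names `bonafide_scores` and `bonafide_probability` are hypothetical and not part of the wrapper; the softmax variant is shown only for comparison, since the score described here is the raw index-1 logit.

```python
import numpy as np


def bonafide_scores(logits) -> np.ndarray:
    """Return the raw bona fide logit (column 1) for each utterance."""
    logits = np.asarray(logits, dtype=np.float32)
    if logits.ndim != 2 or logits.shape[1] != 2:
        raise ValueError("expected logits of shape (batch, 2)")
    return logits[:, 1]


def bonafide_probability(logits) -> np.ndarray:
    """Optional softmax over the two logits (not what the model card reports)."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return (exp / exp.sum(axis=1, keepdims=True))[:, 1]


example = [[0.5, 2.0], [1.0, -1.0]]   # made-up logits: [spoof, bona fide]
print(bonafide_scores(example))       # raw index-1 logits
print(bonafide_probability(example))  # softmax probability of class 1
```

Note that the two readouts can rank utterances differently: the softmax probability depends on the difference between the two logits, while the raw score uses only the index-1 logit.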

## How scores are produced

- **Input:** raw audio at 16 kHz mono. **Preemphasis (0.97)** is applied to the full
  waveform (matching the source README eval pipeline), then a **deterministic window of the
  first 64,600 samples** is taken (~4.04 s; shorter inputs are tile-repeated, with no random crop).
- **No resampling** in the wrapper (audio arrives at `expected_sample_rate = 16000`).
- **Output:** 2-class logits; the bona-fide logit (index 1) is the score.
- `batch_size = 24` (throughput plateaus at ~50 utt/s for batch sizes ≥ 16 on an RTX 4070 Ti SUPER).
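
The input preparation above can be sketched in NumPy as follows. This is a minimal illustration, not the wrapper's code: the helper names are hypothetical, and the filter form `y[n] = x[n] - 0.97 * x[n-1]` with the first sample passed through unchanged is an assumed (though common) preemphasis convention.

```python
import numpy as np

TARGET_LEN = 64_600  # ~4.04 s at 16 kHz, as stated above


def preemphasize(x: np.ndarray, coef: float = 0.97) -> np.ndarray:
    """Assumed convention: y[0] = x[0], y[n] = x[n] - coef * x[n-1]."""
    x = np.asarray(x, dtype=np.float32)
    if x.size == 0:
        return x
    return np.concatenate([x[:1], x[1:] - coef * x[:-1]]).astype(np.float32)


def fixed_window(x: np.ndarray, target_len: int = TARGET_LEN) -> np.ndarray:
    """Keep the first target_len samples; tile-repeat shorter inputs."""
    if x.size == 0:
        raise ValueError("cannot window an empty waveform")
    if x.size >= target_len:
        return x[:target_len]
    reps = -(-target_len // x.size)  # ceiling division
    return np.tile(x, reps)[:target_len]


def prepare(x: np.ndarray) -> np.ndarray:
    """Preemphasis on the full waveform first, then the deterministic window."""
    return fixed_window(preemphasize(x))


short = np.random.randn(48_000).astype(np.float32)  # 3 s, shorter than the window
print(prepare(short).shape)
```

Because the window is deterministic, scoring the same file twice yields the same input tensor, which is what makes the score files reproducible.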

## Benchmark result (Speech Anti-Spoofing Arena)

Evaluated through the reproducible
[Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=spectra-aasist3).
Each result is sha-pinned and reproducible from the score file via
`speech-spoof-bench reproduce --scoring`.

| Dataset | Split | EER % | Trials | Skipped | Notes |
|---|---|---|---|---|---|
| CD-ADD | test | **0.00** | 20,786 | 0 | modern neural-TTS deepfake |
| SONAR | test | **0.44** | 3,948 | 0 | multilingual real-world deepfake |
| CVoiceFake_small | test | **0.51** | 138,136 | 0 | multilingual TTS/vocoder deepfake |
| CFAD | test | **0.71** | 62,999 | 0 | Chinese fake-audio detection |
| LibriSeVoc | test | **0.83** | 18,487 | 0 | vocoder-based deepfake |
| ASVspoof2019_LA | test | **0.97** | 71,237 | 0 | in-domain family |
| InTheWild | test | **1.20** | 31,779 | 0 | out-of-domain (real-world) |
| ASVspoof2021_DF | test | **4.30** | 611,829 | 0 | cross-dataset (deepfake) |
| ASVspoof2021_LA | test | **4.38** | 181,566 | 0 | cross-dataset (logical access) |
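
For readers unfamiliar with the metric, the equal error rate (EER) in the table is the operating point where the false-acceptance and false-rejection rates coincide. The sketch below is a simple threshold sweep under the "higher = more bona fide" convention; it is not the Arena's implementation, whose tie handling and interpolation are not described here, and `compute_eer` is a hypothetical name.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores) -> float:
    """Approximate EER in percent for scores where higher = more bona fide."""
    bona = np.sort(np.asarray(bonafide_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    if bona.size == 0 or spoof.size == 0:
        raise ValueError("need at least one bona fide and one spoof score")
    # Candidate thresholds: every observed score, plus +inf (reject everything).
    thresholds = np.append(np.unique(np.concatenate([bona, spoof])), np.inf)
    # A trial is accepted as bona fide when score >= threshold.
    # searchsorted(..., side="left") counts scores strictly below each threshold.
    frr = np.searchsorted(bona, thresholds, side="left") / bona.size
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size
    idx = int(np.argmin(np.abs(frr - far)))
    return 100.0 * (frr[idx] + far[idx]) / 2.0


# Made-up scores for illustration only.
print(compute_eer([1.0, 2.0, 3.0, 4.0], [0.0, 1.5, 0.5, 2.5]))
```

An EER of 0.00 % therefore means that some threshold separates every bona fide trial from every spoofed trial in that test split.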

## Usage

The wrapper loads weights from the Hub via `PyTorchModelHubMixin`:

```python
import numpy as np
from spectra_aasist3 import SpectraAASIST3   # spectra_aasist3.py + spectra_aasist3_net.py

m = SpectraAASIST3()
m.load()                                          # from_pretrained("lab260/Spectra-AASIST3")
audio = np.random.randn(48000).astype(np.float32) # float32 mono 16 kHz
print(m.score_batch([audio], [16000])[0])         # higher = more bona fide
m.unload()
```

Internally the wrapper applies preemphasis, windows to 64,600 samples, runs the
network, and returns `logits[:, 1]` (class 1 = bona fide). [`spectra_aasist3.py`](./spectra_aasist3.py)
is the exact `speech_spoof_bench` model that produced the Arena `scores.txt`.
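
To score a real recording instead of random noise, the audio has to reach the wrapper as float32 mono at 16 kHz, since no resampling is performed. The following loader is a minimal sketch using only the standard library and NumPy; `load_wav_16k_mono` is a hypothetical helper, it assumes 16-bit PCM WAV input, and it rejects other sample rates rather than resampling.

```python
import wave

import numpy as np

EXPECTED_SR = 16_000  # matches expected_sample_rate in the wrapper


def load_wav_16k_mono(path: str) -> np.ndarray:
    """Read a 16-bit PCM WAV file as float32 mono in [-1, 1)."""
    with wave.open(path, "rb") as wf:
        sample_rate = wf.getframerate()
        n_channels = wf.getnchannels()
        sample_width = wf.getsampwidth()
        frames = wf.readframes(wf.getnframes())
    if sample_width != 2:
        raise ValueError("this sketch only handles 16-bit PCM")
    if sample_rate != EXPECTED_SR:
        raise ValueError(f"expected {EXPECTED_SR} Hz, got {sample_rate} Hz")
    pcm = np.frombuffer(frames, dtype="<i2").reshape(-1, n_channels)
    # Scale to [-1, 1) and average the channels down to mono.
    return (pcm.astype(np.float32) / 32768.0).mean(axis=1)


# Usage with the wrapper from the example above (the path is a placeholder):
# audio = load_wav_16k_mono("utterance.wav")
# print(m.score_batch([audio], [EXPECTED_SR])[0])
```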

## License

Apache-2.0; see the [source repository](https://huggingface.co/lab260/Spectra-AASIST3).