---
license: mit
tags:
  - audio-classification
  - anti-spoofing
  - asvspoof
  - deepfake-detection
datasets:
  - asvspoof2019
pipeline_tag: audio-classification
metrics:
  - eer
  - t-dcf
---

# wav2vec2-v2-unfrozen — Improved Wav2Vec 2.0 for ASVspoof 2019 LA

Best improved Wav2Vec 2.0 checkpoint from a comparative study of three neural anti-spoofing architectures. Achieves the **best eval EER (1.55%) and best tandem min t-DCF (0.3966)** across all models in the study. Supersedes [caa-speech-detection-asvspoof2019/wav2vec2](https://huggingface.co/caa-speech-detection-asvspoof2019/wav2vec2).

**Version:** wav2vec2_v2 (top-4 transformer blocks unfrozen)

## Architecture

`facebook/wav2vec2-base` with the top 4 transformer blocks unfrozen and a lightweight classification head.
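
The partial unfreezing described above can be sketched in plain PyTorch. This is an illustration, not the repository's code: `unfreeze_top_n` and `ToyEncoder` are hypothetical names, and the toy stack of linear layers only stands in for the 12 transformer blocks (which Hugging Face's `Wav2Vec2Model` keeps in `encoder.layers`).

```python
from torch import nn


def unfreeze_top_n(model: nn.Module, blocks: nn.ModuleList, n: int) -> None:
    """Freeze every parameter of `model`, then re-enable the last `n` blocks."""
    for param in model.parameters():
        param.requires_grad = False
    if n > 0:  # guard: blocks[-0:] would select every block
        for block in list(blocks)[-n:]:
            for param in block.parameters():
                param.requires_grad = True


class ToyEncoder(nn.Module):
    """Hypothetical stand-in: a feature extractor followed by 12 blocks."""

    def __init__(self, num_blocks: int = 12, dim: int = 8):
        super().__init__()
        self.feature_extractor = nn.Linear(dim, dim)
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_blocks)])


toy = ToyEncoder()
unfreeze_top_n(toy, toy.layers, n=4)
# Only parameters of blocks 8..11 remain trainable.
trainable = [name for name, p in toy.named_parameters() if p.requires_grad]
```

With a real encoder, the classification head would also need to stay trainable, for example by calling the helper on the encoder only.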

| Component | Value |
|---|---|
| Base model | `facebook/wav2vec2-base` |
| Encoder | Top 4 transformer blocks trainable (`freeze_last_n=4`); rest frozen |
| Classification head | Linear(768→256) → GELU → Dropout(0.1) → Linear(256→2) |
| Input | Raw waveform, 16 kHz, padded/truncated to 64 000 samples (4 s) |
| Parameters | ~95 M total; top-4 blocks + head trainable |
| Checkpoint size | ~362 MB |

Encoder masking (`mask_time_prob`, `mask_feature_prob`) is disabled at inference, following Tak et al. (2022).
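
As a rough illustration of the classification head listed in the table, the sketch below builds the same layer stack in PyTorch. The class name `SpoofHead` and the mean pooling over time are assumptions made for this example; the source repository may pool encoder frames differently.

```python
import torch
from torch import nn


class SpoofHead(nn.Module):
    """Hypothetical head: Linear(768->256) -> GELU -> Dropout(0.1) -> Linear(256->2)."""

    def __init__(self, in_dim: int = 768, hidden_dim: int = 256, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 2),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, frames, in_dim) encoder output.
        # Assumption: frames are averaged over time before classification.
        pooled = hidden_states.mean(dim=1)
        return self.net(pooled)  # (batch, 2) logits


head = SpoofHead().eval()  # eval() disables dropout
dummy = torch.randn(3, 199, 768)  # dummy encoder output; 199 frames is illustrative
with torch.no_grad():
    logits = head(dummy)
head_params = sum(p.numel() for p in head.parameters())
```

The head holds roughly 0.2 M parameters, which is why almost all of the trainable capacity sits in the unfrozen transformer blocks.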

Reference: Baevski et al., *"wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations"*, NeurIPS 2020.

## Training

| Hyperparameter | Value |
|---|---|
| Epochs | 40 (stopped early at 35) |
| Batch size | 8 |
| Learning rate | 1e-5, cosine schedule |
| Weight decay | 1e-4 |
| Gradient clip norm | 1.0 |
| Early stopping patience | 12 |
| Class weights | `[3.0, 1.0]` |

Dataset: ASVspoof 2019 LA train split (~25k utterances). No data augmentation.
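
For intuition about the cosine schedule in the table, here is a minimal sketch of the learning rate per epoch. It assumes the rate is updated once per epoch, with no warmup, and decays toward zero over the 40 scheduled epochs; the actual training script may step per batch or use a non-zero floor.

```python
import math

BASE_LR = 1e-5      # from the hyperparameter table
TOTAL_EPOCHS = 40   # scheduled epochs (training stopped early at 35)


def cosine_lr(epoch: int, base_lr: float = BASE_LR, total: int = TOTAL_EPOCHS) -> float:
    """Cosine-annealed learning rate at the start of `epoch` (0-indexed)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total))


# Assumed per-epoch schedule: starts at BASE_LR and halves by epoch 20.
schedule = [cosine_lr(e) for e in range(TOTAL_EPOCHS)]
```

Under these assumptions, stopping at epoch 35 means the final updates were made at a small fraction of the initial rate.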

## Results

Baseline to beat: **EER 8.09%** (LFCC+GMM).
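
EER is the error rate at the threshold where false acceptance and false rejection are equal. The sketch below approximates it with NumPy; `compute_eer` is a hypothetical helper, it assumes higher scores mean "more likely bonafide", and it is not the official ASVspoof scoring script, which may interpolate differently.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores) -> float:
    """Approximate EER for scores where higher means 'more likely bonafide'."""
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    # Candidate thresholds: every observed score.
    thresholds = np.sort(np.concatenate([bonafide, spoof]))
    # False rejection rate: bonafide trials scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoof trials scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # Pick the threshold where the two rates are closest and average them.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)


# Toy data: one bonafide and one spoof trial are misranked.
eer_demo = compute_eer([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6])  # 0.25
```

The loop over thresholds is quadratic in the number of trials, so it suits small examples rather than a full evaluation set.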

| Split | EER | tandem min t-DCF | In-the-Wild EER |
|---|---|---|---|
| Dev | 0.197% | — | — |
| **Eval** | **1.55%** | **0.3966** | **27.68%** |

Relative to the Wav2Vec 2.0 baseline with a fully frozen encoder, dev EER improved from 4.199% to 0.197% and eval EER improved from 7.53% to 1.55%, a 79% relative reduction. This is the best result across all three models (LCNN, RawNet2, Wav2Vec 2.0) in the study, in both EER and t-DCF.
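
The relative reductions follow from simple arithmetic on the figures reported in this section; the helper name `relative_reduction` is made up for this example.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate that was removed."""
    return (before - after) / before


eval_rel = relative_reduction(7.53, 1.55)    # eval EER, about 0.794
dev_rel = relative_reduction(4.199, 0.197)   # dev EER, about 0.953
```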

See `learning_curves/wav2vec2_baseline_vs_improved.png` for the training trajectory.

## Usage

Install dependencies from the source repository, then:

```python
import torch
from src.models.wav2vec2.model import Wav2Vec2Model

config = {
    "pretrained_model": "facebook/wav2vec2-base",
    "freeze_encoder": True,   # encoder frozen, except the blocks released below
    "freeze_last_n": 4,       # top 4 transformer blocks stay trainable
    "hidden_dim": 256,
    "dropout": 0.1,
    "target_samples": 64000,
    "class_weights": [3.0, 1.0],
}

model = Wav2Vec2Model(config)
state = torch.load("best.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()

# waveform: (B, T) float32 at 16 kHz, T = 64000
waveform = torch.zeros(1, 64000)  # placeholder; replace with real audio
with torch.no_grad():
    logits = model({"frames": waveform})["logits"]
    probs = torch.softmax(logits, dim=-1)  # [:, 0] = bonafide, [:, 1] = spoof
```
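
The usage snippet expects each clip to contain exactly 64 000 samples. A small helper such as the one below can enforce that before batching. The name `fix_length`, right-side zero padding, and truncation from the start of the clip are assumptions; the source repository may crop or pad differently.

```python
import numpy as np

TARGET_SAMPLES = 64_000  # 4 s at 16 kHz


def fix_length(waveform: np.ndarray, target: int = TARGET_SAMPLES) -> np.ndarray:
    """Return a 1-D float32 array of exactly `target` samples."""
    waveform = np.asarray(waveform, dtype=np.float32)
    if waveform.shape[0] >= target:
        return waveform[:target]  # assumption: keep the first `target` samples
    pad = target - waveform.shape[0]
    return np.pad(waveform, (0, pad))  # assumption: zero-pad on the right


short_clip = fix_length(np.ones(16_000))   # 1 s clip, padded to 4 s
long_clip = fix_length(np.ones(100_000))   # 6.25 s clip, truncated to 4 s
```

The resulting arrays can be stacked and converted with `torch.from_numpy` to form the `(B, T)` batch.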

Source: [github.com/sebastiaoteixeira/caa-ai-generated-speech-detector](https://github.com/sebastiaoteixeira/caa-ai-generated-speech-detector)

## Limitations

- Trained on ASVspoof 2019 LA only. Not validated on physical access or other corpora.
- In-the-Wild EER of 27.68% indicates limited generalisation beyond the ASVspoof 2019 attack pool.
- The ~362 MB checkpoint size makes this unsuitable for edge or real-time deployment. For resource-constrained scenarios, `lcnn_v7_cqt` (~3.5 MB, 3.26% eval EER) is recommended.
- No data augmentation was applied; domain shift may degrade performance in real deployments.

## Citation

```bibtex
@article{wang2020asvspoof,
  title   = {{ASVspoof} 2019: A Large-Scale Public Database of Synthesized, Converted and Replayed Speech},
  author  = {Wang, Xin and others},
  journal = {Computer Speech \& Language},
  volume  = {64},
  year    = {2020}
}
```

```bibtex
@inproceedings{baevski2020wav2vec,
  title     = {wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations},
  author    = {Baevski, Alexei and Zhou, Henry and Mohamed, Abdelrahman and Auli, Michael},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2020}
}
```