wav2vec2-v2-unfrozen — Improved Wav2Vec 2.0 for ASVspoof 2019 LA

Best improved Wav2Vec 2.0 checkpoint from a comparative study of three neural anti-spoofing architectures. Achieves the best eval EER (1.55%) and best tandem min t-DCF (0.3966) across all models in the study. Supersedes caa-speech-detection-asvspoof2019/wav2vec2.

Version: wav2vec2_v2 (top-4 transformer blocks unfrozen)

Architecture

facebook/wav2vec2-base with the top 4 transformer blocks unfrozen and a lightweight classification head.

Component	Value
Base model	`facebook/wav2vec2-base`
Encoder	Top 4 transformer blocks trainable (`freeze_last_n=4`); rest frozen
Classification head	Linear(768→256) → GELU → Dropout(0.1) → Linear(256→2)
Input	Raw waveform, 16 kHz, padded/truncated to 64 000 samples (4 s)
Parameters	~95 M total; top-4 blocks + head trainable
Checkpoint size	~362 MB
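
For illustration, the partial unfreezing and the head listed in the table could look roughly like the sketch below. It is not the repository's code: `unfreeze_last_n` and `ClassificationHead` are hypothetical names, the encoder is replaced by a stand-in stack of 12 linear layers so the snippet runs without downloading weights, and mean pooling over frames before the head is an assumption.

```python
import torch
from torch import nn


def unfreeze_last_n(blocks: nn.ModuleList, n: int) -> None:
    """Freeze every block, then re-enable gradients for the last n blocks."""
    for p in blocks.parameters():
        p.requires_grad = False
    if n > 0:  # guard: blocks[-0:] would select every block
        for block in blocks[-n:]:
            for p in block.parameters():
                p.requires_grad = True


class ClassificationHead(nn.Module):
    """Linear(768->256) -> GELU -> Dropout(0.1) -> Linear(256->2)."""

    def __init__(self, in_dim=768, hidden_dim=256, dropout=0.1, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, n_classes),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, frames, in_dim).
        # Mean pooling over frames is an assumption of this sketch.
        pooled = hidden_states.mean(dim=1)
        return self.net(pooled)


# Stand-in for the 12 transformer blocks of a base-sized encoder.
blocks = nn.ModuleList([nn.Linear(768, 768) for _ in range(12)])
unfreeze_last_n(blocks, 4)

head = ClassificationHead()
logits = head(torch.randn(2, 199, 768))  # dummy encoder output, 2 utterances

# Indices of the blocks that remain trainable.
trainable_blocks = [
    i for i, b in enumerate(blocks)
    if all(p.requires_grad for p in b.parameters())
]
```

With `n = 4` only the last four stand-in blocks keep `requires_grad=True`, mirroring the `freeze_last_n=4` setting in the table.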

Encoder masking (mask_time_prob, mask_feature_prob) is disabled at inference, following Tak et al. (2022).

Reference: Baevski et al., "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations", NeurIPS 2020.

Training

Hyperparameter	Value
Epochs	40 (stopped early at 35)
Batch size	8
Learning rate	1e-5, cosine schedule
Weight decay	1e-4
Gradient clip norm	1.0
Early stopping patience	12
Class weights	`[3.0, 1.0]`

Dataset: ASVspoof 2019 LA train split (~25k utterances). No data augmentation.
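
A single optimisation step with the hyperparameters above might be wired up as in the following sketch. The optimizer is not stated in this card, so AdamW is an assumption, as is stepping the cosine schedule once per epoch with `T_max=40`. The tiny linear model and random batch are stand-ins, and the class-weight order assumes index 0 is bonafide, as in the usage snippet.

```python
import torch
from torch import nn

model = nn.Linear(16, 2)  # stand-in for the real model

# Class weights [3.0, 1.0]: index 0 assumed to be bonafide, index 1 spoof.
criterion = nn.CrossEntropyLoss(weight=torch.tensor([3.0, 1.0]))

# AdamW is an assumption; the card only gives lr and weight decay.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=40)

features = torch.randn(8, 16)          # batch size 8
labels = torch.randint(0, 2, (8,))

loss = criterion(model(features), labels)
optimizer.zero_grad()
loss.backward()
# Clip the global gradient norm to 1.0 before the update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

scheduler.step()  # assumed to be called once per epoch
lr_after_one_epoch = scheduler.get_last_lr()[0]
```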

Results

Baseline to beat: EER 8.09% (LFCC+GMM).

Split	EER	tandem min t-DCF	In-the-Wild EER
Dev	0.197%	—	—
Eval	1.55%	0.3966	27.68%

Dev EER improved from 4.199% (baseline, fully-frozen encoder) to 0.197%. Eval EER improved from 7.53% to 1.55%, a 79% relative reduction. This is the best result across all three models (LCNN, RawNet2, Wav2Vec 2.0) in the study, both in EER and t-DCF.
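
For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (spoof accepted) equals the false rejection rate (bonafide rejected). The sketch below approximates it with a plain threshold sweep over synthetic scores and also reproduces the relative-reduction arithmetic quoted above. `compute_eer` is a hypothetical helper, not the official ASVspoof scoring script, and the toy scores are made up for illustration.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    Higher score means more bonafide-like. Returns a fraction in [0, 1].
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: bonafide utterances scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed utterances scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores: one bonafide (0.4) and one spoof (0.6) overlap.
bonafide = [0.9, 0.8, 0.7, 0.4]
spoof = [0.1, 0.2, 0.3, 0.6]
eer = compute_eer(bonafide, spoof)

# Relative eval EER reduction: (7.53 - 1.55) / 7.53, roughly 0.79.
relative_reduction = (7.53 - 1.55) / 7.53
```

The sweep is quadratic in the number of scores, which is acceptable for a sketch but not for scoring a full evaluation set.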

See learning_curves/wav2vec2_baseline_vs_improved.png for the training trajectory.

Usage

Install dependencies from the source repository, then:

import torch
from src.models.wav2vec2.model import Wav2Vec2Model

config = {
    "pretrained_model": "facebook/wav2vec2-base",
    "freeze_encoder": True,
    "freeze_last_n": 4,
    "hidden_dim": 256,
    "dropout": 0.1,
    "target_samples": 64000,
    "class_weights": [3.0, 1.0],
}

model = Wav2Vec2Model(config)
state = torch.load("best.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()

# waveform: (B, T) float32 at 16 kHz, T = 64000
with torch.no_grad():
    logits = model({"frames": waveform})["logits"]
    probs = torch.softmax(logits, dim=-1)  # [:, 0] = bonafide, [:, 1] = spoof
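
The snippet above expects inputs of exactly 64 000 samples. One way to bring arbitrary-length audio to that length is sketched below. `fix_length` is a hypothetical helper, and zero right-padding with truncation from the start of the clip is an assumption: the source repository may pad or crop differently.

```python
import torch
import torch.nn.functional as F

TARGET_SAMPLES = 64000  # 4 s at 16 kHz


def fix_length(waveform: torch.Tensor, target: int = TARGET_SAMPLES) -> torch.Tensor:
    """Right-pad with zeros or truncate the last dimension to `target` samples."""
    n = waveform.shape[-1]
    if n >= target:
        return waveform[..., :target]
    # F.pad takes (left, right) padding for the last dimension.
    return F.pad(waveform, (0, target - n))


short = fix_length(torch.randn(1, 48000))  # 3 s clip, padded
long = fix_length(torch.randn(1, 80000))   # 5 s clip, truncated
```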

Source: github.com/sebastiaoteixeira/caa-ai-generated-speech-detector

Limitations

Trained on ASVspoof 2019 LA only. Not validated on physical access or other corpora.
In-the-Wild EER of 27.68% indicates limited generalisation beyond the ASVspoof 2019 attack pool.
The ~362 MB checkpoint size makes this unsuitable for edge or real-time deployment. For resource-constrained scenarios, lcnn_v7_cqt (~3.5 MB, 3.26% eval EER) is recommended.
No data augmentation was applied; domain shift may degrade performance in real deployments.

Citation

@article{wang2020asvspoof,
  title     = {{ASVspoof} 2019: A Large-Scale Public Database of Synthesized, Converted and Replayed Speech},
  author    = {Wang, Xin and others},
  journal   = {Computer Speech \& Language},
  volume    = {64},
  year      = {2020}
}

@inproceedings{baevski2020wav2vec,
  title     = {wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations},
  author    = {Baevski, Alexei and Zhou, Henry and Mohamed, Abdelrahman and Auli, Michael},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2020}
}
