wav2vec2-v2-unfrozen: Improved Wav2Vec 2.0 for ASVspoof 2019 LA

The best-performing improved Wav2Vec 2.0 checkpoint from a comparative study of three neural anti-spoofing architectures. It achieves the lowest eval EER (1.55%) and the lowest tandem min t-DCF (0.3966) of all models in the study. Supersedes caa-speech-detection-asvspoof2019/wav2vec2.

Version: wav2vec2_v2 (top-4 transformer blocks unfrozen)

Architecture

facebook/wav2vec2-base with the top 4 transformer blocks unfrozen and a lightweight classification head.

| Component | Value |
|---|---|
| Base model | facebook/wav2vec2-base |
| Encoder | Top 4 transformer blocks trainable (freeze_last_n=4); rest frozen |
| Classification head | Linear(768→256) → GELU → Dropout(0.1) → Linear(256→2) |
| Input | Raw waveform, 16 kHz, padded/truncated to 64 000 samples (4 s) |
| Parameters | ~95 M total; top-4 blocks + head trainable |
| Checkpoint size | ~362 MB |

Encoder masking (mask_time_prob, mask_feature_prob) is disabled at inference, following Tak et al. (2022).
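The head and the partial unfreezing described above can be sketched in plain PyTorch. This is an illustrative sketch, not the repository's implementation: the names `ClassificationHead` and `unfreeze_top_n` are hypothetical, and mean pooling over frames is an assumption, since the card does not say how frame-level features are pooled.

```python
import torch
from torch import nn


class ClassificationHead(nn.Module):
    """Linear(768->256) -> GELU -> Dropout(0.1) -> Linear(256->2), as listed in the card."""

    def __init__(self, in_dim: int = 768, hidden_dim: int = 256,
                 dropout: float = 0.1, n_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, n_classes),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, frames, in_dim).
        # Mean pooling over the frame axis is an assumption, not confirmed by the card.
        pooled = hidden_states.mean(dim=1)
        return self.net(pooled)


def unfreeze_top_n(encoder: nn.Module, blocks: nn.ModuleList, n: int) -> None:
    """Freeze every encoder parameter, then re-enable gradients for the last n blocks."""
    for p in encoder.parameters():
        p.requires_grad = False
    if n > 0:
        for block in blocks[-n:]:
            for p in block.parameters():
                p.requires_grad = True
```

With Hugging Face `transformers`, the transformer blocks of a `Wav2Vec2Model` instance are exposed as `encoder.layers`, so a call might look like `unfreeze_top_n(hf_model, hf_model.encoder.layers, 4)`. Whether the source repository does it this way is not stated in the card.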

Reference: Baevski et al., "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations", NeurIPS 2020.

Training

| Hyperparameter | Value |
|---|---|
| Epochs | 40 (stopped early at 35) |
| Batch size | 8 |
| Learning rate | 1e-5, cosine schedule |
| Weight decay | 1e-4 |
| Gradient clip norm | 1.0 |
| Early stopping patience | 12 |
| Class weights | [3.0, 1.0] |

Dataset: ASVspoof 2019 LA train split (~25k utterances). No data augmentation.
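As a rough illustration, the hyperparameters listed above could be wired together as follows. Several details are assumptions the card does not confirm: AdamW as the optimizer, a per-step (rather than per-epoch) cosine schedule, and a lower-is-better metric for early stopping. The helper names are hypothetical.

```python
import torch
from torch import nn


def build_training_objects(model: nn.Module, total_steps: int):
    """Return (criterion, optimizer, scheduler) using the values from the card."""
    # Weighted cross-entropy; index 0 = bonafide, index 1 = spoof (per the Usage section).
    criterion = nn.CrossEntropyLoss(weight=torch.tensor([3.0, 1.0]))
    # Only parameters left trainable are handed to the optimizer.
    trainable = [p for p in model.parameters() if p.requires_grad]
    # AdamW is an assumption; the card does not name the optimizer.
    optimizer = torch.optim.AdamW(trainable, lr=1e-5, weight_decay=1e-4)
    # Cosine decay over total_steps scheduler steps (step granularity is an assumption).
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
    return criterion, optimizer, scheduler


def train_step(model, inputs, labels, criterion, optimizer, scheduler) -> float:
    """One optimisation step with gradient-norm clipping at 1.0."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    return loss.item()


class EarlyStopping:
    """Stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience: int = 12):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, metric: float) -> bool:
        """Record a lower-is-better metric; return True when training should stop."""
        if metric < self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Here `model` is assumed to map a batch of inputs directly to logits of shape (B, 2); the repository's model returns a dictionary instead, so the loss line would need adapting.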

Results

Baseline to beat: EER 8.09% (LFCC+GMM).

| Split | EER | Tandem min t-DCF | In-the-Wild EER |
|---|---|---|---|
| Dev | 0.197% | - | - |
| Eval | 1.55% | 0.3966 | 27.68% |
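For readers unfamiliar with the metric, the equal error rate (EER) is the operating point where the false-acceptance rate on spoofed trials equals the false-rejection rate on bonafide trials. The function below is a minimal sketch of that definition, assuming higher scores mean "more likely bonafide" (for example the bonafide probability). It is not the official ASVspoof scoring code, which may give slightly different values because it interpolates differently.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores) -> float:
    """Equal error rate as a fraction in [0, 1]; higher score = more bonafide."""
    bonafide = np.sort(np.asarray(bonafide_scores, dtype=float))
    spoof = np.sort(np.asarray(spoof_scores, dtype=float))

    # Candidate thresholds: every observed score, plus +inf (reject everything).
    thresholds = np.concatenate([np.unique(np.concatenate([bonafide, spoof])), [np.inf]])

    # A trial is accepted when score >= threshold.
    # FRR: fraction of bonafide trials with score < threshold.
    frr = np.searchsorted(bonafide, thresholds, side="left") / bonafide.size
    # FAR: fraction of spoof trials with score >= threshold.
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size

    # Take the threshold where the two error rates are closest and average them.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)
```

Multiplying the returned fraction by 100 gives a percentage comparable to the figures in the table.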

Dev EER improved from 4.199% (baseline, fully-frozen encoder) to 0.197%. Eval EER improved from 7.53% to 1.55%, a 79% relative reduction. This is the best result across all three models (LCNN, RawNet2, Wav2Vec 2.0) in the study, in both EER and t-DCF.
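The relative reductions follow directly from the EER figures quoted above. The dev-set percentage is computed here from those same numbers and is not stated in the original card:

```latex
\[
\text{Eval: } \frac{7.53 - 1.55}{7.53} = \frac{5.98}{7.53} \approx 0.794 \quad (\approx 79\%)
\]
\[
\text{Dev: } \frac{4.199 - 0.197}{4.199} = \frac{4.002}{4.199} \approx 0.953 \quad (\approx 95\%)
\]
```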

See learning_curves/wav2vec2_baseline_vs_improved.png for the training trajectory.

Usage

Install dependencies from the source repository, then:

import torch
from src.models.wav2vec2.model import Wav2Vec2Model

config = {
    "pretrained_model": "facebook/wav2vec2-base",
    "freeze_encoder": True,
    "freeze_last_n": 4,
    "hidden_dim": 256,
    "dropout": 0.1,
    "target_samples": 64000,
    "class_weights": [3.0, 1.0],
}

model = Wav2Vec2Model(config)
state = torch.load("best.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()

# waveform: (B, T) float32 at 16 kHz, T = 64000
waveform = torch.zeros(1, 64000)  # placeholder input; replace with real audio
with torch.no_grad():
    logits = model({"frames": waveform})["logits"]
    probs = torch.softmax(logits, dim=-1)  # [:, 0] = bonafide, [:, 1] = spoof
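The snippet above expects fixed-length input. A minimal sketch of the padding/truncation step is shown below. The card only says inputs are "padded/truncated to 64 000 samples", so zero-padding on the right and keeping the first 4 s are assumptions, and `pad_or_truncate` and `make_batch` are hypothetical helper names.

```python
import torch

TARGET_SAMPLES = 64_000  # 4 s at 16 kHz, as stated in the card


def pad_or_truncate(waveform: torch.Tensor, target: int = TARGET_SAMPLES) -> torch.Tensor:
    """Return a (target,) float32 tensor from a mono waveform."""
    waveform = waveform.to(torch.float32).flatten()
    if waveform.numel() >= target:
        # Keep the first `target` samples (assumption: no random cropping).
        return waveform[:target]
    # Zero-pad on the right (assumption: the repository may pad differently).
    pad = target - waveform.numel()
    return torch.nn.functional.pad(waveform, (0, pad))


def make_batch(waveforms) -> torch.Tensor:
    """Stack variable-length mono waveforms into a (B, target) batch."""
    return torch.stack([pad_or_truncate(w) for w in waveforms])
```

The result of `make_batch([...])` has the `(B, T)` shape the usage snippet passes as `{"frames": waveform}`. Audio recorded at another sample rate would first need resampling to 16 kHz, for instance with `torchaudio.functional.resample`.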

Source: github.com/sebastiaoteixeira/caa-ai-generated-speech-detector

Limitations

  • Trained on ASVspoof 2019 LA only. Not validated on physical access or other corpora.
  • In-the-Wild EER of 27.68% indicates limited generalisation beyond the ASVspoof 2019 attack pool.
  • The 362 MB checkpoint size makes this unsuitable for edge or real-time deployment. For resource-constrained scenarios, lcnn_v7_cqt (3.5 MB, 3.26% eval EER) is recommended.
  • No data augmentation was applied; domain shift may degrade performance in real deployments.

Citation

@article{wang2020asvspoof,
  title     = {{ASVspoof} 2019: A Large-Scale Public Database of Synthesized, Converted and Replayed Speech},
  author    = {Wang, Xin and others},
  journal   = {Computer Speech \& Language},
  volume    = {64},
  year      = {2020}
}
@inproceedings{baevski2020wav2vec,
  title     = {wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations},
  author    = {Baevski, Alexei and Zhou, Henry and Mohamed, Abdelrahman and Auli, Michael},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2020}
}