voice-gender-classifier-onnx-q8

ONNX export with dynamic q8 (UInt8, weight-only, per-channel) quantization of JaesungHuh/voice-gender-classifier, prepared for Transformers.js inference in browser and mobile contexts.

The original is a binary voice-gender classifier built on the ECAPA-TDNN architecture (Desplanques et al. 2020), fine-tuned on VoxCeleb2. This repo packages it for client-side deployment: 15 MB single ONNX file (vs ~62 MB FP32 safetensors), suitable for first-load over typical residential bandwidth and within mobile WASM-runtime memory budgets.

Used in Syrinx, an open-source voice-training tool. The model is MIT licensed and not Syrinx-specific: anyone is welcome to use it for any purpose under the MIT terms.

Architecture

ECAPA-TDNN with C=1024 channels:

Conv1d feature extractor (80 → 1024, kernel=5)
Three Bottle2neck Res2Net blocks with SE attention (1024-wide, scale=8)
Multi-frame attention pooling (3 × 1024 → 1536, attention over time)
BatchNorm + Linear (3072 → 192 → 2)

Input: raw mono float32 audio at 16 kHz, shape (batch, time). Output: pre-softmax logits, shape (batch, 2) over {0: male, 1: female}.
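
When the ONNX graph is run directly rather than through the Transformers.js pipeline, the logits still need a softmax to become probabilities. A minimal sketch, assuming a single row of two logits in the index order given above; the helper names `softmax` and `toLabelledScores` are illustrative and not part of any library:

```javascript
// Convert one row of pre-softmax logits into probabilities.
function softmax(logits) {
  const max = Math.max(...logits); // subtract the max for numerical stability
  const exps = logits.map((v) => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((v) => v / sum);
}

// Index order follows the model card: 0 = male, 1 = female.
const LABELS = ["male", "female"];

// Pair each probability with its label, highest score first.
function toLabelledScores(logits) {
  const probs = softmax(logits);
  return LABELS.map((label, i) => ({ label, score: probs[i] }))
    .sort((a, b) => b.score - a.score);
}
```

The result has the same shape as the pipeline output shown in the usage section, so downstream code can treat both paths the same way.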

The model includes its own log-mel preprocessing (logtorchfbank) baked into the ONNX graph — STFT (opset 17+), mel filterbank, log, per-frame mean normalization. Set do_normalize: false in the preprocessor config (already set in this repo's preprocessor_config.json) to avoid double-normalizing the input — the wav2vec2-style audio-side normalization that Transformers.js applies by default would distort the model's internal mel-side normalization.
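
To make concrete what `do_normalize: false` switches off, here is a sketch of the zero-mean, unit-variance waveform normalization that wav2vec2-style feature extractors typically apply. It is an illustration under that assumption, not a copy of the Transformers.js source, and `zeroMeanUnitVar` is a hypothetical name:

```javascript
// Illustration only: waveform normalization of the kind that
// do_normalize: true would apply, and that this model should NOT receive.
function zeroMeanUnitVar(samples, eps = 1e-7) {
  let mean = 0;
  for (const s of samples) mean += s;
  mean /= samples.length;

  let variance = 0;
  for (const s of samples) variance += (s - mean) ** 2;
  variance /= samples.length;

  const std = Math.sqrt(variance + eps);
  // Every sample is shifted and rescaled, so the absolute signal level changes.
  return Float32Array.from(samples, (s) => (s - mean) / std);
}
```

Because the rescaling changes the absolute level of the waveform before it reaches the embedded log-mel front end, the graph would see inputs at a different scale than the raw audio it expects.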

About the wav2vec2 tag on this model card (HF Hub auto-tagging from config.json): the architecture is genuinely ECAPA-TDNN, not wav2vec2. The model_type: "wav2vec2" field is set deliberately so Transformers.js's audio-classification pipeline routes the model through its Wav2Vec2ForSequenceClassification JS class, which is a thin ONNX wrapper that runs the embedded graph regardless of architecture. Transformers.js doesn't have an ecapa-tdnn model_type registered, and an unrecognized model_type would cause pipeline() to fail at load time. The tag is a downstream consequence of this routing necessity, not architectural drift.

Usage with Transformers.js

import { pipeline } from "@huggingface/transformers";

const classifier = await pipeline(
  "audio-classification",
  "alice-sabrina-ivy/voice-gender-classifier-onnx-q8",
  { dtype: "q8" }
);

// audio: Float32Array of mono samples at 16 kHz
const result = await classifier(audio, { sampling_rate: 16000 });
// → [{ label: "female", score: 0.95 }, { label: "male", score: 0.05 }]
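
The pipeline call above expects mono samples at 16 kHz. If the source audio is stereo or at another rate, it has to be converted first. The following is a rough sketch of that conversion using a naive linear-interpolation resampler with no anti-aliasing filter; the function names are illustrative, and a production app would likely prefer a proper resampler such as the one built into the Web Audio API:

```javascript
// Average two channels into one mono channel.
function downmixToMono(left, right) {
  const mono = new Float32Array(left.length);
  for (let i = 0; i < left.length; i++) mono[i] = 0.5 * (left[i] + right[i]);
  return mono;
}

// Naive linear-interpolation resampler (no anti-aliasing filter).
function resampleLinear(samples, fromRate, toRate = 16000) {
  if (fromRate === toRate) return Float32Array.from(samples);
  const outLength = Math.floor((samples.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;                          // fractional source index
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, samples.length - 1);
    const frac = pos - i0;
    out[i] = samples[i0] * (1 - frac) + samples[i1] * frac;
  }
  return out;
}
```

For example, `resampleLinear(downmixToMono(left, right), 48000)` would yield a `Float32Array` suitable for the `classifier(audio, { sampling_rate: 16000 })` call.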

Performance

Measured on the Hillenbrand 1995 vowel corpus¹ (93 speakers, 12 vowels per speaker concatenated into ~7 s recordings, rolling 0.75 s window at 150 ms hop, EMA α=0.2):

metric	value
Female accuracy	95.8 % (46/48 speakers)
Male accuracy	95.6 % (43/45 speakers)
Within-speaker raw_std (median, female)	0.196
Within-speaker raw_std (median, male)	0.216
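
The evaluation settings above (0.75 s window, 150 ms hop, EMA α=0.2) can be sketched as follows. This is an illustration of the windowing and smoothing arithmetic, not the actual harness; in particular, seeding the EMA with the first raw value is an assumption, and the helper names are hypothetical:

```javascript
const SAMPLE_RATE = 16000;
const WINDOW = Math.round(0.75 * SAMPLE_RATE); // 12000 samples
const HOP = Math.round(0.15 * SAMPLE_RATE);    // 2400 samples

// Start offsets of every full window that fits in the recording.
function windowStarts(numSamples, windowSize = WINDOW, hop = HOP) {
  const starts = [];
  for (let s = 0; s + windowSize <= numSamples; s += hop) starts.push(s);
  return starts;
}

// Exponential moving average over per-window scores.
// Seeding with the first value is an assumption of this sketch.
function ema(values, alpha = 0.2) {
  const out = [];
  let prev;
  for (const v of values) {
    prev = prev === undefined ? v : alpha * v + (1 - alpha) * prev;
    out.push(prev);
  }
  return out;
}
```

Each start offset would select `audio.subarray(start, start + WINDOW)` as the classifier input, and the resulting per-window scores would then be passed through `ema`.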

Inference time (q8, single 0.75 s window):

runtime	median	p95
Node native ORT (`onnxruntime-node`, Hillenbrand harness)	~11 ms	~14 ms
Chrome 147 desktop, `onnxruntime-web` + WebGPU	~191 ms	~222 ms
Chrome 147 mobile (Pixel 8 Pro), `onnxruntime-web` + WebGPU	~460 ms	~536 ms

The Node-native number is what the Hillenbrand accuracy harness above runs on, but it is roughly 18 × faster than the desktop browser median and roughly 40 × faster than the mobile one. For browser-deployment ship decisions, use the browser-runtime numbers, not the Node number. The discrepancy comes from onnxruntime-node using native bindings, whereas onnxruntime-web runs ORT compiled to WASM.

Both browser numbers were measured with WebGPU available. The typical Transformers.js path tries WebGPU first and falls back to WASM CPU if WebGPU isn't usable for this model on this device. Browsers without WebGPU support (older Chrome, current Firefox/Safari at time of writing) always take the WASM CPU path; performance there is unmeasured but presumably slower.

The mobile-to-desktop slowdown in the browser (~2.4 ×) is smaller than what transformer-class architectures show in the same testbed (e.g., wav2vec2-base runs ~4.5 × slower on mobile than on desktop with Chrome 147 / WebGPU). ECAPA-TDNN appears to map better onto mobile WebGPU compute shaders than attention-heavy transformer architectures do.
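
The speed ratios discussed in this section can be recomputed from the median column of the inference table. The sketch below uses the rounded table values, so results may differ slightly from ratios derived from unrounded measurements:

```javascript
// Median latencies (ms) copied from the inference-time table.
const medianMs = { node: 11, desktopWebGpu: 191, mobileWebGpu: 460 };

// Ratio of two latencies, rounded to one decimal place.
const ratio = (slow, fast) => Math.round((slow / fast) * 10) / 10;

const desktopVsNode = ratio(medianMs.desktopWebGpu, medianMs.node);
const mobileVsNode = ratio(medianMs.mobileWebGpu, medianMs.node);
const mobileVsDesktop = ratio(medianMs.mobileWebGpu, medianMs.desktopWebGpu);

console.log({ desktopVsNode, mobileVsNode, mobileVsDesktop });
```

With the rounded inputs, the desktop-vs-Node ratio comes out a little under 18, which is consistent with the Node median being an approximate (~11 ms) figure.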

¹ Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acoustic characteristics of American English vowels. The Journal of the Acoustical Society of America, 97(5), 3099–3111.

Limitations & biases

Binary classification only. This model emits only male/female logits and cannot represent non-binary or unspecified gender identity. For voice training tools, this is a feature limitation worth disclosing to users.
Trained on VoxCeleb2, which has demographic imbalance — overrepresentation of European/American English-language speakers, less coverage of voices outside that distribution. Per JaesungHuh's original model card: "the model may not represent global population diversity." Expect degraded accuracy on voices that fall outside VoxCeleb2's distribution.
Mechanically borderline samples exist in any binary gender classifier. On the Hillenbrand corpus, m45 (rawMean 0.621, model reads weakly female despite truth=male) and w46 (rawMean 0.379, model reads weakly male despite truth=female) are persistently misclassified across α values — the model's average opinion is wrong on these specific voices, not a smoothing artifact. Voices with similar characteristics will see consistent wrong feedback.

Quantization

The q8 ONNX file is a derivative work, produced via:

PyTorch → ONNX export with torch.onnx.export (opset 18, dynamic axes for batch and time dimensions)
The upstream logtorchfbank constructs torchaudio.transforms.MelSpectrogram per-forward, which trips torch.export's data-dependent guard. Patched by lifting MelSpectrogram and the preemphasis kernel into pre-instantiated members on the model before tracing — numerically identical to upstream, just instantiated once instead of per-call.
Constant folding via onnxsim
Dynamic quantization via onnxruntime.quantization.quantize_dynamic with QuantType.QUInt8, per_channel=True, reduce_range=False
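
To illustrate what "UInt8, weight-only, per-channel" means numerically, the sketch below quantizes each row of a weight matrix with its own scale and zero point over the full 0..255 range. It shows the general affine-quantization arithmetic only; it is not ONNX Runtime's implementation, whose exact rounding and range handling may differ, and all function names are hypothetical:

```javascript
// Quantize one output channel (one row of weights) to UInt8
// with an asymmetric affine mapping over the full 0..255 range.
function quantizeChannel(weights) {
  // Extend the range to include 0 so that zero is exactly representable.
  const min = Math.min(0, ...weights);
  const max = Math.max(0, ...weights);
  const scale = (max - min) / 255 || 1; // guard against all-zero rows
  const zeroPoint = Math.round(-min / scale);
  const q = Uint8Array.from(weights, (w) =>
    Math.min(255, Math.max(0, Math.round(w / scale) + zeroPoint))
  );
  return { q, scale, zeroPoint };
}

// Map stored UInt8 values back to approximate float weights.
function dequantizeChannel({ q, scale, zeroPoint }) {
  return Float32Array.from(q, (v) => (v - zeroPoint) * scale);
}

// Per-channel: every row gets its own scale and zero point.
function quantizePerChannel(rows) {
  return rows.map(quantizeChannel);
}
```

Giving each channel its own scale keeps the rounding error proportional to that channel's weight range, which is why a channel with small weights is not swamped by a channel with large ones.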

Numerical parity (4 Hillenbrand samples, max |Δ| from PyTorch reference):

sample	FP32 ONNX	q8 ONNX
m01ae	0	0.10
m20ae	0	0.11
w01ae	0	0.01
w20ae	0	0.03

The FP32 ONNX export is bit-exact to PyTorch on these samples. q8 deviates by up to ~0.11 in probability space, with all argmax decisions preserved.
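
A parity check of this kind boils down to two measurements per sample: the largest absolute difference between the reference and candidate outputs, and whether the predicted class is unchanged. A minimal sketch with hypothetical helper names (the actual verification script lives in the Syrinx repository and may be structured differently):

```javascript
// Largest absolute difference between two equally sized score vectors.
function maxAbsDelta(reference, candidate) {
  let worst = 0;
  for (let i = 0; i < reference.length; i++) {
    worst = Math.max(worst, Math.abs(reference[i] - candidate[i]));
  }
  return worst;
}

// Index of the largest entry, i.e. the predicted class.
function argmax(values) {
  let best = 0;
  for (let i = 1; i < values.length; i++) {
    if (values[i] > values[best]) best = i;
  }
  return best;
}

// Summarize parity between a reference output and a candidate output.
function compareOutputs(reference, candidate) {
  return {
    maxAbsDelta: maxAbsDelta(reference, candidate),
    sameDecision: argmax(reference) === argmax(candidate),
  };
}
```

For instance, comparing made-up probabilities `[0.9, 0.1]` and `[0.8, 0.2]` reports a difference of about 0.1 with the decision preserved.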

Conversion scripts live in the Syrinx repository (export-jaesunghuh-onnx.py, quantize-jaesunghuh-onnx.py, verify-jaesunghuh-onnx.py) and are reproducible from the upstream weights.

Citation

If you use this model, please cite the original ECAPA-TDNN paper:

@inproceedings{desplanques2020ecapa,
  title={ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification},
  author={Desplanques, Brecht and Thienpondt, Jenthe and Demuynck, Kris},
  booktitle={Interspeech 2020},
  pages={3830--3834},
  year={2020},
  doi={10.21437/Interspeech.2020-2650},
  url={https://arxiv.org/abs/2005.07143}
}

And acknowledge JaesungHuh's voice-gender-classifier fine-tune:

Repo: https://github.com/JaesungHuh/voice-gender-classifier
Model card: https://huggingface.co/JaesungHuh/voice-gender-classifier

License

MIT, inherited from JaesungHuh's original. The architecture code (in turn derived from TaoRuijie's ECAPA-TDNN) is also MIT licensed. This ONNX-quantized derivative continues under MIT.

MIT License

Copyright (c) 2024 JaesungHuh
Copyright (c) 2024 Tao Ruijie (original ECAPA-TDNN implementation)
Copyright (c) 2026 Alice Sabrina Ivy (ONNX export + q8 quantization)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Original work

Architecture: TaoRuijie/ECAPA-TDNN (Tao Ruijie, MIT)
Fine-tune: JaesungHuh/voice-gender-classifier (Jaesung Huh, MIT) — VoxCeleb2 fine-tune for binary gender classification, reports 98.7 % on VoxCeleb1 test split
ONNX export + q8 quantization: this repo
