fnaught — lightweight monophonic pitch detection
fnaught (“f-naught”, as f₀ is read aloud) is a 106k-parameter CNN for monophonic fundamental-frequency (f0)
estimation, distributed as a single 428 KB ONNX file. It is designed for
real-time use on CPU and in the browser, with particular attention to accuracy
on low-pitched voices (below ~110 Hz), where the linear bin spacing of a
fixed-resolution STFT (for example 15.625 Hz for N=1024 at 16 kHz) spans more
than two semitones.
- Sample rate: 16 kHz, frame hop 256 samples (16 ms)
- Pitch range: 46.875–2093.75 Hz (200 log-spaced bins, decoded to continuous Hz)
- Input: dual-resolution STFT (N=1024 full-band + N=4096 low-band) log-magnitude plus instantaneous-frequency deviation, 2×184 bins per frame
- Output: per-frame pitch (Hz) and confidence in [0, 1]
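The numbers in the list above can be checked with a few lines of arithmetic. The sketch below assumes the 200 pitch bins are geometrically spaced with both endpoints included (the list only says "log-spaced"), and the helper names are illustrative, not part of the package.

```python
import math

SAMPLE_RATE = 16_000              # Hz, from the spec list
HOP = 256                         # samples per frame hop
F_MIN, F_MAX = 46.875, 2093.75    # Hz, pitch range
N_BINS = 200

# Frame hop in milliseconds: 256 / 16000 s = 16 ms.
hop_ms = 1000 * HOP / SAMPLE_RATE

# Assumption: bins are geometrically spaced and include both endpoints,
# which gives a step of roughly 33 cents between neighbouring bins.
step_cents = 1200 * math.log2(F_MAX / F_MIN) / (N_BINS - 1)
centers_hz = [F_MIN * 2 ** (i * step_cents / 1200) for i in range(N_BINS)]


def stft_bin_hz(n_fft):
    """Linear spacing between adjacent STFT bins, in Hz."""
    return SAMPLE_RATE / n_fft


def semitone_hz(f):
    """Distance in Hz from f to the next semitone above it."""
    return f * (2 ** (1 / 12) - 1)


print(hop_ms)               # 16.0
print(stft_bin_hz(1024))    # 15.625
print(stft_bin_hz(4096))    # 3.90625
print(semitone_hz(80.0))    # roughly 4.8 Hz
```

At 80 Hz a semitone is narrower than one N=1024 bin but wider than one N=4096 bin, which is the motivation for the additional low-band window.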
Usage
JavaScript / TypeScript (npm)
npm install fnaught
import { PitchDetector } from "fnaught";
const detector = await PitchDetector.create();
const { pitchHz, confidence } = await detector.detect(audio, { sampleRate: 44100 });
The npm package implements the exact DSP frontend this model was trained with (verified numerically against the PyTorch implementation) and runs the model through onnxruntime-web in Node or the browser.
Python (onnxruntime)
The graph takes precomputed frontend features (input name: features) of shape
[1, 2, 184, T] and returns logits of shape [1, T, 200], where T is the number
of frames. See the npm package source (src/dsp.ts) or the project repository
for the frontend definition.
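A minimal sketch of calling the graph from Python, assuming the features have already been computed elsewhere and that onnxruntime and NumPy are installed. The decoder here is an assumption: this document does not specify how logits are turned into Hz, so the example uses a softmax-weighted mean of bin indices around the peak on an assumed endpoint-inclusive log grid. It does not compute the confidence value, and the npm package remains the reference for both. The model path and function names are hypothetical.

```python
import math

F_MIN, F_MAX, N_BINS = 46.875, 2093.75, 200


def bin_to_hz(index):
    """Map a (possibly fractional) bin index to Hz on the assumed log grid."""
    return F_MIN * (F_MAX / F_MIN) ** (index / (N_BINS - 1))


def decode_frame(logits, radius=4):
    """Hypothetical decoder: softmax weights restricted to a small window
    around the peak, then a weighted mean of the bin indices."""
    peak = max(range(len(logits)), key=logits.__getitem__)
    lo, hi = max(0, peak - radius), min(len(logits), peak + radius + 1)
    window = logits[lo:hi]
    top = max(window)
    weights = [math.exp(x - top) for x in window]  # numerically stable softmax
    mean_index = sum(w * i for w, i in zip(weights, range(lo, hi))) / sum(weights)
    return bin_to_hz(mean_index)


def run_model(model_path, features):
    """Run the ONNX graph on a float32 NumPy array of shape [1, 2, 184, T]
    and return one decoded pitch (Hz) per frame."""
    import onnxruntime as ort  # imported lazily so the decoder works without it

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name   # expected to be "features"
    logits = session.run(None, {input_name: features})[0]   # [1, T, 200]
    return [decode_frame(frame) for frame in logits[0].tolist()]
```

Usage would look like `run_model("fnaught.onnx", features)`, where the file name is a placeholder for wherever the ONNX file is stored.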
Training
Training data: the public pitch datasets MIR-1K, MDB-stem-synth, and PTDB-TUG, plus synthetic speech generated with a phoneme-level TTS model. Augmentation: background noise (CHiME-Home + Gaussian, SNR 10–30 dB), gain, and ±2-semitone pitch shift. Training uses voiced frames only, with a joint cross-entropy + log-frequency L1 objective.
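The noise and pitch-shift augmentations can be illustrated as follows. This is a sketch of the general technique, not the training code: the function names are made up for the example, the noise is Gaussian only, and the SNR is drawn uniformly from 10–30 dB as an assumption about how the range is sampled.

```python
import math
import random


def power(x):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(signal, noise, snr_db):
    """Scale noise so that 10*log10(P_signal / P_noise) equals snr_db, then add it."""
    scale = math.sqrt(power(signal) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, noise)]


def pitch_shift_ratio(semitones):
    """Frequency ratio for a shift in semitones; f0 labels scale by the same ratio."""
    return 2 ** (semitones / 12)


rng = random.Random(0)
sr = 16_000
tone = [math.sin(2 * math.pi * 100 * t / sr) for t in range(sr)]  # 1 s at 100 Hz
noise = [rng.gauss(0.0, 1.0) for _ in range(sr)]

snr_db = rng.uniform(10, 30)          # assumed uniform sampling of the SNR range
noisy = mix_at_snr(tone, noise, snr_db)

# A ±2-semitone shift moves a 100 Hz label to between about 89 and 112 Hz.
low, high = 100 * pitch_shift_ratio(-2), 100 * pitch_shift_ratio(2)
```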
Evaluation
Scores are the harmonic mean of raw pitch accuracy (RPA), cents accuracy, voicing precision and recall, octave accuracy, and gross-error accuracy, evaluated on three held-out datasets under additive noise (SNR 10–30 dB) and reported as mean ± sd over 5 training seeds:
| Dataset | Score |
|---|---|
| Bach10-mf0-synth | 98.27 ± 0.14 |
| SpeechSynth | 91.06 ± 0.10 |
| Vocadito | 94.95 ± 0.04 |
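To make the aggregate score concrete, the sketch below computes raw pitch accuracy with the conventional 50-cent tolerance and combines several component metrics with a harmonic mean. The component values are hypothetical placeholders, and the benchmark suite's exact metric definitions may differ from this simplified version.

```python
import math
from statistics import harmonic_mean


def cents(f_est, f_ref):
    """Signed distance from f_ref to f_est in cents."""
    return 1200 * math.log2(f_est / f_ref)


def raw_pitch_accuracy(est_hz, ref_hz, tol_cents=50.0):
    """Fraction of reference-voiced frames (ref > 0) whose estimate
    lies within tol_cents of the reference."""
    voiced = [(e, r) for e, r in zip(est_hz, ref_hz) if r > 0]
    hits = sum(1 for e, r in voiced if e > 0 and abs(cents(e, r)) <= tol_cents)
    return hits / len(voiced)


# Hypothetical component scores in [0, 1], one per metric in the aggregate.
components = [0.99, 0.97, 0.98, 0.96, 0.99, 0.98]
score = 100 * harmonic_mean(components)
arithmetic = 100 * sum(components) / len(components)
```

The harmonic mean never exceeds the arithmetic mean and is pulled towards the weakest component, so a model cannot compensate for poor voicing recall, say, with a high RPA.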
Low-pitch behaviour (held-out synthetic speech, ground-truth-voiced frames): raw pitch accuracy at 80–110 Hz roughly doubles relative to a 1024-point single-window STFT baseline of the same size, and octave-error rates drop by an order of magnitude.
Evaluation used the open-source pitch benchmark suite at https://github.com/lars76/pitch-benchmark (datasets, metrics, and noise protocol as defined there).
Limitations
- Monophonic sources only; polyphonic mixtures are out of scope.
- Trained mostly on speech and singing; extreme instrument timbres or pitch outside 47–2094 Hz are not covered.
- Confidence is calibrated for voicing decisions around a 0.9 threshold; it is not a general uncertainty estimate.
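One way to apply the 0.9 voicing threshold to per-frame output is sketched below. The function name and the use of None for unvoiced frames are choices made for this example, as is treating a confidence exactly at the threshold as voiced.

```python
def gate_by_confidence(pitch_hz, confidence, threshold=0.9):
    """Keep the pitch of frames whose confidence reaches the threshold;
    mark all other frames as unvoiced with None."""
    return [f if c >= threshold else None for f, c in zip(pitch_hz, confidence)]


track = gate_by_confidence([100.0, 220.0, 80.0], [0.95, 0.50, 0.90])
print(track)  # [100.0, None, 80.0]
```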
License
MIT. The model was trained on publicly available research datasets. Dataset attributions: MIR-1K, MDB-stem-synth, PTDB-TUG, CHiME-Home.