fnaught — lightweight monophonic pitch detection

fnaught (“f-naught”, as f₀ is read aloud) is a 106k-parameter CNN for monophonic fundamental-frequency (f0) estimation, distributed as a single 428 KB ONNX file. It is designed for realtime use on CPU and in the browser, with particular attention to accuracy on low-pitched voices (below ~110 Hz), where fixed-resolution STFT frontends lose precision.

  • Sample rate: 16 kHz, frame hop 256 samples (16 ms)
  • Pitch range: 46.875–2093.75 Hz (200 log-spaced bins, decoded to continuous Hz)
  • Input: dual-resolution STFT (N=1024 full-band + N=4096 low-band) log-magnitude plus instantaneous-frequency deviation, 2×184 bins per frame
  • Output: per-frame pitch (Hz) and confidence in [0, 1]
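The numbers in the list above follow from simple arithmetic, and working them through shows why the second, longer STFT window matters at low pitch. The sketch below is illustrative only: the function names are not part of the fnaught package, and the bin-width calculation assumes the 200 pitch bins are evenly spaced in log-frequency with the two range limits as the first and last bin centres.

```typescript
// Illustrative arithmetic only; none of these names come from the fnaught package.
const SAMPLE_RATE = 16000; // Hz, from the spec list
const HOP = 256;           // samples per frame hop

// Duration of one frame hop in milliseconds.
function hopMs(sampleRate: number, hop: number): number {
  return (1000 * hop) / sampleRate;
}

// Spacing between adjacent STFT bins in Hz for an nFft-point transform.
function stftBinHz(sampleRate: number, nFft: number): number {
  return sampleRate / nFft;
}

// Width of one semitone in Hz, measured upward from frequency f.
function semitoneHz(f: number): number {
  return f * (Math.pow(2, 1 / 12) - 1);
}

// Width of one pitch bin in cents.
// ASSUMPTION: bin centres are evenly spaced in log-frequency and fMin / fMax
// are the first and last centres.
function centsPerBin(fMin: number, fMax: number, nBins: number): number {
  const totalCents = 1200 * (Math.log(fMax / fMin) / Math.LN2);
  return totalCents / (nBins - 1);
}

const hop = hopMs(SAMPLE_RATE, HOP);            // 16 ms
const coarse = stftBinHz(SAMPLE_RATE, 1024);    // 15.625 Hz per bin
const fine = stftBinHz(SAMPLE_RATE, 4096);      // 3.90625 Hz per bin
const semitoneAt80 = semitoneHz(80);            // a little under 5 Hz
const binCents = centsPerBin(46.875, 2093.75, 200);
```

At 80 Hz a semitone is under 5 Hz wide, so one 15.625 Hz bin of the 1024-point transform covers roughly three semitones, while the 4096-point transform's spacing of about 3.9 Hz is narrower than a semitone. Under the spacing assumption above, each pitch bin is about a third of a semitone wide.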

Usage

JavaScript / TypeScript (npm)

npm install fnaught

import { PitchDetector } from "fnaught";
const detector = await PitchDetector.create();
const { pitchHz, confidence } = await detector.detect(audio, { sampleRate: 44100 });

The npm package implements the exact DSP frontend this model was trained with (verified numerically against the PyTorch implementation) and runs the model through onnxruntime-web in Node or the browser.

Python (onnxruntime)

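The ONNX graph does not include the DSP frontend. It takes precomputed frontend features (input "features", shape [1, 2, 184, T], where T is the number of frames) and returns logits of shape [1, T, 200], one row of 200 pitch-bin scores per frame. For the frontend definition, see the npm package source (src/dsp.ts) or the project repository.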
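Turning the logits into Hz needs a decoding step that is not spelled out here. As a minimal sketch of one common approach (softmax, then a probability-weighted average of bin indices in a small window around the strongest bin), the code below decodes a flat [1, T, 200] buffer. The function names, the window radius, and the assumption that bin centres are evenly log-spaced between the range limits are all illustrative; the decoder in the fnaught package may differ.

```typescript
// Hypothetical decoder sketch; the fnaught package may decode differently.
const GRID = { fMin: 46.875, fMax: 2093.75, nBins: 200 };

// ASSUMPTION: (fractional) bin index k maps to
// fMin * (fMax / fMin)^(k / (nBins - 1)).
function binToHz(k: number): number {
  return GRID.fMin * Math.pow(GRID.fMax / GRID.fMin, k / (GRID.nBins - 1));
}

// Numerically stable softmax over one frame of logits.
function softmax(logits: ArrayLike<number>): number[] {
  let max = -Infinity;
  for (let i = 0; i < logits.length; i++) max = Math.max(max, logits[i]);
  const exps: number[] = [];
  let sum = 0;
  for (let i = 0; i < logits.length; i++) {
    const e = Math.exp(logits[i] - max);
    exps.push(e);
    sum += e;
  }
  return exps.map((e) => e / sum);
}

// Decode one frame: find the most probable bin, then average the bin index
// over a small window around it, weighted by probability.
function decodeFrame(logits: ArrayLike<number>, radius = 2): number {
  const p = softmax(logits);
  let best = 0;
  for (let k = 1; k < p.length; k++) {
    if (p[k] > p[best]) best = k;
  }
  const lo = Math.max(0, best - radius);
  const hi = Math.min(p.length - 1, best + radius);
  let num = 0;
  let den = 0;
  for (let k = lo; k <= hi; k++) {
    num += k * p[k];
    den += p[k];
  }
  return binToHz(num / den);
}

// Decode a flat, row-major [1, T, nBins] logits buffer into T pitch values.
function decodeLogits(flat: ArrayLike<number>, nBins = GRID.nBins): number[] {
  const frames = Math.floor(flat.length / nBins);
  const out: number[] = [];
  for (let t = 0; t < frames; t++) {
    const frame: number[] = [];
    for (let k = 0; k < nBins; k++) frame.push(flat[t * nBins + k]);
    out.push(decodeFrame(frame));
  }
  return out;
}
```

Averaging in bin-index space is the same as averaging in log-frequency under the spacing assumption, which is how a 200-bin output can yield continuous Hz values.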

Training

Trained on public pitch datasets (MIR-1K, MDB-stem-synth, PTDB-TUG) and on synthetic speech generated with a phoneme-level TTS model. Augmentation consists of additive background noise (CHiME-Home recordings and Gaussian noise, SNR 10–30 dB), gain changes, and pitch shifting by up to ±2 semitones. Training uses voiced frames only, with a joint objective: cross-entropy plus an L1 term on log-frequency.
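Mixing noise at a given SNR is the standard step behind the noise augmentation mentioned above. The following is a minimal sketch of that technique, not the project's training code: the function names are hypothetical, and looping the noise when it is shorter than the signal is an assumption. It also shows the frequency ratio that corresponds to a shift in semitones.

```typescript
// Sketch of additive-noise mixing at a target SNR; not the project's code.

// Mean power of a signal (mean of squared samples).
function meanPower(x: ArrayLike<number>): number {
  let s = 0;
  for (let i = 0; i < x.length; i++) s += x[i] * x[i];
  return x.length > 0 ? s / x.length : 0;
}

// Return clean + g * noise, with g chosen so that
// 10 * log10(P_clean / P_scaledNoise) equals snrDb.
// ASSUMPTION: noise shorter than the signal is looped.
function mixAtSnr(
  clean: ArrayLike<number>,
  noise: ArrayLike<number>,
  snrDb: number
): number[] {
  if (noise.length === 0) throw new Error("noise must not be empty");
  const seg: number[] = [];
  for (let i = 0; i < clean.length; i++) seg.push(noise[i % noise.length]);
  const pClean = meanPower(clean);
  const pNoise = meanPower(seg);
  // With a silent signal or silent noise the SNR is undefined; add no noise.
  const gain =
    pClean > 0 && pNoise > 0
      ? Math.sqrt(pClean / (pNoise * Math.pow(10, snrDb / 10)))
      : 0;
  return seg.map((v, i) => clean[i] + gain * v);
}

// Frequency ratio for a pitch shift of the given number of semitones,
// e.g. +2 semitones multiplies every frequency by 2^(2/12).
function pitchShiftRatio(semitones: number): number {
  return Math.pow(2, semitones / 12);
}
```

When pitch-shifting training audio, the ground-truth f0 labels have to be multiplied by the same ratio so that they stay aligned with the shifted signal.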

Evaluation

The score is the harmonic mean of six metrics (RPA, cents accuracy, voicing precision, voicing recall, octave accuracy, gross-error accuracy), measured on three held-out datasets under additive noise (SNR 10–30 dB) and reported as mean ± sd over 5 training seeds:

Dataset             Score (mean ± sd)
Bach10-mf0-synth    98.27 ± 0.14
SpeechSynth         91.06 ± 0.10
Vocadito            94.95 ± 0.04
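To make the score concrete, here is a rough sketch of two of its ingredients: raw pitch accuracy (RPA) and the harmonic mean used to combine the metrics. The 50-cent tolerance is the conventional RPA threshold and is assumed here; the benchmark suite's own definitions are authoritative, and the function names are illustrative.

```typescript
// Illustrative metric helpers; the benchmark suite defines the real ones.

// Signed pitch error in cents between an estimate and a reference (both Hz).
function centsError(estHz: number, refHz: number): number {
  return 1200 * (Math.log(estHz / refHz) / Math.LN2);
}

// Raw pitch accuracy: fraction of reference-voiced frames whose estimate is
// within tolCents of the reference. Frames with refHz <= 0 count as unvoiced
// and are skipped. ASSUMPTION: 50-cent tolerance, the conventional choice.
function rawPitchAccuracy(
  estHz: ArrayLike<number>,
  refHz: ArrayLike<number>,
  tolCents = 50
): number {
  let voiced = 0;
  let correct = 0;
  for (let i = 0; i < refHz.length; i++) {
    if (!(refHz[i] > 0)) continue;
    voiced++;
    if (estHz[i] > 0 && Math.abs(centsError(estHz[i], refHz[i])) < tolCents) {
      correct++;
    }
  }
  return voiced > 0 ? correct / voiced : 0;
}

// Harmonic mean of a list of metric values. Any value <= 0 gives 0.
function harmonicMean(values: ArrayLike<number>): number {
  if (values.length === 0) return 0;
  let sumInv = 0;
  for (let i = 0; i < values.length; i++) {
    if (!(values[i] > 0)) return 0;
    sumInv += 1 / values[i];
  }
  return values.length / sumInv;
}
```

The harmonic mean is dominated by its smallest input, so a model cannot reach a high combined score by excelling on some metrics while doing poorly on another.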

Low-pitch behaviour (held-out synthetic speech, ground-truth-voiced frames): at 80–110 Hz, raw pitch accuracy roughly doubles relative to a same-size baseline model that uses a single 1024-point STFT window, and octave-error rates drop by an order of magnitude.

Evaluation used the open-source pitch benchmark suite at https://github.com/lars76/pitch-benchmark (datasets, metrics, and noise protocol as defined there).

Limitations

  • Monophonic sources only; polyphonic mixtures are out of scope.
  • Trained mostly on speech and singing; extreme instrument timbres and pitches outside 47–2094 Hz are not covered.
  • Confidence is calibrated for voicing decisions around a 0.9 threshold; it is not a general uncertainty estimate.
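In practice the confidence value is best used as a voiced/unvoiced gate. A minimal sketch, assuming per-frame arrays of pitch and confidence (the helper name, the array-based shape, and the inclusive comparison are assumptions, not part of the documented API):

```typescript
// Hypothetical helper; not part of the fnaught package.
// Frames whose confidence is below the threshold are reported as null
// (unvoiced); the rest keep their pitch estimate in Hz.
function gateByConfidence(
  pitchHz: ArrayLike<number>,
  confidence: ArrayLike<number>,
  threshold = 0.9
): (number | null)[] {
  const n = Math.min(pitchHz.length, confidence.length);
  const out: (number | null)[] = [];
  for (let i = 0; i < n; i++) {
    out.push(confidence[i] >= threshold ? pitchHz[i] : null);
  }
  return out;
}
```

Because the calibration targets the region around 0.9, values well away from that threshold should not be read as probabilities of the pitch estimate being correct.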

License

MIT. Trained on publicly available research datasets; see dataset attributions: MIR-1K, MDB-stem-synth, PTDB-TUG, CHiME-Home.
