---
language: en
license: mit
library_name: onnxruntime
pipeline_tag: audio-to-audio
tags:
- onnx
- onnxruntime
- stem-separation
- source-separation
- vocal-isolation
- vocal-remover
- drum-extraction
- bass-extraction
- karaoke
- demucs
- htdemucs
- music
- audio-to-audio
- mobile
- ios
- android
- coreml
- directml
- production-ready
datasets:
- StemSplitio/stem-separation-benchmark-2026
inference: false
---

# HT-Demucs FT: Full 4-Stem Bag, ONNX

**The first complete ONNX export of HT-Demucs FT on the Hugging Face Hub.** Four parity-verified ONNX models (drums, bass, other, vocals) plus a ~250-line numpy aggregator that runs the full 4-stem separation in pure `onnxruntime`. **No PyTorch required at inference.** Runs on CPU / CoreML / CUDA / DirectML.

This repo is the convenience drop: all 4 specialist sub-models of `htdemucs_ft` in one place, with a working bag-inference script. If you only need one stem in production, the individual stem-specialist repos below are ~75% smaller and ~4× faster per song.

---

## TL;DR

```bash
pip install onnxruntime numpy soundfile
python bag_infer.py your-song.mp3 ./out/
# writes out/drums.wav, out/bass.wav, out/other.wav, out/vocals.wav
```

That's it. The 4 `.onnx` files (316 MB each, ~1.26 GB total) live alongside the script.

---

## Quality

Median per-stem SDR on the MUSDB18-HQ test split (50 songs), BSS Eval v4 via `museval`. **Identical to the official PyTorch `htdemucs_ft`**: the bag's per-stem output is the corresponding specialist's output (the weight matrix is one-hot per stem).

| Stem | SDR (dB) | Rank in our 2026 benchmark |
|---|---:|---|
| **vocals** | **9.19** | **#1** (highest open-source vocal SDR) |
| drums | 10.11 | #2 (mdx_extra_q leads at 11.49) |
| bass | 10.38 | #2 (mdx_extra_q leads at 11.42) |
| other | 6.34 | #2 (mdx_extra_q leads at 7.67) |

Full benchmark across every popular open-source separator: [StemSplitio/stem-separation-benchmark-2026](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026).

**ONNX vs PyTorch parity:** verified to < 1e-3 max abs diff on every stem during export. See the [Day 1 spike report](https://huggingface.co/StemSplitio/htdemucs-ft-drums-onnx#how-it-was-built) for the full engineering writeup.

---

## Performance

Timings for a 3-minute song. The CPU and MPS rows were measured on an Apple M4 Pro; the NVIDIA rows are extrapolated, as noted.

| Mode | Hardware | Per 3-min song | Notes |
|---|---|---:|---|
| One specialist (`htdemucs-ft-drums-onnx`) | M4 Pro CPU | **~22 s** | 4× faster, 75% smaller. Use this if you only need one stem. |
| **Full bag (this repo)** | M4 Pro CPU | **~88 s** | Real-time factor (RTF) ~0.5. 4 sub-models × N chunks. |
| Full bag | M4 Pro CPU (8 threads) | ~60 s | With `OMP_NUM_THREADS=8` and SessionOptions tuned |
| Full bag | NVIDIA L4 CUDA | ~6 s | Extrapolated from per-specialist CUDA numbers |
| Full bag | NVIDIA T4 | ~16 s | Extrapolated |
| PyTorch full bag | M4 Pro MPS | ~47 s | Faster only because MPS is GPU-accelerated; ONNX-CUDA beats it cleanly. |

---

## Tooling: `demucs-onnx` Python package

This bag is also packaged in the open-source [`demucs-onnx`](https://github.com/StemSplit/demucs-onnx) Python package on PyPI. It auto-downloads each specialist from the matching HF repo on first use, so you don't need to fetch the four `.onnx` files manually.

```bash
pip install demucs-onnx

# Full 4-stem separation (auto-downloads ~1.26 GB on first run)
demucs-onnx separate song.mp3 stems/

# From Python
python -c "from demucs_onnx import separate; stems = separate('song.mp3')"
```

The same package is also the canonical tool for **exporting** htdemucs to ONNX yourself. It bundles all four blocker fixes (complex STFT, `fractions.Fraction`, `random.randrange`, `aten::_native_multi_head_attention`) so vanilla `torch.onnx.export` works on your own demucs checkpoints.

```bash
pip install "demucs-onnx[export]"
demucs-onnx export htdemucs_ft out/   # writes 4 .onnx files
```

---

## Common use cases

- **Karaoke makers**: summing `out/drums.wav`, `out/bass.wav`, and `out/other.wav` gives a clean karaoke track, and `out/vocals.wav` is the acapella, all from one pass.
- **DAW stem export**: drop the 4 `.wav` files into Ableton / Logic / Reaper as separate channels for remixing.
- **DJ stems software**: load all 4 stems as live-mixable tracks.
- **AI music apps**: feed each stem into downstream models (drum transcription, bassline-to-MIDI, vocal pitch correction).
- **Acapella sampling**: clean isolated vocals at the highest SDR available in open source.
- **Mobile / on-device separation**: replaces a 1+ GB PyTorch install with `onnxruntime`'s 50 MB binary on iOS / Android.

---

## Quick start

### Python: as a library

```python
import bag_infer

stems = bag_infer.separate_all("your-song.mp3")
# stems: dict[str, numpy.ndarray (2, samples)]
# stems["drums"], stems["bass"], stems["other"], stems["vocals"]
```

### Python: with execution provider control

```python
import soundfile as sf
import bag_infer

# sf.read returns (samples, channels); the separator expects (channels, samples).
audio, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True)
stems = bag_infer.separate(
    audio.T, sr,
    providers=["CPUExecutionProvider"],  # or "CoreMLExecutionProvider", etc.
)
# Transpose each stem back to (samples, channels) before writing.
for name, stem in stems.items():
    sf.write(f"{name}.wav", stem.T, sr)
```

### CLI

```bash
python bag_infer.py your-song.mp3 ./out/
python bag_infer.py your-song.mp3 ./out/ --providers cuda
python bag_infer.py your-song.mp3 ./out/ --providers coreml
python bag_infer.py your-song.mp3 ./out/ --providers dml
```

### Web / mobile

Each specialist is a vanilla onnxruntime model; just load all 4 sessions and reuse the aggregation logic in `bag_infer.py::separate`. See the individual stem repos for platform-specific snippets: [drums](https://huggingface.co/StemSplitio/htdemucs-ft-drums-onnx) · [bass](https://huggingface.co/StemSplitio/htdemucs-ft-bass-onnx) · [other](https://huggingface.co/StemSplitio/htdemucs-ft-other-onnx) · [vocals](https://huggingface.co/StemSplitio/htdemucs-ft-vocals-onnx).

---

## How aggregation works

The `htdemucs_ft` bag uses a **one-hot weight matrix** for combining the 4 sub-models: model 0's drums output is used directly as the bag's drums stem, model 1's bass output is the bag's bass stem, and so on. No weighted-sum aggregation needed.

That means:

- **The bag's drums stem == the drums specialist's drums output** (bit-exact in fp32)
- Same for bass, other, vocals
- So you can ship only the specialists you need and get identical per-stem quality to the full bag at 1/4 the size

`bag_infer.py` simply runs all 4 specialists and picks the relevant row from each. ~30 lines of numpy.

---

## Input / output spec per sub-model

| Tensor | Name | Shape | Dtype | Notes |
|---|---|---|---|---|
| Input | `mix` | `(1, 2, 343980)` | float32 | Stereo audio, 44.1 kHz, 7.8 s segment. |
| Output | `stems` | `(1, 4, 2, 343980)` | float32 | `[drums, bass, other, vocals]`. Use only the specialist's target row. |

For longer audio, the bag script handles overlap-add chunking.

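
As a rough illustration of how fixed-size segments, overlap-add, and the one-hot row pick fit together, here is a minimal numpy sketch. It is not the code in `bag_infer.py`: the names `overlap_add`, `separate_bag`, `run_segment`, and `runners`, the 25% overlap, and the triangular cross-fade window are all assumptions chosen for illustration. Only the segment length and tensor shapes come from the spec above.

```python
import numpy as np

SEGMENT = 343980  # samples per model call, from the spec table
STEM_NAMES = ["drums", "bass", "other", "vocals"]


def overlap_add(mix, run_segment, segment=SEGMENT, overlap=0.25):
    """Separate a (2, n_samples) mix of any length with a fixed-size model.

    run_segment is a hypothetical callable mapping a (1, 2, segment) float32
    array to a (1, 4, 2, segment) array, e.g. a wrapped onnxruntime session.
    """
    channels, length = mix.shape
    stride = max(1, int(segment * (1 - overlap)))
    # Triangular weights (assumed): chunk edges count less than chunk centers,
    # so overlapping regions cross-fade instead of producing hard seams.
    idx = np.arange(1, segment + 1, dtype=np.float32)
    window = np.minimum(idx, idx[::-1])
    window /= window.max()

    out = np.zeros((len(STEM_NAMES), channels, length), dtype=np.float32)
    weight_sum = np.zeros(length, dtype=np.float32)
    for start in range(0, length, stride):
        chunk = mix[:, start:start + segment]
        valid = chunk.shape[1]
        # Zero-pad the final chunk up to the fixed model input length.
        padded = np.zeros((1, channels, segment), dtype=np.float32)
        padded[0, :, :valid] = chunk
        stems = run_segment(padded)[0]  # (4, 2, segment)
        out[:, :, start:start + valid] += stems[:, :, :valid] * window[:valid]
        weight_sum[start:start + valid] += window[:valid]
    # Normalize by the accumulated weights to undo the windowing.
    return out / weight_sum


def separate_bag(mix, runners, segment=SEGMENT):
    """One-hot aggregation: keep row i of the i-th specialist's output.

    runners is a hypothetical dict mapping stem name -> run_segment callable.
    """
    return {
        name: overlap_add(mix, runners[name], segment=segment)[i]
        for i, name in enumerate(STEM_NAMES)
    }
```

With real models, each entry of `runners` would wrap one `onnxruntime.InferenceSession`, for example `lambda x: session.run(None, {"mix": x})[0]`, using the `mix` input name from the spec table.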
---

## Files in this repo

| File | Size | Purpose |
|---|---:|---|
| `htdemucs_ft_drums.onnx` | 316 MB | Drums specialist (bag index 0) |
| `htdemucs_ft_bass.onnx` | 316 MB | Bass specialist (bag index 1) |
| `htdemucs_ft_other.onnx` | 316 MB | Other specialist (bag index 2) |
| `htdemucs_ft_vocals.onnx` | 316 MB | Vocals specialist (bag index 3) |
| `bag_infer.py` | 7 KB | Pure numpy aggregator. No torch. |
| `requirements.txt` | <1 KB | `onnxruntime`, `numpy`, `soundfile`. |
| `README.md` | | This file. |

Total: **~1.26 GB**. If that's too big, use individual stem repos.

---

## Related work

| Repo | Stem | Use when |
|---|---|---|
| [`htdemucs-ft-drums-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-drums-onnx) | drums | Only need drums (1/4 size, 1/4 latency) |
| [`htdemucs-ft-bass-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-bass-onnx) | bass | Only need bass |
| [`htdemucs-ft-other-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-other-onnx) | other | Only need "other" (everything except drums, bass, and vocals) |
| [`htdemucs-ft-vocals-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-vocals-onnx) | vocals | **#1 open-source vocal SDR** |

PyTorch versions for HF Inference Endpoints: [`htdemucs-ft-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-pytorch) and its [4 sibling specialist repos](https://huggingface.co/StemSplitio).

---

## Skip the infrastructure: use the StemSplit API

Don't want to ship 1.26 GB of `.onnx` files in your app, manage a GPU pool, or write overlap-add chunking? Use the **[StemSplit API](https://stemsplit.io/developers)** instead: same models under the hood, hosted for you, with credits and a dashboard.

- 🌐 [stemsplit.io](https://stemsplit.io)
- 📘 [Developer docs](https://stemsplit.io/developers/docs)
- 🔌 [API reference](https://stemsplit.io/developers/reference)

Or use the no-code tools that ship this same model family:

- 🎤 [Vocal Remover](https://stemsplit.io/vocal-remover)
- 🎶 [Karaoke Maker](https://stemsplit.io/karaoke-maker)
- 🎙️ [Acapella Maker](https://stemsplit.io/acapella-maker)
- 📺 [YouTube Stem Splitter](https://stemsplit.io/youtube-stem-splitter)

---

## License & attribution

MIT-licensed, matching the original HT-Demucs.

```bibtex
@inproceedings{rouard2023hybrid,
  title     = {Hybrid Transformers for Music Source Separation},
  author    = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
  booktitle = {ICASSP},
  year      = {2023}
}
```

- Original PyTorch model: [`facebookresearch/demucs`](https://github.com/facebookresearch/demucs)
- ONNX export, parity verification, and packaging by [StemSplit](https://stemsplit.io)
- Search keywords: htdemucs onnx, demucs onnx, htdemucs bag onnx, demucs ios, demucs android, music source separation onnx, 4-stem separation onnx, stem separation mobile, onnxruntime music separation