---
language: en
license: mit
library_name: onnxruntime
pipeline_tag: audio-to-audio
tags:
- onnx
- onnxruntime
- stem-separation
- source-separation
- vocal-isolation
- vocal-remover
- drum-extraction
- bass-extraction
- karaoke
- demucs
- htdemucs
- music
- audio-to-audio
- mobile
- ios
- android
- coreml
- directml
- production-ready
datasets:
- StemSplitio/stem-separation-benchmark-2026
inference: false
---
# HT-Demucs FT: Full 4-Stem Bag, ONNX
**The first complete ONNX export of HT-Demucs FT on the Hugging Face Hub.**
Four parity-verified ONNX models (drums, bass, other, vocals) plus a
~250-line numpy aggregator that runs the full 4-stem separation in pure
`onnxruntime`. **No PyTorch required at inference.** Runs on CPU /
CoreML / CUDA / DirectML.
This repo is the convenience drop: all 4 specialist sub-models of
`htdemucs_ft` in one place, with a working bag-inference script. If you
only need one stem in production, the individual stem-specialist repos
below are ~75% smaller and ~4× faster per song.
---
## TL;DR
```bash
pip install onnxruntime numpy soundfile
python bag_infer.py your-song.mp3 ./out/
# writes out/drums.wav, out/bass.wav, out/other.wav, out/vocals.wav
```
That's it. The 4 `.onnx` files (316 MB each, ~1.26 GB total) live
alongside the script.
---
## Quality
Median per-stem SDR on the MUSDB18-HQ test split (50 songs), BSS Eval v4
via `museval`. **Identical to the official PyTorch `htdemucs_ft`**, because
the bag's per-stem output is the corresponding specialist's output (the
weight matrix is one-hot per stem).
| Stem | SDR (dB) | Rank in our 2026 benchmark |
|---|---:|---|
| **vocals** | **9.19** | **#1** (highest open-source vocal SDR) |
| drums | 10.11 | #2 (mdx_extra_q leads at 11.49) |
| bass | 10.38 | #2 (mdx_extra_q leads at 11.42) |
| other | 6.34 | #2 (mdx_extra_q leads at 7.67) |
Full benchmark across every popular open-source separator:
[StemSplitio/stem-separation-benchmark-2026](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026).
**ONNX vs PyTorch parity:** verified to < 1e-3 max abs diff on every stem
during export. See the
[Day 1 spike report](https://huggingface.co/StemSplitio/htdemucs-ft-drums-onnx#how-it-was-built)
for the full engineering writeup.
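A parity check of this kind can be sketched in a few lines of numpy. This is a minimal illustration, not the export pipeline's actual test: the function names are hypothetical, and the arrays below are stand-in data rather than real model outputs.

```python
import numpy as np

def max_abs_diff(reference: np.ndarray, candidate: np.ndarray) -> float:
    """Largest element-wise absolute difference between two outputs."""
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    # Compare in float64 so the comparison itself adds no rounding noise.
    diff = reference.astype(np.float64) - candidate.astype(np.float64)
    return float(np.max(np.abs(diff)))

def is_parity_ok(reference: np.ndarray, candidate: np.ndarray, tol: float = 1e-3) -> bool:
    """True when the two outputs agree to within the stated tolerance."""
    return max_abs_diff(reference, candidate) < tol

# Stand-in data: a fake "PyTorch" output and a slightly perturbed "ONNX" output.
rng = np.random.default_rng(0)
reference = rng.standard_normal((4, 2, 1000)).astype(np.float32)
candidate = reference + np.float32(1e-4)
print(is_parity_ok(reference, candidate))
```

In practice `reference` would come from the PyTorch model and `candidate` from the ONNX session, both run on the same input segment.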
---
## Performance
Measured on an Apple M4 Pro unless a row is marked as extrapolated:
| Mode | Hardware | Per 3-min song | Notes |
|---|---|---:|---|
| One specialist (`htdemucs-ft-drums-onnx`) | M4 Pro CPU | **~22 s** | 4× faster, 75% smaller; use this if you only need one stem |
| **Full bag (this repo)** | M4 Pro CPU | **~88 s** | RTF ~0.5. 4 sub-models × N chunks. |
| Full bag | M4 Pro CPU (8 threads) | ~60 s | With `OMP_NUM_THREADS=8` and SessionOptions tuned |
| Full bag | NVIDIA L4 CUDA | ~6 s | Extrapolated from per-specialist CUDA numbers |
| Full bag | NVIDIA T4 | ~16 s | Extrapolated |
| PyTorch full bag | M4 Pro MPS | ~47 s | Faster only because MPS is GPU-accelerated; ONNX-CUDA beats it cleanly. |
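The RTF figure and the "4 sub-models × N chunks" note can be reproduced with a little arithmetic. The sketch below assumes a 25% overlap between chunks, which is not stated in this README, and uses the 343980-sample segment from the input spec; the function names are illustrative.

```python
import math

def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing time divided by audio duration (lower is faster)."""
    return processing_s / audio_s

def model_calls(audio_s: float, sample_rate: int = 44100, segment: int = 343980,
                overlap: float = 0.25, sub_models: int = 4) -> int:
    """Rough count of ONNX session runs for one song.

    overlap=0.25 is an assumption, not a value taken from bag_infer.py.
    """
    stride = int(segment * (1 - overlap))           # hop between chunk starts
    chunks = math.ceil(audio_s * sample_rate / stride)
    return sub_models * chunks

print(round(real_time_factor(88, 180), 2))  # full bag on M4 Pro CPU, 3-min song
print(model_calls(180))                     # session runs under the assumed overlap
```

Under these assumptions a 3-minute song needs 31 chunks per sub-model, which is why the full bag costs roughly four times a single specialist.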
---
## Tooling: `demucs-onnx` Python package
This bag is also packaged in the open-source
[`demucs-onnx`](https://github.com/StemSplit/demucs-onnx) Python package
on PyPI. It auto-downloads each specialist from the matching HF repo on
first use, so you don't even need to manually fetch the four `.onnx`
files.
```bash
pip install demucs-onnx
# Full 4-stem separation (auto-downloads ~1.26 GB on first run)
demucs-onnx separate song.mp3 stems/
# From Python
python -c "from demucs_onnx import separate; stems = separate('song.mp3')"
```
The same package is also the canonical tool for **exporting** htdemucs
to ONNX yourself. It bundles all four blocker fixes (complex STFT,
`fractions.Fraction`, `random.randrange`,
`aten::_native_multi_head_attention`) so vanilla `torch.onnx.export`
works on your own demucs checkpoints.
```bash
pip install "demucs-onnx[export]"
demucs-onnx export htdemucs_ft out/ # writes 4 .onnx files
```
---
## Common use cases
- **Karaoke makers**: summing `out/drums.wav`, `out/bass.wav`, and
`out/other.wav` gives a clean karaoke track, and `out/vocals.wav` is the
acapella, all in one pass.
- **DAW stem export**: drop the 4 `.wav` files into Ableton / Logic /
Reaper as separate channels for remixing.
- **DJ stems software**: load all 4 stems as live-mixable tracks.
- **AI music apps**: feed each stem into downstream models (drum
transcription, bassline-to-MIDI, vocal pitch correction).
- **Acapella sampling**: clean isolated vocals at the highest SDR
available in open source.
- **Mobile / on-device separation**: replaces a 1+ GB PyTorch install
with `onnxruntime`'s 50 MB binary on iOS / Android.
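For the karaoke case, the backing track is the sum of the three non-vocal stems. A minimal sketch, assuming `stems` is a dict of equally shaped float arrays like the one the separation script produces; `karaoke_mix` is a hypothetical helper, not part of `bag_infer.py`.

```python
import numpy as np

def karaoke_mix(stems: dict) -> np.ndarray:
    """Sum the non-vocal stems into one backing track."""
    backing = stems["drums"] + stems["bass"] + stems["other"]
    # Summed stems can exceed full scale; rescale only when they would clip.
    peak = float(np.max(np.abs(backing)))
    if peak > 1.0:
        backing = backing / peak
    return backing

# Stand-in stems: constant signals of shape (2, samples).
demo = {name: np.full((2, 8), level, dtype=np.float32)
        for name, level in [("drums", 0.1), ("bass", 0.2),
                            ("other", 0.3), ("vocals", 0.4)]}
backing = karaoke_mix(demo)
acapella = demo["vocals"]
```

The vocals stem is left untouched, so the same pass yields both the karaoke track and the acapella.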
---
## Quick start
### Python: as a library
```python
import bag_infer
stems = bag_infer.separate_all("your-song.mp3")
# stems: dict[str, numpy.ndarray (2, samples)]
# stems["drums"], stems["bass"], stems["other"], stems["vocals"]
```
### Python: with execution provider control
```python
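import soundfile as sf
import bag_infer

# Decode to float32 with shape (samples, channels).
# MP3 support depends on the installed libsndfile version.
audio, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True)
stems = bag_infer.separate(
    audio.T, sr,  # transpose to channels-first: (2, samples)
    providers=["CPUExecutionProvider"],  # or "CoreMLExecutionProvider", etc.
)
# Use a separate loop variable so the input `audio` is not overwritten.
for name, stem in stems.items():
    sf.write(f"{name}.wav", stem.T, sr)  # back to (samples, channels)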
```
### CLI
```bash
python bag_infer.py your-song.mp3 ./out/
python bag_infer.py your-song.mp3 ./out/ --providers cuda
python bag_infer.py your-song.mp3 ./out/ --providers coreml
python bag_infer.py your-song.mp3 ./out/ --providers dml
```
### Web / mobile
Each specialist is a vanilla onnxruntime model; just load all 4 sessions
and reuse the aggregation logic in `bag_infer.py::separate`. See the
individual stem repos for platform-specific snippets:
[drums](https://huggingface.co/StemSplitio/htdemucs-ft-drums-onnx) ·
[bass](https://huggingface.co/StemSplitio/htdemucs-ft-bass-onnx) ·
[other](https://huggingface.co/StemSplitio/htdemucs-ft-other-onnx) ·
[vocals](https://huggingface.co/StemSplitio/htdemucs-ft-vocals-onnx).
---
## How aggregation works
The `htdemucs_ft` bag uses a **one-hot weight matrix** for combining the 4
sub-models: model 0's drums output is used directly as the bag's drums
stem, model 1's bass output is the bag's bass stem, and so on. No
weighted-sum aggregation needed.
That means:
- **The bag's drums stem == the drums specialist's drums output** (bit-exact in fp32)
- Same for bass, other, vocals
- So you can ship only the specialists you need and get identical
per-stem quality to the full bag at 1/4 the size
`bag_infer.py` simply runs all 4 specialists and picks the relevant row
from each. The aggregation step itself is ~30 lines of numpy.
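To see why a one-hot weight matrix reduces to picking one row per model, compare a general weighted bag with the diagonal shortcut. This is a sketch of the idea, not the code in `bag_infer.py`; the function names and the `(models, stems, channels, samples)` layout are assumptions made for illustration.

```python
import numpy as np

SOURCES = ["drums", "bass", "other", "vocals"]
# weights[m, s] is how much sub-model m contributes to stem s.
ONE_HOT = np.eye(len(SOURCES), dtype=np.float32)

def aggregate(outputs: np.ndarray, weights: np.ndarray = ONE_HOT) -> dict:
    """General bag rule for outputs shaped (models, stems, channels, samples)."""
    # Weighted sum over the model axis, normalised by each stem's total weight.
    mixed = np.einsum("ms,mscn->scn", weights, outputs)
    mixed /= weights.sum(axis=0)[:, None, None]
    return {name: mixed[i] for i, name in enumerate(SOURCES)}

def pick_diagonal(outputs: np.ndarray) -> dict:
    """Shortcut valid only for one-hot weights: stem i comes from model i."""
    return {name: outputs[i, i] for i, name in enumerate(SOURCES)}

# Stand-in outputs from 4 sub-models, each predicting 4 stereo stems.
rng = np.random.default_rng(0)
outputs = rng.standard_normal((4, 4, 2, 64)).astype(np.float32)
general = aggregate(outputs)
shortcut = pick_diagonal(outputs)
```

With the identity matrix as weights, `general` and `shortcut` hold the same values for every stem, so the three off-target rows of each specialist's output can simply be discarded.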
---
## Input / output spec per sub-model
| Tensor | Name | Shape | Dtype | Notes |
|---|---|---|---|---|
| Input | `mix` | `(1, 2, 343980)` | float32 | Stereo audio, 44.1 kHz, 7.8 s segment. |
| Output | `stems` | `(1, 4, 2, 343980)` | float32 | `[drums, bass, other, vocals]`. Use only the specialist's target row. |
For longer audio, the bag script handles overlap-add chunking.
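The chunking idea can be sketched as follows. This is an illustration under stated assumptions rather than the script's actual code: the 25% overlap and triangular window are assumptions, and `run_model` is a hypothetical stand-in for a call into an ONNX session that maps `(1, 2, segment)` to `(1, 4, 2, segment)`.

```python
import numpy as np

SEGMENT = 343980  # fixed model input length from the spec table

def overlap_add(mix, run_model, segment=SEGMENT, overlap=0.25):
    """Separate a (channels, samples) mix by blending overlapping chunks."""
    channels, length = mix.shape
    stride = int(segment * (1 - overlap))
    # Triangular window: each chunk fades in and out so overlaps blend smoothly.
    half = segment // 2
    window = np.concatenate([np.arange(1, half + 1),
                             np.arange(segment - half, 0, -1)]).astype(np.float32)
    window /= window.max()
    out = np.zeros((4, channels, length), dtype=np.float32)
    weight_sum = np.zeros(length, dtype=np.float32)
    for start in range(0, length, stride):
        chunk = mix[:, start:start + segment]
        valid = chunk.shape[1]
        # Zero-pad the final chunk up to the fixed input length.
        padded = np.pad(chunk, ((0, 0), (0, segment - valid)))
        stems = run_model(padded[None])[0]          # (4, channels, segment)
        out[:, :, start:start + valid] += stems[:, :, :valid] * window[:valid]
        weight_sum[start:start + valid] += window[:valid]
    # Divide by the accumulated window weight to undo the blending gain.
    return out / weight_sum

# Stand-in "model": stem k is the input scaled by k + 1.
def fake_model(x):
    return np.stack([x * k for k in (1, 2, 3, 4)], axis=1)

rng = np.random.default_rng(0)
mix = rng.standard_normal((2, 250)).astype(np.float32)
result = overlap_add(mix, fake_model, segment=100)
```

Because every sample is divided by the total window weight it received, a model that behaves consistently across chunks reconstructs cleanly at chunk boundaries.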
---
## Files in this repo
| File | Size | Purpose |
|---|---:|---|
| `htdemucs_ft_drums.onnx` | 316 MB | Drums specialist (bag index 0) |
| `htdemucs_ft_bass.onnx` | 316 MB | Bass specialist (bag index 1) |
| `htdemucs_ft_other.onnx` | 316 MB | Other specialist (bag index 2) |
| `htdemucs_ft_vocals.onnx` | 316 MB | Vocals specialist (bag index 3) |
| `bag_infer.py` | 7 KB | Pure numpy aggregator. No torch. |
| `requirements.txt` | <1 KB | `onnxruntime`, `numpy`, `soundfile`. |
| `README.md` | this file | |
Total: **~1.26 GB**. If that's too big, use individual stem repos.
---
## Related work
| Repo | Stem | Use when |
|---|---|---|
| [`htdemucs-ft-drums-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-drums-onnx) | drums | Only need drums (1/4 size, 1/4 latency) |
| [`htdemucs-ft-bass-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-bass-onnx) | bass | Only need bass |
| [`htdemucs-ft-other-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-other-onnx) | other | Only need "other" (everything that is not drums, bass, or vocals) |
| [`htdemucs-ft-vocals-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-vocals-onnx) | vocals | **#1 open-source vocal SDR** |
PyTorch versions for HF Inference Endpoints:
[`htdemucs-ft-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-pytorch)
and its [4 sibling specialist repos](https://huggingface.co/StemSplitio).
---
## Skip the infrastructure: use the StemSplit API
Don't want to ship 1.26 GB of `.onnx` files in your app, manage a GPU
pool, or write overlap-add chunking? Use the
**[StemSplit API](https://stemsplit.io/developers)** instead: same models
under the hood, hosted for you, with credits and a dashboard.
- [stemsplit.io](https://stemsplit.io)
- [Developer docs](https://stemsplit.io/developers/docs)
- [API reference](https://stemsplit.io/developers/reference)
Or use the no-code tools that ship this same model family:
- [Vocal Remover](https://stemsplit.io/vocal-remover)
- [Karaoke Maker](https://stemsplit.io/karaoke-maker)
- [Acapella Maker](https://stemsplit.io/acapella-maker)
- [YouTube Stem Splitter](https://stemsplit.io/youtube-stem-splitter)
---
## License & attribution
MIT-licensed, matching the original HT-Demucs.
```bibtex
@inproceedings{rouard2023hybrid,
title = {Hybrid Transformers for Music Source Separation},
author = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
booktitle = {ICASSP},
year = {2023}
}
```
- Original PyTorch model: [`facebookresearch/demucs`](https://github.com/facebookresearch/demucs)
- ONNX export, parity verification, and packaging by [StemSplit](https://stemsplit.io)
- Search keywords: htdemucs onnx, demucs onnx, htdemucs bag onnx, demucs ios, demucs android, music source separation onnx, 4-stem separation onnx, stem separation mobile, onnxruntime music separation