---
language: en
license: mit
library_name: onnxruntime
pipeline_tag: audio-to-audio
tags:
  - onnx
  - onnxruntime
  - stem-separation
  - source-separation
  - vocal-isolation
  - vocal-remover
  - drum-extraction
  - bass-extraction
  - karaoke
  - demucs
  - htdemucs
  - music
  - audio-to-audio
  - mobile
  - ios
  - android
  - coreml
  - directml
  - production-ready
datasets:
  - StemSplitio/stem-separation-benchmark-2026
inference: false
---

# HT-Demucs FT — Full 4-Stem Bag, ONNX

**The first complete ONNX export of HT-Demucs FT on the Hugging Face Hub.**
Four parity-verified ONNX models (drums, bass, other, vocals) plus a
~250-line numpy aggregator that runs the full 4-stem separation in pure
`onnxruntime`. **No PyTorch required at inference.** Runs on CPU /
CoreML / CUDA / DirectML.

This repo is the convenience drop — all 4 specialist sub-models of
`htdemucs_ft` in one place, with a working bag-inference script. If you
only need one stem in production, the individual stem-specialist repos
below are ~75% smaller and ~4× faster per song.

---

## TL;DR

```bash
pip install onnxruntime numpy soundfile
python bag_infer.py your-song.mp3 ./out/
# writes out/drums.wav, out/bass.wav, out/other.wav, out/vocals.wav
```

That's it. The 4 `.onnx` files (316 MB each, ~1.26 GB total) live
alongside the script.

---

## Quality

Median per-stem SDR on the MUSDB18-HQ test split (50 songs), BSS Eval v4
via `museval`. **Identical to the official PyTorch `htdemucs_ft`** — the
bag's per-stem output IS the corresponding specialist's output (the weight
matrix is one-hot per stem).

| Stem | SDR (dB) | Rank in our 2026 benchmark |
|---|---:|---|
| **vocals** | **9.19** | **#1** (highest open-source vocal SDR) |
| drums | 10.11 | #2 (mdx_extra_q leads at 11.49) |
| bass | 10.38 | #2 (mdx_extra_q leads at 11.42) |
| other | 6.34 | #2 (mdx_extra_q leads at 7.67) |

Full benchmark across every popular open-source separator:
[StemSplitio/stem-separation-benchmark-2026](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026).

**ONNX vs PyTorch parity:** verified to < 1e-3 max abs diff on every stem
during export. See the
[Day 1 spike report](https://huggingface.co/StemSplitio/htdemucs-ft-drums-onnx#how-it-was-built)
for the full engineering writeup.
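
A parity check of this kind can be sketched in a few lines of numpy. This is an illustration only, not the export pipeline's actual code: `max_abs_diff` is a hypothetical helper, and the arrays are random stand-ins for the PyTorch and ONNX outputs of the same input segment.

```python
import numpy as np


def max_abs_diff(reference: np.ndarray, candidate: np.ndarray) -> float:
    """Largest element-wise absolute difference between two same-shaped arrays."""
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    # Compare in float64 so the subtraction itself adds no extra rounding.
    delta = reference.astype(np.float64) - candidate.astype(np.float64)
    return float(np.max(np.abs(delta)))


# Stand-in data. In a real check these would be the two runtimes' outputs,
# each shaped (1, 4, 2, 343980).
rng = np.random.default_rng(0)
torch_out = rng.standard_normal((1, 4, 2, 1024)).astype(np.float32)
onnx_out = torch_out + np.float32(1e-4)  # simulate small numerical drift

diff = max_abs_diff(torch_out, onnx_out)
print(f"max abs diff: {diff:.2e}")
assert diff < 1e-3, "parity threshold exceeded"
```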

---

## Performance

Measured on an Apple M4 Pro unless noted. The NVIDIA rows are extrapolated, not measured:

| Mode | Hardware | Per 3-min song | Notes |
|---|---|---:|---|
| One specialist (`htdemucs-ft-drums-onnx`) | M4 Pro CPU | **~22 s** | 4× faster, 75% smaller — use this if you only need one stem |
| **Full bag (this repo)** | M4 Pro CPU | **~88 s** | RTF ~0.5. 4 sub-models × N chunks. |
| Full bag | M4 Pro CPU (8 threads) | ~60 s | With `OMP_NUM_THREADS=8` and SessionOptions tuned |
| Full bag | NVIDIA L4 CUDA | ~6 s | Extrapolated from per-specialist CUDA numbers |
| Full bag | NVIDIA T4 | ~16 s | Extrapolated |
| PyTorch full bag | M4 Pro MPS | ~47 s | Faster than ONNX on CPU only because MPS runs on the GPU; the extrapolated ONNX CUDA rows are faster still. |
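
The RTF (real-time factor) quoted in the table is processing time divided by audio duration. The sketch below recomputes it from the CPU rows. `real_time_factor` is a hypothetical helper, and the timings are copied from the table rather than re-measured.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


SONG_SECONDS = 3 * 60  # the 3-minute song used in the table

# Per-song timings copied from the table (seconds).
timings = {
    "one specialist, CPU": 22,
    "full bag, CPU": 88,
    "full bag, CPU (8 threads)": 60,
}
for label, seconds in timings.items():
    print(f"{label}: RTF {real_time_factor(seconds, SONG_SECONDS):.2f}")
```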

---

## Tooling — `demucs-onnx` Python package

This bag is also packaged in the open-source
[`demucs-onnx`](https://github.com/StemSplit/demucs-onnx) Python package
on PyPI. It auto-downloads each specialist from the matching HF repo on
first use, so you don't even need to manually fetch the four `.onnx`
files.

```bash
pip install demucs-onnx

# Full 4-stem separation (auto-downloads ~1.26 GB on first run)
demucs-onnx separate song.mp3 stems/

# From Python
python -c "from demucs_onnx import separate; stems = separate('song.mp3')"
```

The same package is also the canonical tool for **exporting** htdemucs
to ONNX yourself — it bundles all four blocker fixes (complex STFT,
`fractions.Fraction`, `random.randrange`,
`aten::_native_multi_head_attention`) so vanilla `torch.onnx.export`
works on your own demucs checkpoints.

```bash
pip install "demucs-onnx[export]"
demucs-onnx export htdemucs_ft out/   # writes 4 .onnx files
```

---

## Common use cases

- **Karaoke makers** — sum `out/drums.wav`, `out/bass.wav`, and `out/other.wav`
  for the karaoke (instrumental) track; `out/vocals.wav` is the acapella. Both
  come from one pass.
- **DAW stem export** — drop the 4 `.wav` files into Ableton / Logic /
  Reaper as separate channels for remixing.
- **DJ stems software** — load all 4 stems as live-mixable tracks.
- **AI music apps** — feed each stem into downstream models (drum
  transcription, bassline-to-MIDI, vocal pitch correction).
- **Acapella sampling** — clean isolated vocals at the highest SDR
  available in open source.
- **Mobile / on-device separation** — replaces a 1+ GB PyTorch install
  with `onnxruntime`'s 50 MB binary on iOS / Android.
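
For the karaoke case, the instrumental is the sum of the three non-vocal stems. A minimal sketch, assuming each stem is a `(2, samples)` float32 array like the ones this README's Python examples return; `make_instrumental` is a hypothetical helper and the stems here are random stand-ins.

```python
import numpy as np


def make_instrumental(stems: dict) -> np.ndarray:
    """Sum every non-vocal stem into one (channels, samples) instrumental."""
    parts = [stems[name] for name in ("drums", "bass", "other")]
    return np.sum(parts, axis=0)


# Stand-in stems. Real ones would come from the separation step.
rng = np.random.default_rng(0)
stems = {
    name: rng.standard_normal((2, 4410)).astype(np.float32) * 0.1
    for name in ("drums", "bass", "other", "vocals")
}

instrumental = make_instrumental(stems)
acapella = stems["vocals"]
print(instrumental.shape, acapella.shape)
```

Writing the result follows the same pattern as the other stems, for example `sf.write("karaoke.wav", instrumental.T, sr)`.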

---

## Quick start

### Python — as a library

```python
import bag_infer

stems = bag_infer.separate_all("your-song.mp3")
# stems: dict[str, numpy.ndarray (2, samples)]
#   stems["drums"], stems["bass"], stems["other"], stems["vocals"]
```

### Python — with execution provider control

```python
import soundfile as sf
import bag_infer

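# MP3 decoding via soundfile needs libsndfile >= 1.1.0; convert to WAV first on older builds.
audio, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True)
stems = bag_infer.separate(
    audio.T, sr,  # soundfile returns (samples, channels); transpose to (channels, samples)
    providers=["CPUExecutionProvider"],  # or "CoreMLExecutionProvider", etc.
)
for name, stem in stems.items():  # separate name so the input `audio` is not shadowed
    sf.write(f"{name}.wav", stem.T, sr)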
```

### CLI

```bash
python bag_infer.py your-song.mp3 ./out/
python bag_infer.py your-song.mp3 ./out/ --providers cuda
python bag_infer.py your-song.mp3 ./out/ --providers coreml
python bag_infer.py your-song.mp3 ./out/ --providers dml
```
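
The short names passed to `--providers` correspond to onnxruntime execution providers. How `bag_infer.py` resolves them internally is not shown here, so the mapping below is only a sketch: `PROVIDER_ALIASES` and `resolve_providers` are hypothetical names, while the provider strings are the ones onnxruntime itself uses.

```python
PROVIDER_ALIASES = {
    "cpu": "CPUExecutionProvider",
    "cuda": "CUDAExecutionProvider",
    "coreml": "CoreMLExecutionProvider",
    "dml": "DmlExecutionProvider",  # DirectML
}


def resolve_providers(alias: str) -> list:
    """Turn a short alias into an onnxruntime provider list with a CPU fallback."""
    try:
        name = PROVIDER_ALIASES[alias.lower()]
    except KeyError:
        known = ", ".join(sorted(PROVIDER_ALIASES))
        raise ValueError(f"unknown provider {alias!r}; expected one of: {known}") from None
    if name == "CPUExecutionProvider":
        return [name]
    # Listing CPU last lets onnxruntime fall back for unsupported operators.
    return [name, "CPUExecutionProvider"]


print(resolve_providers("coreml"))
```

The resulting list is the kind of value `onnxruntime.InferenceSession(..., providers=...)` accepts.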

### Web / mobile

Each specialist is a vanilla onnxruntime model; just load all 4 sessions
and reuse the aggregation logic in `bag_infer.py::separate`. See the
individual stem repos for platform-specific snippets:
[drums](https://huggingface.co/StemSplitio/htdemucs-ft-drums-onnx) ·
[bass](https://huggingface.co/StemSplitio/htdemucs-ft-bass-onnx) ·
[other](https://huggingface.co/StemSplitio/htdemucs-ft-other-onnx) ·
[vocals](https://huggingface.co/StemSplitio/htdemucs-ft-vocals-onnx).

---

## How aggregation works

The `htdemucs_ft` bag uses a **one-hot weight matrix** for combining the 4
sub-models — model 0's drums output is used directly as the bag's drums
stem, model 1's bass output is the bag's bass stem, and so on. No
weighted-sum aggregation needed.

That means:
- **The bag's drums stem == the drums specialist's drums output** (bit-exact in fp32)
- Same for bass, other, vocals
- So you can ship only the specialists you need and get identical
  per-stem quality to the full bag at 1/4 the size

`bag_infer.py` simply runs all 4 specialists and picks the relevant row
from each. ~30 lines of numpy.
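
To see why a one-hot weight matrix reduces to row picking, the sketch below compares a generic weighted bag average with direct selection. It assumes the bag combines models as a per-stem weighted average. `aggregate` is a hypothetical function, not the code in `bag_infer.py`, and the model outputs are random stand-ins.

```python
import numpy as np

STEMS = ["drums", "bass", "other", "vocals"]


def aggregate(model_outputs: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Per-stem weighted average over the models in a bag.

    model_outputs: (n_models, n_stems, channels, samples)
    weights:       (n_models, n_stems)
    returns:       (n_stems, channels, samples)
    """
    weighted = model_outputs * weights[:, :, None, None]
    return weighted.sum(axis=0) / weights.sum(axis=0)[:, None, None]


rng = np.random.default_rng(0)
outputs = rng.standard_normal((4, 4, 2, 16)).astype(np.float32)
one_hot = np.eye(4, dtype=np.float32)  # model i contributes only stem i

bag = aggregate(outputs, one_hot)
picked = np.stack([outputs[i, i] for i in range(len(STEMS))])

# Multiplying by 1 or 0 and dividing by 1 is exact in floating point,
# so the weighted average and the direct pick agree exactly.
assert np.array_equal(bag, picked)
```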

---

## Input / output spec per sub-model

| Tensor | Name | Shape | Dtype | Notes |
|---|---|---|---|---|
| Input | `mix` | `(1, 2, 343980)` | float32 | Stereo audio, 44.1 kHz, 7.8 s segment. |
| Output | `stems` | `(1, 4, 2, 343980)` | float32 | `[drums, bass, other, vocals]`. Use only the specialist's target row. |

For longer audio, the bag script handles overlap-add chunking.
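
The general idea of overlap-add chunking can be sketched as follows. This is not the code in `bag_infer.py`: the 25% overlap, the triangular window, and the zero-padded tail are assumptions, and `run_segment` is a hypothetical callable that would wrap a session call and add or remove the batch dimension.

```python
import numpy as np

SEGMENT = 343980  # samples per model call: 7.8 s at 44.1 kHz


def overlap_add(mix, run_segment, n_stems=4, segment=SEGMENT, overlap=0.25):
    """Separate a long (channels, samples) mix by chunking and cross-fading.

    run_segment maps one (channels, segment) chunk to a
    (n_stems, channels, segment) estimate.
    """
    channels, length = mix.shape
    stride = max(1, int(segment * (1 - overlap)))
    # Triangular window 1, 2, ..., peak, ..., 2, 1: chunk centers get more weight.
    idx = np.arange(segment)
    window = np.minimum(idx + 1, segment - idx).astype(np.float32)

    out = np.zeros((n_stems, channels, length), dtype=np.float32)
    weight_sum = np.zeros(length, dtype=np.float32)
    for start in range(0, length, stride):
        valid = min(segment, length - start)
        chunk = np.zeros((channels, segment), dtype=np.float32)
        chunk[:, :valid] = mix[:, start:start + valid]  # zero-pad the last chunk
        est = run_segment(chunk)
        out[:, :, start:start + valid] += est[:, :, :valid] * window[:valid]
        weight_sum[start:start + valid] += window[:valid]
    return out / weight_sum  # normalize by the accumulated window weight


# Demo with an identity "model" that copies the mix into every stem.
rng = np.random.default_rng(0)
demo_mix = rng.standard_normal((2, 1000)).astype(np.float32)
demo_out = overlap_add(demo_mix, lambda c: np.stack([c] * 4), segment=64)
print(demo_out.shape)
```

Dividing by the accumulated window weight keeps the output level consistent wherever chunks overlap, which is why the identity model in the demo reproduces the input.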

---

## Files in this repo

| File | Size | Purpose |
|---|---:|---|
| `htdemucs_ft_drums.onnx`  | 316 MB | Drums specialist (bag index 0) |
| `htdemucs_ft_bass.onnx`   | 316 MB | Bass specialist (bag index 1) |
| `htdemucs_ft_other.onnx`  | 316 MB | Other specialist (bag index 2) |
| `htdemucs_ft_vocals.onnx` | 316 MB | Vocals specialist (bag index 3) |
| `bag_infer.py` | 7 KB | Pure numpy aggregator. No torch. |
| `requirements.txt` | <1 KB | `onnxruntime`, `numpy`, `soundfile`. |
| `README.md` | | This file. |

Total: **~1.26 GB**. If that's too big, use individual stem repos.

---

## Related work

| Repo | Stem | Use when |
|---|---|---|
| [`htdemucs-ft-drums-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-drums-onnx) | drums | Only need drums (1/4 size, 1/4 latency) |
| [`htdemucs-ft-bass-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-bass-onnx) | bass | Only need bass |
| [`htdemucs-ft-other-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-other-onnx) | other | Only need the "other" stem (everything that is not drums, bass, or vocals) |
| [`htdemucs-ft-vocals-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-vocals-onnx) | vocals | **#1 open-source vocal SDR** |

PyTorch versions for HF Inference Endpoints:
[`htdemucs-ft-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-pytorch)
and its [4 sibling specialist repos](https://huggingface.co/StemSplitio).

---

## Skip the infrastructure — use the StemSplit API

Don't want to ship 1.26 GB of `.onnx` files in your app, manage a GPU
pool, or write overlap-add chunking? Use the
**[StemSplit API](https://stemsplit.io/developers)** instead — same models
under the hood, hosted for you, with credits and a dashboard.

- 🌐 [stemsplit.io](https://stemsplit.io)
- 📘 [Developer docs](https://stemsplit.io/developers/docs)
- 🔌 [API reference](https://stemsplit.io/developers/reference)

Or use the no-code tools that ship this same model family:

- 🎤 [Vocal Remover](https://stemsplit.io/vocal-remover)
- 🎶 [Karaoke Maker](https://stemsplit.io/karaoke-maker)
- 🎙️ [Acapella Maker](https://stemsplit.io/acapella-maker)
- 📺 [YouTube Stem Splitter](https://stemsplit.io/youtube-stem-splitter)

---

## License & attribution

MIT-licensed, matching the original HT-Demucs.

```bibtex
@inproceedings{rouard2023hybrid,
  title     = {Hybrid Transformers for Music Source Separation},
  author    = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
  booktitle = {ICASSP},
  year      = {2023}
}
```

- Original PyTorch model: [`facebookresearch/demucs`](https://github.com/facebookresearch/demucs)
- ONNX export, parity verification, and packaging by [StemSplit](https://stemsplit.io)
- Search keywords: htdemucs onnx, demucs onnx, htdemucs bag onnx, demucs ios, demucs android, music source separation onnx, 4-stem separation onnx, stem separation mobile, onnxruntime music separation