Initial release: htdemucs_ft bass specialist (PyTorch handler)
- README.md +169 -0
- handler.py +89 -0
- requirements.txt +5 -0
README.md
ADDED
@@ -0,0 +1,169 @@
---
language: en
license: mit
library_name: demucs
pipeline_tag: audio-to-audio
tags:
- demucs
- stem-separation
- source-separation
- bass-isolation
- music
- htdemucs
- audio-to-audio
- bass-extraction
- bassline-extraction
datasets:
- StemSplitio/stem-separation-benchmark-2026
inference: false
---

# HT-Demucs FT — Bass Specialist (PyTorch)

Bass isolation specialist from HT-Demucs FT, ~1/4 the size of the full ensemble.

This is sub-model 1 (zero-based, i.e. `bag.models[1]`) of the 4-model `htdemucs_ft` bag by
[Défossez et al. (Meta AI)][demucs-repo], extracted as a standalone
~160 MB model. It produces the **bass** stem with the same quality as
the full ensemble: median SDR **10.38 dB** on MUSDB18-HQ, second among all
models in our 2026 benchmark (behind mdx_extra_q at 11.42 dB), at roughly 1/4 the compute cost.

> Want all 4 stems in one request? Use the full ensemble:
> [`StemSplitio/htdemucs-ft-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-pytorch)
>
> Want a hosted REST API with credits and a dashboard? Use the
> [**StemSplit API**](https://stemsplit.io/developers).

---

## Why this model

| Property | This model | Full `htdemucs_ft` bag |
|---|---|---|
| Disk size | **~160 MB** | ~640 MB |
| Latency per 3-min song (M4 Pro, MPS) | **~22 s** (RTF 0.12) | ~47 s (RTF 0.26) |
| Bass SDR on MUSDB18-HQ | **10.38 dB** | 10.38 dB *(identical: the bag's `bass` output is this sub-model's output)* |
| Other stems returned | None (focused) | All 4 |

If you only need the bass stem in production, this is **strictly faster and
smaller** than the full ensemble with identical bass quality:
**~2.1× faster wall time** (~22 s vs ~47 s) in our smoke tests on M4 Pro MPS.

---
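
The latency figures in the table above can be cross-checked with a few lines of arithmetic. This is a minimal sketch that assumes RTF means processing time divided by audio duration (which matches the table's numbers) and that a "3-min song" is exactly 180 s; the helper names are hypothetical.

```python
# Latency figures copied from the table above (approximate, M4 Pro MPS).
SONG_SECONDS = 180.0        # assumed length of a "3-min song"
SPECIALIST_SECONDS = 22.0   # this model
FULL_BAG_SECONDS = 47.0     # full htdemucs_ft bag


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Assumed definition: processing time / audio duration (lower is faster)."""
    return processing_s / audio_s


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


rtf_specialist = real_time_factor(SPECIALIST_SECONDS, SONG_SECONDS)
rtf_full_bag = real_time_factor(FULL_BAG_SECONDS, SONG_SECONDS)
wall_time_speedup = speedup(FULL_BAG_SECONDS, SPECIALIST_SECONDS)

print(f"specialist RTF: {rtf_specialist:.2f}")   # 0.12
print(f"full bag RTF:   {rtf_full_bag:.2f}")     # 0.26
print(f"speedup:        {wall_time_speedup:.1f}x")  # 2.1x
```
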

## Common use cases

- **Bassline transcription**: extract bass for tab generation, MIDI conversion, or chord detection
- **Mix rebalancing**: isolate and re-equalise the bass bus on a finished mix
- **Music education**: learn basslines from any record by hearing them isolated
- **Sub-bass mastering reference**: compare your low-end against pro mixes

---

## Quick start (Python)

```python
import base64, io, json, soundfile as sf
from huggingface_hub import InferenceClient

with open("your-song.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

# If you deployed your own Inference Endpoint, pass its URL as `model` instead.
client = InferenceClient(model="StemSplitio/htdemucs-ft-bass-pytorch")
# `post` returns raw bytes, so parse the JSON body before indexing into it.
# Depending on your huggingface_hub version, `post` may be deprecated or unavailable.
result = json.loads(client.post(json={"inputs": audio_b64}))

wav, sr = sf.read(io.BytesIO(base64.b64decode(result["bass"])))
sf.write("out_bass.wav", wav, sr)
```

Or run locally without Hugging Face at all:

```python
import torch, soundfile as sf
from demucs.apply import apply_model
from demucs.audio import convert_audio
from demucs.pretrained import get_model

bag = get_model("htdemucs_ft")
model = bag.models[1].eval()  # the bass specialist
wav, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True)
wav = torch.from_numpy(wav.T).contiguous()  # (samples, channels) -> (channels, samples)
wav = convert_audio(wav, sr, bag.samplerate, bag.audio_channels).unsqueeze(0)

with torch.no_grad():
    stems = apply_model(model, wav, device="mps" if torch.backends.mps.is_available() else "cpu")[0]

# bag.sources == ["drums", "bass", "other", "vocals"]; pick the bass row
# .cpu() is a defensive no-op when the result is already on the CPU.
sf.write("out_bass.wav", stems[bag.sources.index("bass")].T.cpu().numpy(), bag.samplerate)
```
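
Why the specialist's bass row can match the full bag's bass output is easiest to see with a toy weighted bag. The sketch below assumes the bag uses one-hot weights per source, so each sub-model fully owns one stem; that is our reading of the "identical" claim in this README, not something verified against the demucs source here. The weights, the scalar "estimates", the sub-model ordering, and the `combine_bag` helper are all hypothetical stand-ins.

```python
SOURCES = ["drums", "bass", "other", "vocals"]

# Assumed one-hot weights: row k is sub-model k, column j is source j.
WEIGHTS = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Toy per-source estimates from each sub-model (one scalar stands in for a waveform).
ESTIMATES = [
    [0.9, 0.2, 0.1, 0.3],  # sub-model 0 (assumed drums specialist)
    [0.4, 0.8, 0.2, 0.1],  # sub-model 1 (bass specialist)
    [0.3, 0.3, 0.7, 0.2],  # sub-model 2 (assumed other specialist)
    [0.1, 0.2, 0.3, 0.6],  # sub-model 3 (assumed vocals specialist)
]


def combine_bag(estimates, weights):
    """Weighted average of sub-model outputs, computed independently per source."""
    n_sources = len(weights[0])
    combined = []
    for j in range(n_sources):
        total_weight = sum(w[j] for w in weights)
        weighted_sum = sum(w[j] * est[j] for w, est in zip(weights, estimates))
        combined.append(weighted_sum / total_weight)
    return combined


bag_output = combine_bag(ESTIMATES, WEIGHTS)
bass = SOURCES.index("bass")
# With one-hot weights, the bag's bass value is exactly sub-model 1's bass value.
print(bag_output[bass] == ESTIMATES[1][bass])
```
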

---

## Deploy on Hugging Face Inference Endpoints

Click **Deploy → Inference Endpoints** above, pick a GPU instance, and HF
will spin up a container running [`handler.py`](handler.py).

| Hardware | Latency for 3-min song |
|---|---:|
| NVIDIA L4 | ~3 s |
| NVIDIA T4 small | ~7 s |
| CPU x4 (basic) | ~48 s |

(Roughly 2.1× faster than the full-bag latency in the M4 Pro measurements above,
since we run only this specialist sub-model. Cloud GPU numbers are extrapolated
from those M4 Pro measurements.)

```bash
# Strip newlines from the base64 output (GNU base64 wraps lines) and stream the
# JSON body over stdin so large files do not hit shell argument-length limits.
{ printf '{"inputs": "'; base64 < your-song.mp3 | tr -d '\n'; printf '"}'; } |
  curl -X POST https://<your-endpoint>.endpoints.huggingface.cloud \
    -H "Authorization: Bearer $HF_TOKEN" \
    -H "Content-Type: application/json" \
    -d @-
```
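
For clients that prefer Python over curl, the request and response shapes documented in `handler.py` can be wrapped in two small helpers. This is a minimal sketch using only the standard library; the helper names `build_request_body` and `extract_bass_wav` are hypothetical, and the network call is deliberately left out so the example runs offline against a stand-in response.

```python
import base64
import json


def build_request_body(audio_bytes: bytes) -> str:
    """Serialize raw audio bytes into the JSON body the handler expects."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return json.dumps({"inputs": encoded})


def extract_bass_wav(response_text: str) -> bytes:
    """Return the decoded WAV bytes, or raise if the handler reported an error."""
    payload = json.loads(response_text)
    if "error" in payload:
        raise ValueError(payload["error"])
    return base64.b64decode(payload["bass"])


# Offline round trip with stand-in bytes instead of a real song or endpoint.
body = build_request_body(b"fake-audio-bytes")
fake_response = json.dumps({
    "bass": base64.b64encode(b"RIFF-fake-wav").decode("ascii"),
    "sample_rate": 44100,
    "duration_s": 1.0,
})
wav_bytes = extract_bass_wav(fake_response)
print(sorted(json.loads(body)), len(wav_bytes))
```
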

---

## Try it in your browser, no code

- [StemSplit](https://stemsplit.io)
- [StemSplit API](https://stemsplit.io/developers)
- [Developer docs](https://stemsplit.io/developers/docs)
- [API reference](https://stemsplit.io/developers/reference)

---

## Related models from StemSplit

| Repo | Stem | When to use |
|---|---|---|
| [`htdemucs-ft-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-pytorch) | all 4 | When you need vocals + drums + bass + other in one request |
| [`htdemucs-ft-vocals-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-vocals-pytorch) | vocals | Best vocal SDR in our benchmark (9.19 dB); karaoke, acapella |
| [`htdemucs-ft-drums-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-drums-pytorch) | drums | Drum extraction, beat transcription, sample-pack creation |
| [`htdemucs-ft-bass-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-bass-pytorch) | bass | Bassline transcription, mix rebalancing |
| [`htdemucs-ft-other-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-other-pytorch) | other / instrumental | Karaoke instrumentals, sample-flipping, music-bed extraction |

Full benchmark across every popular open-source separator:
[StemSplitio/stem-separation-benchmark-2026](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026).

---

## License & attribution

This repo is **MIT-licensed**, matching the original HT-Demucs.

**Original authors (please cite if you use this model in research):**

```bibtex
@inproceedings{rouard2023hybrid,
  title     = {Hybrid Transformers for Music Source Separation},
  author    = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
  booktitle = {ICASSP},
  year      = {2023}
}
```

- Original model: [`facebookresearch/demucs`][demucs-repo]
- Packaging by [StemSplit](https://stemsplit.io)
- Search keywords: bass extraction, isolate bass from song, bassline extractor, AI bass separator

[demucs-repo]: https://github.com/facebookresearch/demucs
handler.py
ADDED
@@ -0,0 +1,89 @@
"""
HF Inference Endpoint handler for the HT-Demucs FT **bass** specialist.

This repo ships only sub-model 1 of the 4-bag htdemucs_ft ensemble,
the one trained to extract `bass`. ~160 MB on disk and ~1/4 the inference
cost of the full bag, with the same per-stem quality as in our v1.1 benchmark
(median bass SDR = 10.38 dB).

If you need all 4 stems in one request, use the full ensemble:
    https://huggingface.co/StemSplitio/htdemucs-ft-pytorch

Request shape:
    POST /
    Content-Type: application/json
    { "inputs": "<base64-encoded audio bytes>" }

Response shape:
    { "bass": "<base64 WAV>", "sample_rate": 44100, "duration_s": 123.4 }
"""
from __future__ import annotations

import base64
import io
from typing import Any

import numpy as np
import soundfile as sf
import torch
from demucs.apply import apply_model
from demucs.audio import convert_audio
from demucs.pretrained import get_model

# Which sub-model of the htdemucs_ft bag to ship + which output index is ours.
BAG_INDEX = 1
TARGET_STEM = "bass"


def _audio_to_b64_wav(audio: torch.Tensor, sample_rate: int) -> str:
    np_audio = np.clip(audio.cpu().numpy().T, -1.0, 1.0)
    buf = io.BytesIO()
    sf.write(buf, np_audio, sample_rate, subtype="PCM_16", format="WAV")
    return base64.b64encode(buf.getvalue()).decode("ascii")


class EndpointHandler:
    def __init__(self, path: str = "") -> None:
        # Load the full bag, then drop the other 3 sub-models so only the
        # bass specialist stays in memory.
        bag = get_model("htdemucs_ft")
        self.model = bag.models[BAG_INDEX]
        self.model.eval()
        self.device = torch.device(
            "cuda" if torch.cuda.is_available() else
            "mps" if torch.backends.mps.is_available() else
            "cpu"
        )
        self.model.to(self.device)
        self.sample_rate = int(bag.samplerate)
        self.audio_channels = int(bag.audio_channels)
        self.sources = list(bag.sources)  # ["drums","bass","other","vocals"]
        self.target_index = self.sources.index(TARGET_STEM)

    def __call__(self, data: dict[str, Any]) -> dict[str, Any]:
        if "inputs" not in data:
            return {"error": "Request body must include base64 audio under 'inputs'."}

        try:
            audio_bytes = base64.b64decode(data["inputs"])
            wav_np, sr = sf.read(io.BytesIO(audio_bytes), dtype="float32", always_2d=True)
        except Exception as e:  # noqa: BLE001
            return {"error": f"Could not decode audio: {type(e).__name__}: {e}"}

        wav = torch.from_numpy(wav_np.T).contiguous()
        wav = convert_audio(wav, sr, self.sample_rate, self.audio_channels)
        wav = wav.unsqueeze(0).to(self.device)

        with torch.no_grad():
            # apply_model on a single Model (not a BagOfModels) is supported
            # and runs only this specialist, at roughly 1/4 the cost of the full bag.
            stems = apply_model(self.model, wav, device=str(self.device), progress=False)[0]
            # stems: (n_sources, channels, samples). Only stems[target_index]
            # is meaningful for this specialist; the other rows are weakly
            # predicted by-products and should not be used.

        return {
            "bass": _audio_to_b64_wav(stems[self.target_index], self.sample_rate),
            "sample_rate": self.sample_rate,
            "duration_s": round(wav.shape[-1] / self.sample_rate, 3),
        }
requirements.txt
ADDED
@@ -0,0 +1,5 @@
torch>=2.2,<2.6
torchaudio>=2.2,<2.6
demucs==4.0.1
numpy>=1.26,<2.0
soundfile>=0.12