StemSplit committed on
Commit
99bf071
·
verified ·
1 Parent(s): 40e9223

Initial release: htdemucs_ft bass specialist (PyTorch handler)

Files changed (3)
  1. README.md +169 -0
  2. handler.py +89 -0
  3. requirements.txt +5 -0
README.md ADDED
@@ -0,0 +1,169 @@
1
+ ---
2
+ language: en
3
+ license: mit
4
+ library_name: demucs
5
+ pipeline_tag: audio-to-audio
6
+ tags:
7
+ - demucs
8
+ - stem-separation
9
+ - source-separation
10
+ - bass-isolation
11
+ - music
12
+ - htdemucs
13
+ - audio-to-audio
14
+ - bass-extraction
15
+ - bass-isolation
16
+ - bassline-extraction
17
+ datasets:
18
+ - StemSplitio/stem-separation-benchmark-2026
19
+ inference: false
20
+ ---
21
+
22
+ # HT-Demucs FT — Bass Specialist (PyTorch)
23
+
24
+ Bass isolation specialist from HT-Demucs FT, ~1/4 the size of the full ensemble.
25
+
26
+ This is sub-model 1 of the 4-bag `htdemucs_ft` ensemble by
27
+ [Défossez et al. (Meta AI)][demucs-repo], extracted as a standalone
28
+ ~160 MB model. It produces the **bass** stem with the same quality as
29
+ the full ensemble (median SDR **10.38 dB** on MUSDB18-HQ, second among all
30
+ models in our 2026 benchmark, behind `mdx_extra_q` at 11.42 dB) at roughly 1/4 the compute cost.
31
+
32
+ > Want all 4 stems in one request? Use the full ensemble:
33
+ > [`StemSplitio/htdemucs-ft-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-pytorch)
34
+ >
35
+ > Want a hosted REST API with credits and a dashboard? Use the
36
+ > [**StemSplit API**](https://stemsplit.io/developers).
37
+
38
+ ---
39
+
40
+ ## Why this model
41
+
42
+ | Property | This model | Full `htdemucs_ft` bag |
43
+ |---|---|---|
44
+ | Disk size | **~160 MB** | ~640 MB |
45
+ | Per-3-min-song latency (M4 Pro MPS) | **~22 s** (RTF 0.12) | ~47 s (RTF 0.26) |
46
+ | Bass SDR on MUSDB18-HQ | **10.38 dB** | 10.38 dB *(identical — the bag's `bass` output IS this sub-model's output)* |
47
+ | Other stems returned | None (focused) | All 4 |
48
+
49
+ If you only need the bass stem in production, this is **strictly faster and
50
+ smaller** than the full ensemble with identical bass quality —
51
+ **~2.1× faster wall time** in our smoke tests on M4 Pro MPS (47 s vs. 22 s for a 3-minute song).
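As a sanity check on the table above, the real-time factor (RTF) is processing time divided by audio duration, and the speed-up is the ratio of the two latencies. A minimal sketch using only the rounded figures quoted in the table (the helper and variable names are illustrative, not part of this repo):

```python
AUDIO_SECONDS = 3 * 60  # the table's reference song length


def rtf(latency_s: float, audio_s: float = AUDIO_SECONDS) -> float:
    """Real-time factor: seconds of compute per second of audio."""
    return latency_s / audio_s


specialist_s, full_bag_s = 22.0, 47.0  # rounded latencies from the table

specialist_rtf = round(rtf(specialist_s), 2)
full_bag_rtf = round(rtf(full_bag_s), 2)
speedup = round(full_bag_s / specialist_s, 2)
print(specialist_rtf, full_bag_rtf, speedup)  # prints: 0.12 0.26 2.14
```

With these rounded inputs the ratio comes out near 2.1; measured wall times will vary with hardware and song length.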
52
+
53
+ ---
54
+
55
+ ## Common use cases
56
+
57
+ - **Bassline transcription** — extract bass for tab generation, MIDI conversion, or chord detection
58
+ - **Mix rebalancing** — isolate and re-equalise the bass bus on a finished mix
59
+ - **Music education** — learn basslines from any record by hearing them isolated
60
+ - **Sub-bass mastering reference** — compare your low-end against pro mixes
61
+
62
+ ---
63
+
64
+ ## Quick start (Python)
65
+
66
+ ```python
67
+ import base64, io, json, soundfile as sf
68
+ from huggingface_hub import InferenceClient
69
+
70
+ with open("your-song.mp3", "rb") as f:
71
+     audio_b64 = base64.b64encode(f.read()).decode()
72
+
73
+ client = InferenceClient(model="StemSplitio/htdemucs-ft-bass-pytorch")
74
+ result = json.loads(client.post(json={"inputs": audio_b64}))  # post() returns raw bytes
75
+
76
+ wav, sr = sf.read(io.BytesIO(base64.b64decode(result["bass"])))
77
+ sf.write("out_bass.wav", wav, sr)
78
+ ```
79
+
80
+ Or run locally without Hugging Face at all:
81
+
82
+ ```python
83
+ import torch, soundfile as sf
84
+ from demucs.apply import apply_model
85
+ from demucs.audio import convert_audio
86
+ from demucs.pretrained import get_model
87
+
88
+ bag = get_model("htdemucs_ft")
89
+ model = bag.models[1].eval() # the bass specialist
90
+ wav, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True)
91
+ wav = torch.from_numpy(wav.T).contiguous()
92
+ wav = convert_audio(wav, sr, bag.samplerate, bag.audio_channels).unsqueeze(0)
93
+
94
+ with torch.no_grad():
95
+     stems = apply_model(model, wav, device="mps" if torch.backends.mps.is_available() else "cpu")[0]
96
+
97
+ # bag.sources == ["drums", "bass", "other", "vocals"]; pick the bass row
98
+ sf.write("out_bass.wav", stems[bag.sources.index("bass")].T.numpy(), bag.samplerate)
99
+ ```
100
+
101
+ ---
102
+
103
+ ## Deploy on Hugging Face Inference Endpoints
104
+
105
+ Click **Deploy → Inference Endpoints** above, pick a GPU instance, and HF
106
+ will spin up a container running [`handler.py`](handler.py).
107
+
108
+ | Hardware | Latency for 3-min song |
109
+ |---|---:|
110
+ | NVIDIA L4 | ~3 s |
111
+ | NVIDIA T4 small | ~7 s |
112
+ | CPU x4 (basic) | ~48 s |
113
+
114
+ (Roughly 2.1× faster than the full-bag latency, since only this
115
+ specialist sub-model runs. Cloud GPU numbers are extrapolated from M4 Pro measurements.)
116
+
117
+ ```bash
118
+ curl -X POST https://<your-endpoint>.endpoints.huggingface.cloud \
119
+ -H "Authorization: Bearer $HF_TOKEN" \
120
+ -H "Content-Type: application/json" \
121
+ -d "{\"inputs\": \"$(base64 < your-song.mp3)\"}"
122
+ ```
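The same request can be built from Python. The sketch below uses only the standard library and mirrors the curl call above: a JSON body with the base64 audio under `inputs`, plus a bearer-token header. The function name `build_request`, the placeholder URL, and the placeholder token are illustrative assumptions; nothing is sent until the request is passed to `urllib.request.urlopen`.

```python
import base64
import json
import urllib.request


def build_request(audio_bytes: bytes, endpoint_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request the endpoint expects."""
    body = json.dumps({"inputs": base64.b64encode(audio_bytes).decode("ascii")})
    return urllib.request.Request(
        endpoint_url,
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values for illustration only.
req = build_request(b"fake-audio", "https://example.invalid/", "hf_xxx")
payload = json.loads(req.data)
```

To send it, `with urllib.request.urlopen(req) as resp: result = json.load(resp)` should yield the response dictionary described in `handler.py`.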
123
+
124
+ ---
125
+
126
+ ## Try it in your browser, no code
127
+
128
+ - [StemSplit](https://stemsplit.io)
129
+ - [StemSplit API](https://stemsplit.io/developers)
130
+ - [Developer docs](https://stemsplit.io/developers/docs)
131
+ - [API reference](https://stemsplit.io/developers/reference)
132
+
133
+ ---
134
+
135
+ ## Related models from StemSplit
136
+
137
+ | Repo | Stem | When to use |
138
+ |---|---|---|
139
+ | [`htdemucs-ft-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-pytorch) | all 4 | When you need vocals + drums + bass + other in one request |
140
+ | [`htdemucs-ft-vocals-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-vocals-pytorch) | vocals | Best vocal SDR in our benchmark (9.19 dB); karaoke, acapella |
141
+ | [`htdemucs-ft-drums-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-drums-pytorch) | drums | Drum extraction, beat transcription, sample-pack creation |
142
+ | [`htdemucs-ft-bass-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-bass-pytorch) | bass | Bassline transcription, mix rebalancing |
143
+ | [`htdemucs-ft-other-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-other-pytorch) | other / instrumental | Karaoke instrumentals, sample-flipping, music-bed extraction |
144
+
145
+ Full benchmark across every popular open-source separator:
146
+ [StemSplitio/stem-separation-benchmark-2026](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026).
147
+
148
+ ---
149
+
150
+ ## License & attribution
151
+
152
+ This repo is **MIT-licensed**, matching the original HT-Demucs.
153
+
154
+ **Original authors (please cite if you use this model in research):**
155
+
156
+ ```bibtex
157
+ @inproceedings{rouard2023hybrid,
158
+ title = {Hybrid Transformers for Music Source Separation},
159
+ author = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
160
+ booktitle = {ICASSP},
161
+ year = {2023}
162
+ }
163
+ ```
164
+
165
+ - Original model: [`facebookresearch/demucs`][demucs-repo]
166
+ - Packaging by [StemSplit](https://stemsplit.io)
167
+ - Search keywords: bass extraction, isolate bass from song, bassline extractor, AI bass separator
168
+
169
+ [demucs-repo]: https://github.com/facebookresearch/demucs
handler.py ADDED
@@ -0,0 +1,89 @@
1
+ """
2
+ HF Inference Endpoint handler for the HT-Demucs FT **bass** specialist.
3
+
4
+ This repo ships only sub-model 1 of the 4-bag htdemucs_ft ensemble,
5
+ the one trained to extract `bass`. It is ~160 MB on disk and ~1/4 the inference
6
+ cost of the full bag, with the same bass quality measured in our v1.1 benchmark
7
+ (median bass SDR = 10.38 dB).
8
+
9
+ If you need all 4 stems in one request, use the full ensemble:
10
+ https://huggingface.co/StemSplitio/htdemucs-ft-pytorch
11
+
12
+ Request shape:
13
+     POST /
14
+     Content-Type: application/json
15
+     { "inputs": "<base64-encoded audio bytes>" }
16
+
17
+ Response shape:
18
+     { "bass": "<base64 WAV>", "sample_rate": 44100, "duration_s": 123.4 }
19
+ """
20
+ from __future__ import annotations
21
+
22
+ import base64
23
+ import io
24
+ from typing import Any
25
+
26
+ import numpy as np
27
+ import soundfile as sf
28
+ import torch
29
+ from demucs.apply import apply_model
30
+ from demucs.audio import convert_audio
31
+ from demucs.pretrained import get_model
32
+
33
+ # Which sub-model of the htdemucs_ft bag to ship + which output index is ours.
34
+ BAG_INDEX = 1
35
+ TARGET_STEM = "bass"
36
+
37
+
38
+ def _audio_to_b64_wav(audio: torch.Tensor, sample_rate: int) -> str:
39
+     np_audio = np.clip(audio.cpu().numpy().T, -1.0, 1.0)
40
+     buf = io.BytesIO()
41
+     sf.write(buf, np_audio, sample_rate, subtype="PCM_16", format="WAV")
42
+     return base64.b64encode(buf.getvalue()).decode("ascii")
43
+
44
+
45
+ class EndpointHandler:
46
+     def __init__(self, path: str = "") -> None:
47
+         # Load the full bag, then drop the other 3 sub-models so only the
48
+         # bass specialist stays in memory.
49
+         bag = get_model("htdemucs_ft")
50
+         self.model = bag.models[BAG_INDEX]
51
+         self.model.eval()
52
+         self.device = torch.device(
53
+             "cuda" if torch.cuda.is_available() else
54
+             "mps" if torch.backends.mps.is_available() else
55
+             "cpu"
56
+         )
57
+         self.model.to(self.device)
58
+         self.sample_rate = int(bag.samplerate)
59
+         self.audio_channels = int(bag.audio_channels)
60
+         self.sources = list(bag.sources)  # ["drums","bass","other","vocals"]
61
+         self.target_index = self.sources.index(TARGET_STEM)
62
+
63
+     def __call__(self, data: dict[str, Any]) -> dict[str, Any]:
64
+         if "inputs" not in data:
65
+             return {"error": "Request body must include base64 audio under 'inputs'."}
66
+
67
+         try:
68
+             audio_bytes = base64.b64decode(data["inputs"])
69
+             wav_np, sr = sf.read(io.BytesIO(audio_bytes), dtype="float32", always_2d=True)
70
+         except Exception as e:  # noqa: BLE001
71
+             return {"error": f"Could not decode audio: {type(e).__name__}: {e}"}
72
+
73
+         wav = torch.from_numpy(wav_np.T).contiguous()
74
+         wav = convert_audio(wav, sr, self.sample_rate, self.audio_channels)
75
+         wav = wav.unsqueeze(0).to(self.device)
76
+
77
+         with torch.no_grad():
78
+             # apply_model on a single Model (not a BagOfModels) is supported
79
+             # and runs only this specialist, at 1/4 the cost of the full bag.
80
+             stems = apply_model(self.model, wav, device=str(self.device), progress=False)[0]
81
+             # stems: (n_sources, channels, samples). Only stems[target_index]
82
+             # is meaningful for this specialist; the other rows are weakly
83
+             # predicted by-products and should not be used.
84
+
85
+         return {
86
+             "bass": _audio_to_b64_wav(stems[self.target_index], self.sample_rate),
87
+             "sample_rate": self.sample_rate,
88
+             "duration_s": round(wav.shape[-1] / self.sample_rate, 3),
89
+         }
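On the client side, the response can likely be unpacked without extra dependencies, since the handler writes 16-bit PCM WAV, which the standard-library `wave` module should be able to read. A minimal sketch, assuming the response shape documented in the handler's docstring; `decode_bass` is a hypothetical helper name, and the tiny synthetic response exists only to exercise it:

```python
import base64
import io
import wave


def decode_bass(response: dict) -> tuple[bytes, int, int]:
    """Return (raw PCM frames, sample rate, channel count) from a handler response."""
    if "error" in response:
        # The handler reports failures as {"error": "..."} rather than raising.
        raise ValueError(response["error"])
    wav_bytes = base64.b64decode(response["bass"])
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        return wf.readframes(wf.getnframes()), wf.getframerate(), wf.getnchannels()


# Build a synthetic one-frame stereo response for illustration.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(2)
    wf.setsampwidth(2)  # 16-bit samples, matching PCM_16
    wf.setframerate(44100)
    wf.writeframes(b"\x00\x00\x00\x00")
fake = {"bass": base64.b64encode(buf.getvalue()).decode("ascii"), "sample_rate": 44100}

frames, rate, channels = decode_bass(fake)
```

Checking for the `error` key first matters because the handler returns error dictionaries with a normal response body instead of raising.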
requirements.txt ADDED
@@ -0,0 +1,5 @@
1
+ torch>=2.2,<2.6
2
+ torchaudio>=2.2,<2.6
3
+ demucs==4.0.1
4
+ numpy>=1.26,<2.0
5
+ soundfile>=0.12