---
language: en
license: mit
library_name: demucs
pipeline_tag: audio-to-audio
tags:
- demucs
- stem-separation
- source-separation
- other-isolation
- music
- htdemucs
- audio-to-audio
- instrumental-extraction
- karaoke
- music-minus-one
datasets:
- StemSplitio/stem-separation-benchmark-2026
inference: false
---

# HT-Demucs FT: Instrumental / Other Specialist (PyTorch)

Melodic / instrumental specialist from HT-Demucs FT: everything that isn't vocals, drums, or bass.

This is sub-model 2 of the 4-model `htdemucs_ft` ensemble by [Défossez et al. (Meta AI)][demucs-repo], extracted as a standalone ~160 MB model. It produces the **other** stem with the same quality as the full ensemble at roughly 1/4 the compute cost. Median SDR is **6.34 dB** on MUSDB18-HQ, 2nd of all models in our 2026 benchmark (behind mdx_extra_q at 7.67 dB).

> Want all 4 stems in one request? Use the full ensemble:
> [`StemSplitio/htdemucs-ft-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-pytorch)
>
> Want a hosted REST API with credits and a dashboard? Use the
> [**StemSplit API**](https://stemsplit.io/developers).

---

## Why this model

| Property | This model | Full `htdemucs_ft` bag |
|---|---|---|
| Disk size | **~160 MB** | ~640 MB |
| Per-3-min-song latency (M4 Pro MPS) | **~22 s** (RTF 0.12) | ~47 s (RTF 0.26) |
| Instrumental / Other SDR on MUSDB18-HQ | **6.34 dB** | 6.34 dB *(identical: the bag's `other` output is this sub-model's output)* |
| Other stems returned | None (focused) | All 4 |

If you only need the other stem in production, this is **strictly faster and smaller** than the full ensemble with identical other quality: **~2.6× faster wall time** in our smoke tests on M4 Pro MPS.

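
The RTF (real-time factor) values in the table are processing time divided by audio duration, so lower is faster. A minimal sketch of that arithmetic for the 3-minute reference song; the function name `real_time_factor` is hypothetical and not part of Demucs:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


SONG_SECONDS = 3 * 60  # the 3-minute reference song from the table

# Latencies taken from the table above (M4 Pro MPS)
specialist_rtf = real_time_factor(22, SONG_SECONDS)
full_bag_rtf = real_time_factor(47, SONG_SECONDS)

print(f"specialist RTF: {specialist_rtf:.2f}")  # 0.12
print(f"full bag RTF:   {full_bag_rtf:.2f}")    # 0.26
```
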

---

## Common use cases

- **Karaoke / instrumental tracks**: extract the music-minus-vocals layer for karaoke mixes (pair it with our `htdemucs-ft-vocals-pytorch` model to round-trip)
- **Sample-flipping**: isolate guitar/keys/synth lines for chopping and remixing
- **Cover-song production**: remove vocals and rebalance the instrumental bed
- **Music beds for video**: strip vocals from licensed tracks for use under spoken word (check your sync rights first)

---

## Quick start (Python)

```python
import base64, io, json
import soundfile as sf
from huggingface_hub import InferenceClient

# Base64-encode the input file for the JSON payload
with open("your-song.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

client = InferenceClient(model="StemSplitio/htdemucs-ft-other-pytorch")
# post() returns the raw response bytes, so parse the JSON body before indexing
result = json.loads(client.post(json={"inputs": audio_b64}))

# The "other" field holds the base64-encoded separated audio
wav, sr = sf.read(io.BytesIO(base64.b64decode(result["other"])))
sf.write("out_other.wav", wav, sr)
```

Or run locally without Hugging Face at all:

```python
import torch, soundfile as sf
from demucs.apply import apply_model
from demucs.audio import convert_audio
from demucs.pretrained import get_model

bag = get_model("htdemucs_ft")
model = bag.models[2].eval()  # the other specialist

# Load as (channels, samples) and match the model's sample rate and channel count
wav, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True)
wav = torch.from_numpy(wav.T).contiguous()
wav = convert_audio(wav, sr, bag.samplerate, bag.audio_channels).unsqueeze(0)

device = "mps" if torch.backends.mps.is_available() else "cpu"
with torch.no_grad():
    stems = apply_model(model, wav, device=device)[0]

# bag.sources == ["drums", "bass", "other", "vocals"]; pick the other row
other = stems[bag.sources.index("other")]
sf.write("out_other.wav", other.T.cpu().numpy(), bag.samplerate)
```

---

## Deploy on Hugging Face Inference Endpoints

Click **Deploy → Inference Endpoints** above, pick a GPU instance, and HF will spin up a container running [`handler.py`](handler.py).

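
A deployed endpoint is called with the same JSON shape as the quick start: base64 audio under `inputs` in the request, and base64 audio under `other` in the response. The following is a minimal standard-library sketch of those two steps. The helper names `build_request_body` and `extract_other_stem` are hypothetical, and the field names are assumed from the examples in this card rather than read from `handler.py`:

```python
import base64
import json


def build_request_body(audio_bytes: bytes) -> str:
    """Wrap raw audio bytes in the {"inputs": <base64>} JSON body."""
    return json.dumps({"inputs": base64.b64encode(audio_bytes).decode("ascii")})


def extract_other_stem(response_body: bytes) -> bytes:
    """Return the decoded audio bytes stored under the "other" key (assumed field name)."""
    payload = json.loads(response_body)
    if "other" not in payload:
        raise KeyError('response has no "other" field')
    return base64.b64decode(payload["other"])


# Round-trip with placeholder bytes instead of a real audio file
body = build_request_body(b"fake-audio")
fake_response = json.dumps({"other": json.loads(body)["inputs"]}).encode()
assert extract_other_stem(fake_response) == b"fake-audio"
```
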

| Hardware | Latency for 3-min song |
|---|---:|
| NVIDIA L4 | ~3 s |
| NVIDIA T4 small | ~7 s |
| CPU x4 (basic) | ~48 s |

Roughly 2.6× faster than the full-bag latency, since only this specialist sub-model runs. The cloud GPU numbers are extrapolated from M4 Pro measurements, not measured directly.

```bash
# Build the JSON body and pipe it to curl on stdin: a base64-encoded song is too
# large for a command-line argument, and GNU base64 wraps lines unless stripped
printf '{"inputs": "%s"}' "$(base64 < your-song.mp3 | tr -d '\n')" |
  curl -X POST https://.endpoints.huggingface.cloud \
    -H "Authorization: Bearer $HF_TOKEN" \
    -H "Content-Type: application/json" \
    -d @-
```

---

## Try it in your browser, no code

- [Karaoke Maker](https://stemsplit.io/karaoke-maker)
- [Vocal Remover](https://stemsplit.io/vocal-remover)
- [YouTube Stem Splitter](https://stemsplit.io/youtube-stem-splitter)
- [StemSplit API](https://stemsplit.io/developers)
- [Developer docs](https://stemsplit.io/developers/docs)
- [API reference](https://stemsplit.io/developers/reference)

---

## Related models from StemSplit

| Repo | Stem | When to use |
|---|---|---|
| [`htdemucs-ft-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-pytorch) | all 4 | When you need vocals + drums + bass + other in one request |
| [`htdemucs-ft-vocals-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-vocals-pytorch) | vocals | Best vocal SDR in our benchmark (9.19 dB): karaoke, acapella |
| [`htdemucs-ft-drums-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-drums-pytorch) | drums | Drum extraction, beat transcription, sample-pack creation |
| [`htdemucs-ft-bass-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-bass-pytorch) | bass | Bassline transcription, mix rebalancing |
| [`htdemucs-ft-other-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-other-pytorch) | other / instrumental | Karaoke instrumentals, sample-flipping, music-bed extraction |

Full benchmark across every popular open-source separator: [StemSplitio/stem-separation-benchmark-2026](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026).


---

## License & attribution

This repo is **MIT-licensed**, matching the original HT-Demucs.

**Original authors (please cite if you use this model in research):**

```bibtex
@inproceedings{rouard2023hybrid,
  title     = {Hybrid Transformers for Music Source Separation},
  author    = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
  booktitle = {ICASSP},
  year      = {2023}
}
```

- Original model: [`facebookresearch/demucs`][demucs-repo]
- Packaging by [StemSplit](https://stemsplit.io)
- Search keywords: instrumental extractor, karaoke maker, music minus vocals, AI instrumental separator

[demucs-repo]: https://github.com/facebookresearch/demucs