--- license: apache-2.0 language: - ach - af - en - ee - ff - ha - ig - ki - rw - lgg - ln - lg - luo - nyn - st - sw - teo - tn - xh - yo library_name: transformers pipeline_tag: text-to-speech tags: - text-to-speech - tts - orpheus - multilingual - multi-speaker - african-languages - low-resource-language - sunbird - snac - unsloth - llama datasets: - Sunbird/tts base_model: unsloth/orpheus-3b-0.1-pretrained --- # Orpheus-3B Sunbird Multilingual TTS A multilingual, multi-speaker text-to-speech model fine-tuned from [`unsloth/orpheus-3b-0.1-pretrained`](https://huggingface.co/unsloth/orpheus-3b-0.1-pretrained) on the **full** [`Sunbird/tts`](https://huggingface.co/datasets/Sunbird/tts) corpus — 20 language configurations and every speaker present in the dataset. The model accepts arbitrary text and emits 24 kHz mono speech via the [SNAC](https://huggingface.co/hubertsiuzdak/snac_24khz) audio codec. Voice selection happens at the prompt level: prepend the chosen `speaker_id` followed by `": "` to your text, and the model produces audio in that speaker's voice. ## Quick links - **Base model:** [`unsloth/orpheus-3b-0.1-pretrained`](https://huggingface.co/unsloth/orpheus-3b-0.1-pretrained) (Llama-3 architecture) - **Audio codec:** [`hubertsiuzdak/snac_24khz`](https://huggingface.co/hubertsiuzdak/snac_24khz) (24 kHz, 7 codes per ~12 ms frame) - **Training dataset:** [`Sunbird/tts`](https://huggingface.co/datasets/Sunbird/tts) — all 20 configs, all speakers - **Training framework:** [Unsloth](https://github.com/unslothai/unsloth) + HuggingFace Trainer ## Languages covered Speaker IDs encode both the source corpus (`salt_*`, `waxal_*`, `slr32_*`, `slr129_*`, `bateesa_*`) and the language. Languages marked with an em dash in the Speaker IDs column are present in the model's training mix but do not currently expose individual voice IDs in this checkpoint. | Config | Language | ISO 639-1 | Region | Speaker IDs | |---|---|---|---|---| | `ach` | Acholi | — | Uganda, South Sudan | `salt_ach_0001`
`waxal_ach_0001`
`waxal_ach_0005`
`waxal_ach_0006`
`waxal_ach_0008` | | `afr` | Afrikaans | af | South Africa, Namibia | `slr32_afr_0009` | | `eng` | English | en | (control language) | `salt_eng_0001`
`salt_eng_0002`
`salt_eng_0003` | | `ewe` | Ewe | ee | Ghana, Togo | `slr129_ewe_0001` | | `ful` | Fulah | ff | West Africa (Sahel) | `waxal_ful_0003`
`waxal_ful_0004`
`waxal_ful_0006` | | `hau` | Hausa | ha | Nigeria, Niger, Chad | `waxal_hau_0004`
`waxal_hau_0006`
`waxal_hau_0007`
`waxal_hau_0008` | | `ibo` | Igbo | ig | Nigeria | `waxal_ibo_0003`
`waxal_ibo_0005`
`waxal_ibo_0008` | | `kik` | Kikuyu | ki | Kenya | `waxal_kik_0003`
`waxal_kik_0004` | | `kin` | Kinyarwanda | rw | Rwanda | `bateesa_kin_0001` | | `lgg` | Lugbara | — | Uganda, DRC | — | | `lin` | Lingala | ln | DRC, Republic of Congo | `slr129_lin_0001` | | `lug` | Luganda | lg | Uganda | `salt_lug_0001`
`waxal_lug_0002`
`waxal_lug_0003`
`waxal_lug_0004`
`waxal_lug_0005`
`waxal_lug_0006`
`waxal_lug_0007`
`waxal_lug_0008` | | `luo` | Luo (Dholuo) | — | Kenya, Tanzania | `waxal_luo_0001`
`waxal_luo_0002`
`waxal_luo_0003`
`waxal_luo_0004` | | `nyn` | Runyankole | — | Uganda | `salt_nyn_0001`
`waxal_nyn_0003`
`waxal_nyn_0004`
`waxal_nyn_0007`
`waxal_nyn_0008` | | `sot` | Sesotho | st | Lesotho, South Africa | — | | `swa` | Swahili | sw | East Africa | `waxal_swa_0006`
`waxal_swa_0007` | | `teo` | Ateso | — | Uganda, Kenya | `salt_teo_0001` | | `tsn` | Setswana | tn | Botswana, South Africa | — | | `xho` | Xhosa | xh | South Africa | `slr32_xho_0012` | | `yor` | Yoruba | yo | Nigeria, Benin | `waxal_yor_0002`
`waxal_yor_0006`
`waxal_yor_0008` | Per-language quality scales with the amount of training data Sunbird collected for that language; some configs have many more speaker hours than others. **Audition the test split** for each language before relying on a particular speaker — see the discovery snippet below. ## TL;DR ```python # After installing the dependencies (see "Inference" below) wav = synthesize("Mwattu, oli otya?", speaker_id="salt_lug_0001") # Luganda wav = synthesize("Habari yako rafiki.", speaker_id="salt_swa_0001") # Swahili wav = synthesize("Bawo ni, ọrẹ mi?", speaker_id="salt_yor_0001") # Yoruba ``` The model has no explicit "language" knob — the language identity travels via the speaker tag, since each `salt__` voice was recorded in exactly one language. --- ## Discovering speaker IDs The exact speaker_ids in each config can be enumerated from the dataset: ```python from collections import defaultdict from datasets import load_dataset, get_dataset_config_names CONFIGS = get_dataset_config_names("Sunbird/tts") # the 20 languages speakers_by_lang = defaultdict(set) for cfg in CONFIGS: ds = load_dataset("Sunbird/tts", cfg, split="train") for sid in ds["speaker_id"]: speakers_by_lang[cfg].add(sid) for cfg, sids in sorted(speakers_by_lang.items()): print(f"{cfg}: {len(sids)} speaker(s) — {sorted(sids)[:3]}{'...' if len(sids) > 3 else ''}") ``` Speaker IDs follow the pattern `salt__` (e.g., `salt_lug_0001`, `salt_ach_0007`). Pass any one of them as `speaker_id` to either inference function below. --- ## Inference The model wraps every prompt in a multi-speaker tagged format: ``` [SOH] + tokenize(": ") + [EOT, EOH] ``` and the model autoregressively emits Llama-3 special tokens followed by SNAC audio codes that decode to a 24 kHz waveform. Two reference implementations follow. ### Option A — `transformers` + `unsloth` (single request) Best for development, notebook-driven iteration, and small batch sizes. **Install:** ```bash pip install unsloth snac soundfile torchcodec "datasets>=3.4.1,<4.0.0" ``` **Run:** ```python import os import numpy as np import torch import soundfile as sf from unsloth import FastLanguageModel from snac import SNAC MODEL_ID = "sunbird/orpheus-3b-tts-multilingual" # Special tokens — must match the training format END_OF_TEXT = 128009 START_OF_SPEECH = 128257 END_OF_SPEECH = 128258 START_OF_HUMAN = 128259 END_OF_HUMAN = 128260 PAD_TOKEN = 128263 AUDIO_TOKEN_LO = 128266 AUDIO_TOKEN_HI = 128266 + 7 * 4096 # exclusive # 1) Load the LM (LoRA already merged into 16-bit weights at training time) model, tokenizer = FastLanguageModel.from_pretrained( model_name = MODEL_ID, max_seq_length = 4096, dtype = None, # auto bf16 / fp16 load_in_4bit = False, # set True to halve VRAM at slight quality cost token = os.environ.get("HF_TOKEN"), ) FastLanguageModel.for_inference(model) # 2) Load SNAC decoder (CPU is fine — frees GPU for the LM) snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").to("cpu") def _redistribute_codes(code_list: list[int]) -> torch.Tensor: layer_1, layer_2, layer_3 = [], [], [] for i in range(len(code_list) // 7): layer_1.append(code_list[7*i]) layer_2.append(code_list[7*i + 1] - 4096) layer_3.append(code_list[7*i + 2] - 2*4096) layer_3.append(code_list[7*i + 3] - 3*4096) layer_2.append(code_list[7*i + 4] - 4*4096) layer_3.append(code_list[7*i + 5] - 5*4096) layer_3.append(code_list[7*i + 6] - 6*4096) if not layer_1: return torch.zeros(1, 1, 12000) # ~0.5s silence fallback clamp = lambda vals: [max(0, min(4095, v)) for v in vals] codes = [torch.tensor(clamp(layer_1)).unsqueeze(0), torch.tensor(clamp(layer_2)).unsqueeze(0), torch.tensor(clamp(layer_3)).unsqueeze(0)] return snac_model.decode(codes) def synthesize(text: str, speaker_id: str, *, max_new_tokens: int = 1200, temperature: float = 0.6, top_p: float = 0.95, repetition_penalty: float = 1.1, seed: int | None = None) -> np.ndarray: """Synthesize speech for `text` in the voice of `speaker_id`. `speaker_id` must be one of the speakers seen during training, e.g. "salt_lug_0001" (Luganda) or "salt_swa_0003" (Swahili). """ if seed is not None: torch.manual_seed(seed) tagged = f"{speaker_id}: {text}" text_ids = tokenizer(tagged, return_tensors="pt").input_ids soh = torch.tensor([[START_OF_HUMAN]], dtype=torch.int64) end = torch.tensor([[END_OF_TEXT, END_OF_HUMAN]], dtype=torch.int64) input_ids = torch.cat([soh, text_ids, end], dim=1).to("cuda") attention_mask = torch.ones_like(input_ids) generated = model.generate( input_ids = input_ids, attention_mask = attention_mask, max_new_tokens = max_new_tokens, do_sample = True, temperature = temperature, top_p = top_p, repetition_penalty = repetition_penalty, eos_token_id = END_OF_SPEECH, use_cache = True, ) # Crop on last SOS, filter to audio token range, redistribute, decode sos_indices = (generated == START_OF_SPEECH).nonzero(as_tuple=True) cropped = generated[:, sos_indices[1][-1].item() + 1:] if len(sos_indices[1]) > 0 else generated row = cropped[0] audio_only = row[(row >= AUDIO_TOKEN_LO) & (row < AUDIO_TOKEN_HI)] n = (audio_only.size(0) // 7) * 7 code_list = [t.item() - AUDIO_TOKEN_LO for t in audio_only[:n]] waveform = _redistribute_codes(code_list) return waveform.detach().squeeze().to("cpu").numpy().astype(np.float32) # 3) Use it — pick a speaker per language wav = synthesize("Mwattu, Mukama yeebazibwe.", speaker_id="salt_lug_0001", seed=42) sf.write("luganda.wav", wav, 24000) wav = synthesize("Habari yako rafiki.", speaker_id="salt_swa_0001", seed=42) sf.write("swahili.wav", wav, 24000) ``` ### Option B — `vllm` (high throughput, batched, deployment) Best for serving traffic. PagedAttention + continuous batching gives roughly **5–10× faster** single-request latency and **10–100× higher** throughput on batched requests vs. the `transformers` path. Multi-speaker batching (different `speaker_id`s in one call) gets the full benefit. > **Important:** vLLM ships its own torch/transformers and conflicts > with Unsloth's pinned versions. Use a fresh Python environment for > vLLM serving — do not install on top of an Unsloth env. **Install:** ```bash pip install vllm snac soundfile torchcodec "datasets>=3.4.1,<4.0.0" ``` **Run:** ```python import os import numpy as np import torch import soundfile as sf from snac import SNAC from transformers import AutoTokenizer from vllm import LLM, SamplingParams MODEL_ID = "sunbird/orpheus-3b-tts-multilingual" END_OF_TEXT = 128009 START_OF_SPEECH = 128257 END_OF_SPEECH = 128258 START_OF_HUMAN = 128259 END_OF_HUMAN = 128260 AUDIO_TOKEN_LO = 128266 AUDIO_TOKEN_HI = 128266 + 7 * 4096 # 1) Load llm = LLM( model = MODEL_ID, dtype = "bfloat16", max_model_len = 4096, gpu_memory_utilization = 0.85, ) tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=os.environ.get("HF_TOKEN")) snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").to("cpu") def _build_prompt_token_ids(text: str, speaker_id: str) -> list[int]: tagged = f"{speaker_id}: {text}" text_ids = tokenizer.encode(tagged, add_special_tokens=True) return [START_OF_HUMAN] + text_ids + [END_OF_TEXT, END_OF_HUMAN] def _codes_to_waveform(generated_token_ids: list[int]) -> np.ndarray: ids = torch.tensor(generated_token_ids, dtype=torch.int64) sos_pos = (ids == START_OF_SPEECH).nonzero(as_tuple=True)[0] if len(sos_pos) > 0: ids = ids[sos_pos[-1].item() + 1:] audio = ids[(ids >= AUDIO_TOKEN_LO) & (ids < AUDIO_TOKEN_HI)] n = (audio.size(0) // 7) * 7 cl = [t.item() - AUDIO_TOKEN_LO for t in audio[:n]] l1, l2, l3 = [], [], [] for i in range(len(cl) // 7): l1.append(cl[7*i]) l2.append(cl[7*i+1] - 4096); l3.append(cl[7*i+2] - 2*4096) l3.append(cl[7*i+3] - 3*4096); l2.append(cl[7*i+4] - 4*4096) l3.append(cl[7*i+5] - 5*4096); l3.append(cl[7*i+6] - 6*4096) if not l1: return np.zeros(12000, dtype=np.float32) cb = lambda v: [max(0, min(4095, x)) for x in v] codes = [torch.tensor(cb(l1)).unsqueeze(0), torch.tensor(cb(l2)).unsqueeze(0), torch.tensor(cb(l3)).unsqueeze(0)] return snac_model.decode(codes).detach().squeeze().cpu().numpy().astype(np.float32) def synthesize(text: str, speaker_id: str, *, max_tokens: int = 1200, temperature: float = 0.6, top_p: float = 0.95, repetition_penalty: float = 1.1, seed: int | None = None) -> np.ndarray: sp = SamplingParams( temperature = temperature, top_p = top_p, repetition_penalty = repetition_penalty, max_tokens = max_tokens, stop_token_ids = [END_OF_SPEECH], skip_special_tokens = False, seed = seed, ) pids = _build_prompt_token_ids(text, speaker_id) out = llm.generate([{"prompt_token_ids": pids}], sp) return _codes_to_waveform(list(out[0].outputs[0].token_ids)) def synthesize_batch(items: list[dict], **kwargs) -> list[np.ndarray]: """items: list of {"text": str, "speaker_id": str} — different speakers can be mixed in one batch.""" sp = SamplingParams( temperature = kwargs.get("temperature", 0.6), top_p = kwargs.get("top_p", 0.95), repetition_penalty = kwargs.get("repetition_penalty", 1.1), max_tokens = kwargs.get("max_tokens", 1200), stop_token_ids = [END_OF_SPEECH], skip_special_tokens = False, seed = kwargs.get("seed"), ) prompts = [{"prompt_token_ids": _build_prompt_token_ids(it["text"], it["speaker_id"])} for it in items] outputs = llm.generate(prompts, sp) return [_codes_to_waveform(list(o.outputs[0].token_ids)) for o in outputs] # 2) Single — pick a speaker per language wav = synthesize("Mwattu, oli otya?", speaker_id="salt_lug_0001", seed=42) sf.write("luganda.wav", wav, 24000) # 3) Batched — different languages and speakers in one GPU pass items = [ {"text": "Mwattu, oli otya?", "speaker_id": "salt_lug_0001"}, {"text": "Habari yako rafiki.", "speaker_id": "waxal_swa_0006"}, {"text": "Bawo ni, ọrẹ mi?", "speaker_id": "waxal_yor_0002"}, {"text": "Sannu, ina kwana?", "speaker_id": "waxal_hau_0004"}, {"text": "Goeie môre, hoe gaan dit?", "speaker_id": "slr32_afr_0009"}, ] wavs = synthesize_batch(items, seed=123) for i, (it, w) in enumerate(zip(items, wavs)): sf.write(f"batch_{i:02d}_{it['speaker_id']}.wav", w, 24000) ``` ### Generation parameters | Param | Default | What it does | |---|---|---| | `temperature` | 0.6 | Lower = more deterministic, slightly flatter prosody. | | `top_p` | 0.95 | Nucleus sampling. Don't drop below 0.9 — produces robotic audio. | | `repetition_penalty` | 1.1 | Discourages stuck-on-one-frame artefacts. 1.0 disables it. | | `max_new_tokens` / `max_tokens` | 1200 | ≈ 9–10 s of audio. Raise for longer utterances. | | `seed` | `None` | Pass an int for reproducible output across runs. | --- ## Token format The tokenizer is Llama-3's, with Orpheus's audio-codebook special tokens laid out above the standard text vocabulary: | Token | ID | Purpose | |---|---|---| | `<\|begin_of_text\|>` | 128000 | Llama-3 BOS (auto-prepended by tokenizer) | | `<\|end_of_text\|>` | 128009 | end of human turn (text portion) | | `START_OF_SPEECH` | 128257 | model emits this just before audio codes | | `END_OF_SPEECH` | 128258 | model emits this when it finishes — used as `eos_token_id` / `stop_token_ids` | | `START_OF_HUMAN` | 128259 | wrap the text prompt | | `END_OF_HUMAN` | 128260 | wrap the text prompt | | `START_OF_AI` | 128261 | model emits this to begin its response | | `END_OF_AI` | 128262 | model emits this when fully done | | `PAD_TOKEN` | 128263 | left-padding for batched generation | | audio codebook | 128266 + N·4096 | SNAC codes, N ∈ {0..6} for 7-frame layout | **Training prompt structure** (and what the model expects at inference): ``` [SOH] + tokenize("salt__: ") + [EOT] + [EOH] ↳ model autoregressively emits: [SOA] + [SOS] + audio_codes... + [EOS] + [EOA] ``` To recover audio: find the **last** `START_OF_SPEECH` (128257) in the output, take everything after it, drop any token outside the audio codebook range, group into 7-token frames, undo the per-position offsets, and feed the three layers to `SNAC.decode`. Both inference snippets above implement this end-to-end. --- ## Training details | Setting | Value | |---|---| | Base model | `unsloth/orpheus-3b-0.1-pretrained` (raw pretrained, not the `-ft` voice-actor variant) | | Adapter | LoRA r=64, α=64, dropout=0, bias=none | | Target modules | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` | | Optimizer | `adamw_8bit`, weight decay 0.001 | | LR schedule | linear, lr=2e-4, warmup steps=5 | | Per-device batch size | 1 (with `gradient_accumulation_steps=4`, effective batch = 4) | | Epochs | 3 | | `max_seq_length` | 4096 | | `save_total_limit` | 2 | | Precision | bfloat16 weights, 16-bit LoRA | | Seed | 3407 | | Hardware | single NVIDIA RTX 4090 (24 GB) | | Gradient checkpointing | Unsloth's optimised variant | | Final save | LoRA merged into 16-bit weights via `save_pretrained_merged(save_method="merged_16bit")` | The pretrained variant of Orpheus was chosen over the `-ft` voice-actor variant because that variant has a strong English-voice-actor prior that fights low-resource-language fine-tuning. ### Data prep summary 1. Load all 20 configs of `Sunbird/tts` (`get_dataset_config_names`) and `concatenate_datasets` their `train` and `test` splits into one training set and one held-out evaluation set. **No** speaker filter. 2. Tag each row with `source = example["speaker_id"]` (per-row, not constant) — the model learns the multi-speaker prompt format `f"{speaker_id}: {text}"` across every speaker it sees. 3. Cast `audio` to 24 kHz via `Audio(sampling_rate=24000)`. 4. Drop rows whose tokenised text alone exceeds `max_seq_length` — saves expensive SNAC encoding on rows that would be filtered out downstream. 5. Encode each remaining audio clip with `hubertsiuzdak/snac_24khz` → 7 codes per frame, flattened with per-layer offsets `(+128266, +4096, +2·4096, …)`. 6. Filter out rows with empty/None codes; drop consecutive duplicate frames. 7. Build `input_ids = [SOH] + text_ids + [EOT] + [EOH] + [SOA] + [SOS] + audio_codes + [EOS] + [EOA]`. 8. Drop rows whose total tokenised length exceeds `max_seq_length` (safety net for rows where text fits but text + audio together overflow the budget). --- ## Evaluation Quality was evaluated qualitatively on a diverse held-out test sample: during training, up to 10 utterances are pulled from `ds_test.shuffle(seed=42)` covering as many distinct speaker_ids as possible. Generated audio is saved next to the ground-truth recording under `inference_samples/sample__.wav` so each language / voice combination can be auditioned individually. We did **not** run automated metrics (WER on a downstream STT, MOS prediction, language-confusion eval, etc.) for this release. Numbers will be added if/when those become part of the evaluation pipeline. **Important caveat — quality varies by language.** The training corpus is unbalanced across the 20 configs; languages with more speaker hours in `Sunbird/tts` get more training signal and produce more natural speech. Audition the per-language samples before relying on a specific voice for production traffic. --- ## Intended uses & out-of-scope **Intended:** - Multilingual voice synthesis for accessibility, language learning, human–computer interaction, audio content creation, and downstream speech research on the 20 covered languages. - A reference checkpoint for the Sunbird/tts → Orpheus-3B multilingual fine-tuning pipeline; reproducible training recipe in [`Orpheus_3B_Sunbird_Multilingual.ipynb`](https://github.com/SunbirdAI/Qwen3-TTS/blob/main/orpheus-3B/Orpheus_3B_Sunbird_Multilingual.ipynb). **Out of scope:** - **Voice impersonation / deception.** The model imitates the timbres of consenting Sunbird voice donors. Do not use the generated audio to impersonate identifiable real persons or to produce content that could mislead listeners about who is speaking. - **High-stakes decisions.** Generated speech may contain pronunciation errors, prosodic artefacts, or hallucinated phrases — do not deploy in safety-critical contexts (medical, legal, emergency) without human review. - **Languages outside the 20 configs.** The model has no signal for languages not present in `Sunbird/tts`; sending German text to any speaker will produce garbled output, not "German with a Luganda accent". - **Code-switching.** Each speaker_id was recorded in a single language; the model has not seen mixed-language utterances and will likely produce phonetic artefacts at language boundaries within one prompt. - **Cross-language voice transfer.** Sending Acholi text to `salt_lug_0001` (a Luganda speaker_id) is undefined behaviour. The model has no language-conditioning input separate from the speaker tag, so language identity travels via the speaker_id. Use a speaker whose `salt__NNNN` prefix matches the language of your text. --- ## Limitations & risks - **Quality varies by language.** Per-language data volume in `Sunbird/tts` is unbalanced. Languages with fewer hours produce noticeably less natural speech. Run the per-language test-split audit (script below) before committing to a particular voice. - **No language conditioning.** There is no `language` token; the model relies entirely on the speaker_id to disambiguate. Mismatching speaker_id and text language is undefined behaviour (see above). - **Vocabulary coverage.** Limited to the lexicon present in each config's training subset. Unfamiliar words, code-switching, and out-of-distribution proper nouns may produce artefacts. - **Long utterances.** The model was trained on utterances up to ~16 s of audio (`max_seq_length=4096`). Generation may degrade or truncate beyond ~10 s of speech. - **Sampling variance.** With `do_sample=True`, identical prompts can produce noticeably different deliveries between runs. Pass `seed=` for reproducibility. - **No emotion/style control.** Unlike the upstream `orpheus-3b-0.1-ft`, this fine-tune was not exposed to in-text emotion tags (``, ``, …). Such tags will be tokenised as ordinary text and produce no special prosodic effect. - **Bias.** Inherits any biases present in the Sunbird/tts corpus and in Llama-3's pretraining; we have not audited these systematically per language. ### Quick per-language audit script ```python from datasets import load_dataset, Audio, get_dataset_config_names import soundfile as sf from pathlib import Path CONFIGS = get_dataset_config_names("Sunbird/tts") out_dir = Path("language_audit"); out_dir.mkdir(exist_ok=True) for cfg in CONFIGS: ds = load_dataset("Sunbird/tts", cfg, split="test") ds = ds.cast_column("audio", Audio(sampling_rate=24000)) row = ds[0] sid, text = row["speaker_id"], row["text"] print(f"{cfg}: {sid} -> {text[:80]}") wav = synthesize(text, speaker_id=sid, seed=0) sf.write(out_dir / f"{cfg}_{sid}.wav", wav, 24000) sf.write(out_dir / f"{cfg}_{sid}_groundtruth.wav", row["audio"]["array"], 24000) ``` --- ## Hardware requirements | Mode | Min VRAM | Recommended | |---|---|---| | `transformers` + Unsloth, fp16 | 8 GB (with `load_in_4bit=True`) | 16 GB | | `transformers` + Unsloth, bf16 | 14 GB | 24 GB | | vLLM, bf16, `max_model_len=4096` | 14 GB | 24 GB | Audio decoding via SNAC runs on CPU and adds ~50–150 ms per utterance. --- ## License & attribution This fine-tune is released under **Apache-2.0**, matching the upstream [`unsloth/orpheus-3b-0.1-pretrained`](https://huggingface.co/unsloth/orpheus-3b-0.1-pretrained) license. It transitively inherits obligations from: - The [Orpheus-TTS](https://github.com/canopyai/Orpheus-TTS) project (CanopyAI). - The [Llama-3](https://llama.meta.com/llama3/) base architecture and weights — Meta Llama 3 Community License. - The [SNAC](https://github.com/hubertsiuzdak/snac) audio codec (Hubert Siuzdak, MIT). - The [`Sunbird/tts`](https://huggingface.co/datasets/Sunbird/tts) dataset and the SALT voice donors who contributed recordings. If you redistribute the merged weights, please carry these attributions forward. --- ## Citation If you use this model in your work, please cite both the dataset and the fine-tuning project: ```bibtex @misc{sunbird_orpheus3b_multilingual_2026, title = {Orpheus-3B Sunbird Multilingual TTS}, author = {Sunbird AI}, year = {2026}, howpublished = {\url{https://huggingface.co/sunbird/orpheus-3b-tts-multilingual}}, } @misc{sunbird_tts_dataset, title = {Sunbird Speech Dataset}, author = {Sunbird AI}, howpublished = {\url{https://huggingface.co/datasets/Sunbird/tts}}, } @misc{orpheus_tts_2025, title = {Orpheus-TTS}, author = {Canopy Labs}, year = {2025}, howpublished = {\url{https://github.com/canopyai/Orpheus-TTS}}, } ``` --- ## Single-speaker variant If you only need one specific voice and want a smaller, more focused checkpoint, see [`sunbird/orpheus-3b-tts-salt-lug-0001`](https://huggingface.co/sunbird/orpheus-3b-tts-salt-lug-0001) — same recipe, scoped to a single Luganda speaker.