---
license: other
library_name: pytorch
pipeline_tag: text-to-speech
base_model: MisoLabs/MisoTTS
base_model_relation: quantized
tags:
- text-to-speech
- speech-synthesis
- voice
- audio
- sesame
- mimi
- llama
- quantized
- torchao
- int4
- 4-bit
- weight-only
- base_model:quantized:MisoLabs/MisoTTS
---

# MisoTTS 8B TorchAO INT4 Weight-Only Quantization

This repository contains a 4-bit TorchAO weight-only quantization of [`MisoLabs/MisoTTS`](https://huggingface.co/MisoLabs/MisoTTS), packaged so it can be loaded without first materializing the full 32 GB F32 checkpoint.

- **Base model:** `MisoLabs/MisoTTS`
- **Quantization:** TorchAO `Int4WeightOnlyConfig(group_size=128)`
- **Runtime format:** `torch.save` checkpoint containing TorchAO quantized tensor subclasses
- **Tested GPU:** RTX 3060 12 GB
- **Tokenizer:** upstream default `meta-llama/Llama-3.2-1B`
- **Language:** English, following the base model

No private prompt voice is included. Voice continuation/cloning requires user-supplied prompt audio and transcript.

## Why this exists

The upstream MisoTTS checkpoint is large and the default loader materializes F32 weights. This quantized variant targets consumer GPUs around 12 GB VRAM. It has been smoke-tested locally on an RTX 3060 using short and longer expressive generations.

## Install

Use Python 3.10 and the same dependency family as upstream MisoTTS. A practical setup is:

```bash
git clone https://huggingface.co/droyster/MisoTTS-8B-torchao-int4
cd MisoTTS-8B-torchao-int4
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

The loader uses upstream MisoTTS tokenizer behavior by default: `meta-llama/Llama-3.2-1B`. This requires a Hugging Face token/account that has access to Meta's Llama 3.2 tokenizer repo.

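
The token requirement above can be checked before loading the model. The following is a minimal sketch using only the standard library; it is not part of this repository. `find_hf_token` is a hypothetical helper, and the lookup order (the `HF_TOKEN` and legacy `HUGGING_FACE_HUB_TOKEN` environment variables, then a `token` file under `HF_HOME`, defaulting to `~/.cache/huggingface`) is an assumption based on the usual `huggingface_hub` defaults. Other overrides, such as a custom token path, are not handled.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_hf_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a Hugging Face token from the environment or token file, if any.

    Hypothetical helper: the lookup order is an assumption, not this repo's API.
    """
    # 1. Environment variables (HF_TOKEN is the current name, the other is legacy).
    for name in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        value = env.get(name, "").strip()
        if value:
            return value
    # 2. Token file written at login; HF_HOME relocates the cache root.
    default_home = Path.home() / ".cache" / "huggingface"
    hf_home = Path(env.get("HF_HOME") or default_home).expanduser()
    token_file = hf_home / "token"
    if token_file.is_file():
        return token_file.read_text().strip() or None
    return None


if __name__ == "__main__":
    if find_hf_token() is None:
        print("No Hugging Face token found; log in or set HF_TOKEN first.")
    else:
        print("Token found; the account still needs access to the Llama 3.2 repo.")
```

Finding a token only shows that credentials are configured. It does not confirm that the account has been granted access to `meta-llama/Llama-3.2-1B`, so a tokenizer download can still fail with an authorization error.
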
## Quick smoke test

```bash
python scripts/smoke_test.py \
  --repo-id droyster/MisoTTS-8B-torchao-int4 \
  --output smoke.wav \
  --disable-watermark
```

`--disable-watermark` is recommended on 12 GB GPUs for longer local evaluation runs because SilentCipher watermarking can add enough memory pressure to OOM.

## Python usage

```python
import torchaudio
from load_quantized import load_miso_8b_torchao_int4

# disable_watermark=True is useful on 12 GB GPUs for long generations.
generator = load_miso_8b_torchao_int4(
    "droyster/MisoTTS-8B-torchao-int4",
    device="cuda",
    disable_watermark=True,
)

audio = generator.generate(
    text="Hello from the four bit TorchAO quantized Miso TTS model.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
    temperature=0.8,
    topk=40,
)

torchaudio.save("miso_int4.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```

## Prompted voice continuation

This repo does **not** include any voice prompt audio. To condition on a user-supplied voice, pass context segments exactly as upstream MisoTTS does:

```python
import torchaudio
from generator import Segment
from load_quantized import load_miso_8b_torchao_int4

generator = load_miso_8b_torchao_int4("droyster/MisoTTS-8B-torchao-int4", device="cuda")

prompt_audio, sr = torchaudio.load("prompt.wav")
prompt_audio = prompt_audio.mean(dim=0)
if sr != generator.sample_rate:
    prompt_audio = torchaudio.functional.resample(prompt_audio, sr, generator.sample_rate)

context = [Segment(
    speaker=0,
    text="Transcript of the prompt audio goes here.",
    audio=prompt_audio,
)]

audio = generator.generate(
    text="The next sentence to synthesize.",
    speaker=0,
    context=context,
    max_audio_length_ms=10_000,
)
```

## Known limitations

- Long generations can drift from a short voice prompt; use longer/better prompt context for stronger voice adherence.
- SilentCipher watermarking may OOM on 12 GB GPUs during longer generations; use `disable_watermark=True` for local evaluation if needed.
- This is a TorchAO/PyTorch runtime checkpoint, not GGUF/AWQ/GPTQ/EXL2.
- Because TorchAO quantized tensor subclasses are serialized with `torch.save`, loading uses `weights_only=False`.

## Reproducing the quantization

```bash
python scripts/export_int4.py \
  --source MisoLabs/MisoTTS \
  --output model_int4_torchao.pt \
  --group-size 128
```

The exporter streams the upstream `model.safetensors`, quantizes linear weights one at a time on CUDA, and saves the resulting quantized state dict.

## License

The upstream model is marked `license: other` and includes the Modified MIT License from Miso Labs/Kamino Learning, Inc. The original license text is included in this repository. This quantized checkpoint is a derivative of `MisoLabs/MisoTTS`; follow the upstream license terms.