---
license: other
library_name: pytorch
pipeline_tag: text-to-speech
base_model: MisoLabs/MisoTTS
base_model_relation: quantized
tags:
  - text-to-speech
  - speech-synthesis
  - voice
  - audio
  - sesame
  - mimi
  - llama
  - quantized
  - torchao
  - int4
  - 4-bit
  - weight-only
  - base_model:quantized:MisoLabs/MisoTTS
---

# MisoTTS 8B TorchAO INT4 Weight-Only Quantization

This repository contains a 4-bit TorchAO weight-only quantization of
[`MisoLabs/MisoTTS`](https://huggingface.co/MisoLabs/MisoTTS), packaged so it can be loaded without first materializing the full 32 GB F32 checkpoint.

- **Base model:** `MisoLabs/MisoTTS`
- **Quantization:** TorchAO `Int4WeightOnlyConfig(group_size=128)`
- **Runtime format:** `torch.save` checkpoint containing TorchAO quantized tensor subclasses
- **Tested GPU:** RTX 3060 12 GB
- **Tokenizer:** upstream default `meta-llama/Llama-3.2-1B`
- **Language:** English, following the base model

No private prompt voice is included. Voice continuation/cloning requires user-supplied prompt audio and its transcript.

## Why this exists

The upstream MisoTTS checkpoint is about 32 GB in F32, and the default loader materializes those full-precision weights. This quantized variant targets consumer GPUs with around 12 GB of VRAM. It has been smoke-tested locally on an RTX 3060 12 GB with both short and longer expressive generations.
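
To see roughly why 4-bit weights fit on a 12 GB card, the arithmetic can be sketched as follows. This is a back-of-the-envelope estimate, not a measurement. It assumes a round 8 billion parameters, treats every parameter as a quantized linear weight (in practice some tensors, such as embeddings, may stay at higher precision), and assumes one 2-byte scale plus one 2-byte zero point per group of 128 weights. The function and constant names are illustrative and not part of this repository.

```python
# Back-of-the-envelope weight-memory estimate (assumptions, not measurements).
PARAMS = 8_000_000_000          # assumed round parameter count for an "8B" model
GROUP_SIZE = 128                # matches Int4WeightOnlyConfig(group_size=128)
BYTES_PER_GROUP_METADATA = 4    # assumed: 2-byte scale + 2-byte zero point per group


def f32_weight_bytes(params: int) -> int:
    """Four bytes per parameter at full F32 precision."""
    return params * 4


def int4_weight_bytes(params: int, group_size: int = GROUP_SIZE) -> int:
    """Half a byte per weight plus per-group scale/zero-point metadata."""
    packed = params // 2                      # two 4-bit codes per byte
    groups = params // group_size             # one metadata record per group
    return packed + groups * BYTES_PER_GROUP_METADATA


f32_gb = f32_weight_bytes(PARAMS) / 1e9
int4_gb = int4_weight_bytes(PARAMS) / 1e9
print(f"F32 weights:  {f32_gb:.2f} GB")
print(f"INT4 weights: {int4_gb:.2f} GB")
```

Under these assumptions the weights shrink from about 32 GB to a little over 4 GB, which leaves headroom for activations and other runtime buffers. Real memory use will be higher than this estimate.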

## Install

Use Python 3.10 and the same dependency family as upstream MisoTTS. A practical setup is:

```bash
git clone https://huggingface.co/droyster/MisoTTS-8B-torchao-int4
cd MisoTTS-8B-torchao-int4
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

By default the loader follows upstream MisoTTS and uses the `meta-llama/Llama-3.2-1B` tokenizer. That repo is gated, so you need a Hugging Face account that has been granted access to Meta's Llama 3.2 tokenizer repo, plus a token for that account available in your environment.
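
Since the tokenizer download needs an authenticated account, it can help to check for a token before starting a long load. The sketch below only looks at the `HF_TOKEN` and `HUGGING_FACE_HUB_TOKEN` environment variables, which `huggingface_hub` reads. It does not detect a token cached by an interactive login, and it cannot tell whether the account actually has access to the Llama repo. `find_hf_token` is a hypothetical helper, not part of this repository.

```python
import os
from typing import Mapping, Optional

# Environment variables that huggingface_hub reads for an access token.
# HF_TOKEN is the current name; HUGGING_FACE_HUB_TOKEN is the older one.
TOKEN_ENV_VARS = ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN")


def find_hf_token(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the first non-empty token found in the environment, else None."""
    env = os.environ if env is None else env
    for name in TOKEN_ENV_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    return None


if __name__ == "__main__":
    if find_hf_token() is None:
        print("No Hugging Face token found in the environment; "
              "the tokenizer download may fail.")
    else:
        print("Hugging Face token found in the environment.")
```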

## Quick smoke test

```bash
python scripts/smoke_test.py \
  --repo-id droyster/MisoTTS-8B-torchao-int4 \
  --output smoke.wav \
  --disable-watermark
```

`--disable-watermark` is recommended on 12 GB GPUs for longer local evaluation runs, because SilentCipher watermarking can add enough memory pressure to trigger an out-of-memory (OOM) error.

## Python usage

```python
import torchaudio
from load_quantized import load_miso_8b_torchao_int4

# disable_watermark=True is useful on 12 GB GPUs for long generations.
generator = load_miso_8b_torchao_int4(
    "droyster/MisoTTS-8B-torchao-int4",
    device="cuda",
    disable_watermark=True,
)

audio = generator.generate(
    text="Hello from the four bit TorchAO quantized Miso TTS model.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
    temperature=0.8,
    topk=40,
)

torchaudio.save("miso_int4.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```

## Prompted voice continuation

This repo does **not** include any voice prompt audio. To condition on a user-supplied voice, pass context segments exactly as upstream MisoTTS does:

```python
import torchaudio
from generator import Segment
from load_quantized import load_miso_8b_torchao_int4

generator = load_miso_8b_torchao_int4("droyster/MisoTTS-8B-torchao-int4", device="cuda")

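# torchaudio.load returns a (channels, samples) tensor and the file's sample rate.
prompt_audio, sr = torchaudio.load("prompt.wav")
# Downmix to mono by averaging across channels.
prompt_audio = prompt_audio.mean(dim=0)
# Resample only if the prompt does not already match the generator's rate.
if sr != generator.sample_rate:
    prompt_audio = torchaudio.functional.resample(
        prompt_audio, orig_freq=sr, new_freq=generator.sample_rate
    )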

context = [Segment(
    speaker=0,
    text="Transcript of the prompt audio goes here.",
    audio=prompt_audio,
)]

audio = generator.generate(
    text="The next sentence to synthesize.",
    speaker=0,
    context=context,
    max_audio_length_ms=10_000,
)
```

## Known limitations

- Long generations can drift from a short voice prompt; use longer/better prompt context for stronger voice adherence.
- SilentCipher watermarking may cause out-of-memory (OOM) errors on 12 GB GPUs during longer generations; pass `disable_watermark=True` for local evaluation if needed.
- This is a TorchAO/PyTorch runtime checkpoint, not GGUF/AWQ/GPTQ/EXL2.
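- Because TorchAO quantized tensor subclasses are serialized with `torch.save`, loading uses `weights_only=False`. That setting unpickles arbitrary Python objects, so only load checkpoint files from sources you trust.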
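
Because `weights_only=False` means the checkpoint is unpickled in full, one optional precaution is to compare the file's SHA-256 digest against a value obtained from a source you trust before loading it. The helpers below are a minimal, standard-library sketch of that check. They are hypothetical, not part of this repository's loader, and a matching digest only shows the file is the one you expected, not that it is safe.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large checkpoint is never read into RAM whole."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex: str) -> bool:
    """Compare the file's SHA-256 with a digest obtained from a trusted source."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_expected("model_int4_torchao.pt", trusted_digest)` could be called before loading, where `trusted_digest` is a placeholder for a value you supply.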

## Reproducing the quantization

```bash
python scripts/export_int4.py \
  --source MisoLabs/MisoTTS \
  --output model_int4_torchao.pt \
  --group-size 128
```

The exporter streams the upstream `model.safetensors`, quantizes linear weights one at a time on CUDA, and saves the resulting quantized state dict.
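
To illustrate what group-wise 4-bit weight quantization means, here is a conceptual sketch in plain Python. It is not the exporter's code and does not reproduce TorchAO's kernels or packed tensor layout. It only shows the idea behind `--group-size 128`: each group of consecutive weights shares one scale and one offset, and each weight is stored as a code from 0 to 15. All function names are illustrative.

```python
def quantize_group(values):
    """Map one group of floats to 4-bit codes (0..15) with a shared scale and offset."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0          # constant group: any non-zero scale works
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Reconstruct approximate floats from the 4-bit codes."""
    return [lo + code * scale for code in codes]


def quantize_row(row, group_size=128):
    """Quantize a row of weights in consecutive groups of group_size."""
    return [
        quantize_group(row[start:start + group_size])
        for start in range(0, len(row), group_size)
    ]


if __name__ == "__main__":
    row = [((i * 37) % 101 - 50) / 100 for i in range(256)]   # toy weights
    groups = quantize_row(row)
    restored = [x for g in groups for x in dequantize_group(*g)]
    worst = max(abs(a - b) for a, b in zip(row, restored))
    print(f"{len(groups)} groups, worst-case error {worst:.4f}")
```

In this scheme the reconstruction error for any weight is at most half of its group's scale, which is why smaller groups generally track the original weights more closely at the cost of more per-group metadata.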

## License

The upstream model is marked `license: other` and includes the Modified MIT License from Miso Labs/Kamino Learning, Inc. The original license text is included in this repository. This quantized checkpoint is a derivative of `MisoLabs/MisoTTS`; follow the upstream license terms.