VibeVoice-Realtime-0.5B-MLX-INT8
INT8-quantized MLX bundle of Microsoft VibeVoice-Realtime-0.5B for Apple Silicon, ready to load with the VibeVoiceTTS Swift module from soniqo/speech-swift.
INT8 is the middle-ground option: it has more quality headroom than INT4 and is smaller and faster than BF16. For most use cases, INT4 is the right pick.
What's in the box
- model.safetensors: INT8 group-quantized Qwen2 backbone (group_size=32, mode=affine); the tokenizer, acoustic tokenizer, diffusion head, and EOS classifier are kept in their source dtype
- quantization.json: per-layer manifest (244 quantized layers)
- config.json, preprocessor_config.json: copied from upstream
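To illustrate what group_size=32 and mode=affine mean, here is a minimal sketch of generic affine group quantization to 8 bits. It is not the MLX kernel or the conversion script used for this bundle. The names `QuantizedGroup`, `quantizeAffine`, and `dequantizeAffine` are hypothetical, and the min/max choice of scale and bias is an assumption made for illustration.

```swift
import Foundation

/// One quantized group: 8-bit codes plus the scale and bias needed to decode them.
/// (Hypothetical type, for illustration only.)
struct QuantizedGroup {
    let codes: [UInt8]
    let scale: Float
    let bias: Float
}

/// Quantizes `weights` in consecutive groups of `groupSize` values.
/// Each group gets its own scale and bias, so that w ≈ scale * code + bias.
func quantizeAffine(_ weights: [Float], groupSize: Int = 32) -> [QuantizedGroup] {
    precondition(groupSize > 0 && weights.count % groupSize == 0,
                 "weights.count must be a multiple of groupSize")
    let levels: Float = 255  // 2^8 - 1 steps between the group minimum and maximum
    var groups: [QuantizedGroup] = []
    for start in stride(from: 0, to: weights.count, by: groupSize) {
        let group = weights[start..<(start + groupSize)]
        let lo = group.min()!
        let hi = group.max()!
        // A constant group has zero range; use scale 1 to avoid dividing by zero.
        let scale: Float = hi > lo ? (hi - lo) / levels : 1
        let codes = group.map { w -> UInt8 in
            let q = ((w - lo) / scale).rounded()
            return UInt8(min(max(q, 0), levels))
        }
        groups.append(QuantizedGroup(codes: codes, scale: scale, bias: lo))
    }
    return groups
}

/// Reconstructs approximate weights from the quantized groups.
func dequantizeAffine(_ groups: [QuantizedGroup]) -> [Float] {
    groups.flatMap { g in g.codes.map { Float($0) * g.scale + g.bias } }
}

// Round-trip 64 synthetic weights (two groups of 32) and measure the worst error.
let weights: [Float] = (0..<64).map { Float($0) * 0.01 - 0.3 }
let groups = quantizeAffine(weights)
let restored = dequantizeAffine(groups)
let maxError = zip(weights, restored).map { abs($0 - $1) }.max()!
```

Because each group of 32 weights carries its own scale and bias, the rounding error within a group is bounded by about half that group's scale, so one outlier only degrades the precision of its own group.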
Bundle size: 1.42 GB.
Performance (Apple M2 Max, 64 GB)
| Steps | Audio | Elapsed | RTF | RTFx |
|---|---|---|---|---|
| 10 | 1.20 s | 0.64 s | 0.53 | 1.88× |
At 1.88×, INT8 sits between BF16 (1.48×) and INT4 (2.31×).
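The RTF and RTFx columns follow from the Audio and Elapsed columns. The sketch below assumes the usual definitions (RTF is elapsed time divided by audio duration, and RTFx is its inverse), which match the figures in the table. The function names are hypothetical.

```swift
/// Real-time factor: seconds of compute per second of audio (lower is faster).
func realTimeFactor(elapsed: Double, audio: Double) -> Double {
    elapsed / audio
}

/// Inverse real-time factor: seconds of audio per second of compute (higher is faster).
func inverseRealTimeFactor(elapsed: Double, audio: Double) -> Double {
    audio / elapsed
}

// Values from the table row: 1.20 s of audio generated in 0.64 s.
let rtf = realTimeFactor(elapsed: 0.64, audio: 1.20)          // about 0.533
let rtfx = inverseRealTimeFactor(elapsed: 0.64, audio: 1.20)  // about 1.875
```

An RTFx above 1 means audio is generated faster than it plays back.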
Use it
Swift / iOS / macOS
import VibeVoiceTTS
// Select this INT8 bundle by its model ID.
var config = VibeVoiceTTSModel.Configuration()
config.modelId = "aufklarer/VibeVoice-Realtime-0.5B-MLX-INT8"
let tts = try await VibeVoiceTTSModel.fromPretrained(configuration: config)
// Load a voice cache, then synthesize speech from text.
try tts.loadVoice(from: "voice_cache/en-Mike_man.safetensors")
let pcm = try await tts.generate(text: "Hello world.")
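This card does not state the type or sample rate of the value returned by `generate`. As a rough sketch, assuming it can be converted to mono `[Float]` samples in the range -1...1, the hypothetical helper below (`wavData`, not part of speech-swift) encodes such samples as a 16-bit PCM WAV file in memory. The 24 kHz sample rate in the example is also an assumption.

```swift
import Foundation

/// Encodes mono Float samples in -1...1 as 16-bit PCM WAV data.
/// Hypothetical helper for illustration; not part of the VibeVoiceTTS module.
func wavData(samples: [Float], sampleRate: UInt32) -> Data {
    let bytesPerSample: UInt32 = 2
    let dataSize = UInt32(samples.count) * bytesPerSample
    var data = Data()

    // WAV stores all integer fields in little-endian byte order.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }

    data.append(contentsOf: Array("RIFF".utf8))
    append(UInt32(36) + dataSize)        // size of everything after this field
    data.append(contentsOf: Array("WAVE".utf8))
    data.append(contentsOf: Array("fmt ".utf8))
    append(UInt32(16))                   // fmt chunk size for plain PCM
    append(UInt16(1))                    // audio format: 1 = PCM
    append(UInt16(1))                    // channel count: mono
    append(sampleRate)
    append(sampleRate * bytesPerSample)  // byte rate
    append(UInt16(bytesPerSample))       // block align
    append(UInt16(16))                   // bits per sample
    data.append(contentsOf: Array("data".utf8))
    append(dataSize)

    for sample in samples {
        // Clamp to the valid range before scaling to Int16.
        let clamped = max(-1, min(1, sample))
        append(Int16((clamped * 32767).rounded()))
    }
    return data
}

// Example with four synthetic samples at an assumed 24 kHz sample rate.
let wav = wavData(samples: [0, 0.5, -0.5, 1], sampleRate: 24_000)
```

The resulting `Data` can be saved with `wav.write(to:)`. The header is 44 bytes, followed by two bytes per sample.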
CLI (audio from speech-swift)
audio vibevoice "Hello world." \
--model aufklarer/VibeVoice-Realtime-0.5B-MLX-INT8 \
--voice-cache voice_cache/en-Mike_man.safetensors \
--output hello.wav
Voice caches
Voice caches are the same as for the INT4 bundle: MIT-licensed examples are available at mzbac/vibevoice.swift/voice_cache, or you can create your own with audio vibevoice-encode-voice.
Languages
English and Chinese only.
License
MIT, inherited from the upstream Microsoft VibeVoice repo.
Reproduction
Converted with models/vibevoice/export/convert.py from soniqo/speech-models (private repository), run with --bits 8.
Citation
@misc{microsoft_vibevoice,
title = {VibeVoice: Long-form, Multi-speaker Text-to-Speech},
author = {Microsoft Research},
year = {2025},
url = {https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B}
}