VibeThinker-3B — hipfire quantized (MQ4 + MQ6)
FWHT-rotated quantizations of WeiboAI/VibeThinker-3B for hipfire, a Rust-native inference engine for AMD RDNA GPUs.
Files
| File | Format | Size | Quality |
|---|---|---|---|
| vibethinker-3b.mq4.hfq | MQ4G256 (4-bit FWHT) | 1.8 GB | good |
| vibethinker-3b.mq6.hfq | MQ6G256 (6-bit FWHT) | 2.4 GB | better |
MQ4 is the default pick: fastest decode, smallest size.
MQ6 is worth it if you have VRAM to spare and want quality closer to the original model.
Usage
# pull and run (hipfire auto-selects the MQ4 file)
hipfire run xfivetide/VibeThinker-3B-hipfire "Your prompt here"
# or pick a specific quant explicitly
hipfire run xfivetide/VibeThinker-3B-hipfire/vibethinker-3b.mq6.hfq "Your prompt here"
Performance (gfx1151 / Strix Halo, AMD RDNA 3.5)
| Format | Size | Decode speed |
|---|---|---|
| BF16 safetensors (original) | 5.8 GB | ~8 tok/s |
| MQ4 (this repo) | 1.8 GB | ~36–90 tok/s |
| MQ6 (this repo) | 2.4 GB | ~25–60 tok/s |
Decode speed drops as context length grows, because per-token attention cost scales with KV-cache size.
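As a rough cross-check of the Size column, the sketch below estimates the effective bits per weight of each quant. It assumes decimal gigabytes and infers the parameter count from the BF16 file size (2 bytes per weight); neither figure is stated by the model card, so treat the results as approximate.

```python
# Sizes come from the table above; the parameter count is inferred, not quoted.
GB = 1e9  # assumption: decimal gigabytes

bf16_bytes = 5.8 * GB
params = bf16_bytes / 2  # BF16 stores 2 bytes per weight


def effective_bits_per_weight(file_gb: float) -> float:
    """Average bits spent per parameter, including scales and unquantized tensors."""
    return file_gb * GB * 8 / params


mq4_bits = effective_bits_per_weight(1.8)
mq6_bits = effective_bits_per_weight(2.4)

print(f"inferred parameters: {params / 1e9:.1f}B")
print(f"MQ4: {mq4_bits:.2f} bits/weight, MQ6: {mq6_bits:.2f} bits/weight")
```

Both estimates land above the nominal 4 and 6 bits. That is consistent with per-group scales and with tensors kept at higher precision, though the exact split is not something this arithmetic can show.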
Quantization commands
hipfire-quantize \
--input WeiboAI/VibeThinker-3B \
--output vibethinker-3b.mq4.hfq \
--format mq4 \
--arch-id 7
hipfire-quantize \
--input WeiboAI/VibeThinker-3B \
--output vibethinker-3b.mq6.hfq \
--format mq6 \
--arch-id 7
Note: --arch-id 7 is required. The quantizer auto-detects qwen2 as id=1 (llama), which silently drops Q/K/V attention biases at load time.
Format details
Both formats use MagnumQuant (MQ) with FWHT rotation (fast Walsh-Hadamard transform, group size 256). The FWHT equalizes weight magnitudes across groups before quantization, which reduces per-group error compared with plain INT4/INT6.
- Norms, biases, and embeddings are stored as F16/Q8F16 (not quantized to 4/6-bit)
- arch_id=7 routes to hipfire-arch-qwen2, which correctly loads Qwen2's Q/K/V attention biases
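To illustrate why a Walsh-Hadamard rotation can help, here is a minimal sketch in plain Python. It is not hipfire's kernel or the MQ bit layout: it assumes the rotation is applied within one 256-weight group, uses a simple symmetric round-to-nearest quantizer with a single scale per group, and runs on synthetic weights with one deliberate outlier. All function names are hypothetical.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform; applying it twice returns the input."""
    y = list(values)
    n = len(y)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in y]


def quantize_group(values, bits):
    """Symmetric round-to-nearest quantization with one scale for the whole group."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:
        return list(values)
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
group = [rng.gauss(0.0, 0.02) for _ in range(256)]  # synthetic weights
group[17] = 1.0  # a single outlier inflates the plain quantizer's scale

plain = quantize_group(group, bits=4)
# Rotate, quantize in the rotated basis, then rotate back to compare.
rotated = fwht(quantize_group(fwht(group), bits=4))

err_plain = mse(group, plain)
err_rotated = mse(group, rotated)
print(f"plain 4-bit MSE:   {err_plain:.2e}")
print(f"rotated 4-bit MSE: {err_rotated:.2e}")
```

The rotation spreads the outlier's energy over all 256 coordinates, so the shared scale shrinks and the small weights are no longer rounded to zero. A real engine would typically keep the weights in the rotated basis and rotate activations instead of rotating back; the round trip here is only for measuring error.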
About the model
VibeThinker-3B is a 3B-parameter reasoning model fine-tuned on math, competitive programming, and STEM tasks. It generates <think>…</think> reasoning traces before answering. hipfire's CLI strips them by default; raw JSONL clients see the full stream.
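Clients that read the raw stream need to remove the reasoning traces themselves. The sketch below is one way to do that on already-assembled text. The JSONL message layout is not described here, so it assumes the chunks have been concatenated into a single string; `strip_think` is a hypothetical helper, not part of hipfire.

```python
import re

# Non-greedy match so several <think> blocks are removed independently.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove complete <think>...</think> blocks and any unterminated trailing one."""
    text = THINK_RE.sub("", text)
    open_idx = text.find("<think>")
    if open_idx != -1:
        # Generation was cut off mid-trace: drop everything from the open tag.
        text = text[:open_idx]
    return text.strip()


raw = "<think>\n2 + 2 is 4.\n</think>\nThe answer is 4."
print(strip_think(raw))  # prints: The answer is 4.
```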
- Best for: competitive math, LeetCode-style coding, STEM reasoning
- Not recommended for: tool-calling, agentic coding, open-domain knowledge
- Recommended sampling: temperature 0.7–1.0, top_p 0.95
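For readers unfamiliar with the two sampling settings, this is a minimal sketch of temperature scaling followed by top-p (nucleus) filtering over a list of logits. It is a generic illustration in plain Python, not hipfire's sampler; the function name and defaults are hypothetical, with the defaults chosen from the recommended range.

```python
import math
import random


def sample_top_p(logits, temperature=0.7, top_p=0.95, rng=random):
    """Sample a token index using temperature scaling and nucleus (top-p) filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices renormalizes the weights of the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


rng = random.Random(0)
token = sample_top_p([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_p=0.95, rng=rng)
print(token)
```

Lower temperatures sharpen the distribution before the cutoff is applied, so fewer tokens survive the top-p filter; higher temperatures do the opposite.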
See WeiboAI/VibeThinker-3B for benchmarks and full details (MIT license).