# Qwen3-14B Quantization Comparison: Q3_K_S vs Q3_K_M vs Q3_HIFI

## Summary Table

| Metric | Q3_K_S | Q3_K_M | Q3_HIFI |
|--------|--------|--------|---------|
| **File Size** | 6.19 GiB | 6.81 GiB | 6.59 GiB |
| **Bits Per Weight** | 3.60 BPW | 3.96 BPW | 3.83 BPW |
| **Perplexity (↓ better)** | 9.7089 ± 0.075 | 9.5313 ± 0.075 | **9.3788 ± 0.074** |
| **Speed (TPS ↑ better)** | **91.52** | 85.40 | 85.58 |
| **Prompt Eval (tok/s)** | 7,375 | 7,680 | 7,097 |

---

## Detailed Analysis

### 🏆 Q3_HIFI: Best Quality

**Perplexity: 9.3788** (lowest = best)

**Pros:**
- **Best model quality**: 1.6% lower perplexity than Q3_K_M, 3.4% lower than Q3_K_S
- Moderate file size (6.59 GiB), smaller than Q3_K_M
- Uses importance-matrix guided quantization on sensitive layers (53 Q3_HIFI tensors)
- Good balance of quality and size

**Cons:**
- ~6.5% slower inference than Q3_K_S (85.58 vs 91.52 TPS)
- Lowest prompt evaluation throughput of the three (7,097 tok/s)
- Slightly larger than Q3_K_S (0.40 GiB, about 410 MiB)

**Use when:** Quality matters most and you can afford a small speed penalty.

---

### ⚡ Q3_K_S: Fastest & Smallest

**Speed: 91.52 TPS** (highest)

**Pros:**
- **Fastest inference**: about 7% faster than Q3_K_M and Q3_HIFI
- **Smallest file**: 6.19 GiB saves 0.62 GiB (about 635 MiB) vs Q3_K_M
- Lowest VRAM usage (CUDA0: 2,843 MiB vs 3,060-3,186 MiB)
- Simplest tensor composition (mostly q3_K)

**Cons:**
- **Worst perplexity**: 9.7089 (3.5% higher than Q3_HIFI)
- Quality loss may be noticeable on complex tasks

**Use when:** Speed and memory are critical and quality is secondary (e.g., quick prototyping, resource-constrained systems).

---

### ⚖️ Q3_K_M: Middle Ground

**Perplexity: 9.5313**

**Pros:**
- Better quality than Q3_K_S (1.8% lower perplexity)
- Highest prompt evaluation throughput (7,680 tok/s)
- Standard llama.cpp quantization with the widest compatibility

**Cons:**
- **Largest file**: 6.81 GiB
- Lower quality than Q3_HIFI despite being larger
- No generation speed advantage over Q3_HIFI (85.40 vs 85.58 TPS)

**Use when:** You want a standard, well-tested quantization without custom formats.
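The relative differences quoted in this analysis can be reproduced directly from the summary table. The snippet below is a minimal sketch of that arithmetic; `pct_change` is a helper name introduced here for illustration and is not part of llama.cpp or any benchmark tool.

```python
# Values copied from the summary table.
PPL = {"Q3_K_S": 9.7089, "Q3_K_M": 9.5313, "Q3_HIFI": 9.3788}
TPS = {"Q3_K_S": 91.52, "Q3_K_M": 85.40, "Q3_HIFI": 85.58}


def pct_change(value: float, baseline: float) -> float:
    """Signed percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


# Perplexity: a negative result means lower (better) than the baseline.
ppl_hifi_vs_km = pct_change(PPL["Q3_HIFI"], PPL["Q3_K_M"])
ppl_hifi_vs_ks = pct_change(PPL["Q3_HIFI"], PPL["Q3_K_S"])

# Generation speed: a negative result means slower than the baseline.
tps_hifi_vs_ks = pct_change(TPS["Q3_HIFI"], TPS["Q3_K_S"])

print(f"Q3_HIFI PPL vs Q3_K_M: {ppl_hifi_vs_km:+.2f}%")
print(f"Q3_HIFI PPL vs Q3_K_S: {ppl_hifi_vs_ks:+.2f}%")
print(f"Q3_HIFI TPS vs Q3_K_S: {tps_hifi_vs_ks:+.2f}%")
```

Note that the baseline matters: the speed gap is about 6.5% when measured against Q3_K_S and about 7% when measured against Q3_HIFI, which is why both figures can appear for the same pair of numbers.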
---

## Visual Comparison

```
Quality (lower PPL = better)
Q3_HIFI  ████████████████████    9.38  ← BEST
Q3_K_M   █████████████████████   9.53
Q3_K_S   ██████████████████████  9.71

Speed (higher TPS = better)
Q3_K_S   ████████████████████    91.5  ← FASTEST
Q3_HIFI  ██████████████████      85.6
Q3_K_M   ██████████████████      85.4

Size (smaller = better)
Q3_K_S   ████████████████████    6.19 GiB  ← SMALLEST
Q3_HIFI  █████████████████████   6.59 GiB
Q3_K_M   ██████████████████████  6.81 GiB
```

---

## Recommendation

| Priority | Choose |
|----------|--------|
| **Best quality** | **Q3_HIFI**: 3.4% lower PPL than Q3_K_S at about a 6.5% speed cost |
| **Best speed/memory** | **Q3_K_S**: Fastest inference, smallest footprint |
| **Maximum compatibility** | **Q3_K_M**: Standard format, no custom tensor types |

**Bottom line:** Q3_HIFI offers the best quality-to-size trade-off of the three. Its importance-matrix guided quantization on sensitive layers yields the lowest perplexity (9.38) in a file 0.22 GiB smaller than Q3_K_M. Choose Q3_K_S only if you need its speed and memory savings.
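One way to make the quality-to-size claim concrete is to ask how much perplexity each larger format recovers per extra GiB over Q3_K_S. The sketch below uses only numbers from the summary table. The metric itself (perplexity drop per additional GiB) and the helper name `ppl_gain_per_gib` are assumptions introduced here for illustration, not a measure defined by the benchmark.

```python
# Values copied from the summary table; Q3_K_S is used as the baseline.
SIZE_GIB = {"Q3_K_S": 6.19, "Q3_K_M": 6.81, "Q3_HIFI": 6.59}
PPL = {"Q3_K_S": 9.7089, "Q3_K_M": 9.5313, "Q3_HIFI": 9.3788}


def ppl_gain_per_gib(name: str, baseline: str = "Q3_K_S") -> float:
    """Perplexity reduction per additional GiB relative to `baseline`.

    Illustrative metric only: higher means the extra bytes buy more quality.
    """
    extra_gib = SIZE_GIB[name] - SIZE_GIB[baseline]
    ppl_drop = PPL[baseline] - PPL[name]
    return ppl_drop / extra_gib


gains = {name: ppl_gain_per_gib(name) for name in ("Q3_K_M", "Q3_HIFI")}

for name, gain in gains.items():
    print(f"{name}: {gain:.3f} PPL recovered per extra GiB over Q3_K_S")
```

Under this measure Q3_HIFI recovers more perplexity per added GiB than Q3_K_M, which is consistent with the recommendation above. The perplexity figures carry an uncertainty of about ±0.075, so small differences in this ratio should not be over-read.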