Qwen3-14B Quantization Comparison: Q3_K_S vs Q3_K_M vs Q3_HIFI

Summary Table

| Metric | Q3_K_S | Q3_K_M | Q3_HIFI |
|---|---|---|---|
| File Size | 6.19 GiB | 6.81 GiB | 6.59 GiB |
| Bits Per Weight | 3.60 BPW | 3.96 BPW | 3.83 BPW |
| Perplexity (↓ better) | 9.7089 ± 0.075 | 9.5313 ± 0.075 | 9.3788 ± 0.074 |
| Speed (TPS, ↑ better) | 91.52 | 85.40 | 85.58 |
| Prompt Eval (tok/s) | 7,375 | 7,680 | 7,097 |
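
The percentages quoted in the analysis below can be recomputed from this table. The following is a minimal sketch: the dictionary layout and the `pct_change` helper are illustrative names, not part of any benchmark tooling, and the numbers are copied from the table as-is.

```python
# Perplexity and generation speed (TPS), copied from the summary table.
results = {
    "Q3_K_S":  {"ppl": 9.7089, "tps": 91.52},
    "Q3_K_M":  {"ppl": 9.5313, "tps": 85.40},
    "Q3_HIFI": {"ppl": 9.3788, "tps": 85.58},
}


def pct_change(new: float, base: float) -> float:
    """Signed percentage change of `new` relative to `base`."""
    return (new - base) / base * 100.0


# Compare Q3_HIFI against each of the two standard quantizations.
for name in ("Q3_K_M", "Q3_K_S"):
    ppl = pct_change(results["Q3_HIFI"]["ppl"], results[name]["ppl"])
    tps = pct_change(results["Q3_HIFI"]["tps"], results[name]["tps"])
    print(f"Q3_HIFI vs {name}: perplexity {ppl:+.1f}%, speed {tps:+.1f}%")
```

Note that the choice of baseline matters: the Q3_HIFI to Q3_K_S perplexity gap is about 3.4% when measured against Q3_K_S and about 3.5% when measured against Q3_HIFI, which is why both figures appear in this document.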

Detailed Analysis

🏆 Q3_HIFI: Best Quality

Perplexity: 9.3788 (lowest = best)

Pros:

  • Best model quality: 1.6% lower perplexity than Q3_K_M, 3.4% lower than Q3_K_S
  • Moderate file size (6.59 GiB), smaller than Q3_K_M
  • Uses importance-matrix guided quantization on sensitive layers (53 Q3_HIFI tensors)
  • Good balance of quality and size

Cons:

  • ~6.5% slower inference than Q3_K_S (85.58 vs 91.52 TPS)
  • Slightly larger than Q3_K_S (about 410 MiB)

Use when: Quality matters most and you can afford a small speed penalty.
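
One quick sanity check on the quality ranking is whether the reported perplexity ranges overlap. The sketch below treats each ± value as a simple symmetric interval around the mean, which is an assumption about how the uncertainties were computed; the `interval` and `overlaps` helpers are hypothetical names used only for illustration.

```python
# Perplexity and reported uncertainty, copied from the summary table.
ppl = {
    "Q3_HIFI": (9.3788, 0.074),
    "Q3_K_M":  (9.5313, 0.075),
    "Q3_K_S":  (9.7089, 0.075),
}


def interval(name: str) -> tuple[float, float]:
    """Return (low, high) for mean ± error, assuming a symmetric range."""
    mean, err = ppl[name]
    return mean - err, mean + err


def overlaps(a: str, b: str) -> bool:
    """True if the two ± ranges share at least one value."""
    lo_a, hi_a = interval(a)
    lo_b, hi_b = interval(b)
    return lo_a <= hi_b and lo_b <= hi_a


for a, b in (("Q3_HIFI", "Q3_K_M"), ("Q3_K_M", "Q3_K_S")):
    print(f"{a} vs {b}: ranges overlap = {overlaps(a, b)}")
```

Under that assumption none of the adjacent ranges overlap, although the gap between the top of the Q3_HIFI range and the bottom of the Q3_K_M range is under 0.01, so the ordering of those two is the less clear-cut one.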


⚑ Q3_K_S β€” Fastest & Smallest

Speed: 91.52 TPS (highest)

Pros:

  • Fastest inference: about 7% faster than Q3_K_M and Q3_HIFI
  • Smallest file: 6.19 GiB, about 635 MiB smaller than Q3_K_M
  • Lowest VRAM usage (CUDA0: 2,843 MiB vs 3,060-3,186 MiB)
  • Simplest tensor composition (mostly q3_K)

Cons:

  • Worst perplexity: 9.7089 (3.5% higher than Q3_HIFI)
  • Quality degradation may be noticeable on complex tasks

Use when: Speed and memory are critical, quality is secondary (e.g., quick prototyping, resource-constrained systems).
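
The size differences and the bits-per-weight column are related by simple arithmetic. The sketch below converts GiB gaps to MiB and derives the weight count implied by each file size and BPW figure. It ignores GGUF metadata overhead, which is an assumption, and the helper names are illustrative only.

```python
GIB = 2**30  # bytes per GiB
MIB = 2**20  # bytes per MiB

# (file size in GiB, bits per weight), copied from the summary table.
quants = {
    "Q3_K_S":  (6.19, 3.60),
    "Q3_K_M":  (6.81, 3.96),
    "Q3_HIFI": (6.59, 3.83),
}


def size_gap_mib(larger: str, smaller: str) -> float:
    """Difference in file size between two quantizations, in MiB."""
    return (quants[larger][0] - quants[smaller][0]) * GIB / MIB


def implied_params_billions(name: str) -> float:
    """Weight count implied by file size and BPW, ignoring file metadata."""
    size_gib, bpw = quants[name]
    return size_gib * GIB * 8 / bpw / 1e9


print(f"Q3_K_M  - Q3_K_S: {size_gap_mib('Q3_K_M', 'Q3_K_S'):.0f} MiB")
print(f"Q3_HIFI - Q3_K_S: {size_gap_mib('Q3_HIFI', 'Q3_K_S'):.0f} MiB")
for name in quants:
    print(f"{name}: ~{implied_params_billions(name):.2f}B weights")
```

All three rows imply roughly the same weight count (about 14.8 billion), which is what one would expect if the size and BPW columns describe the same underlying model.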


⚖️ Q3_K_M: Middle Ground

Perplexity: 9.5313

Pros:

  • Better quality than Q3_K_S (1.8% lower perplexity)
  • Highest prompt evaluation throughput (7,680 tok/s)
  • Standard llama.cpp quantization with the widest compatibility

Cons:

  • Largest file: 6.81 GiB
  • Lower quality than Q3_HIFI despite being larger
  • No generation-speed advantage over Q3_HIFI (85.40 vs 85.58 TPS)

Use when: You want a standard, well-tested quantization without custom formats.


Visual Comparison

Quality (lower PPL = better)
Q3_HIFI ████████████████████ 9.38  ← BEST
Q3_K_M  █████████████████████ 9.53
Q3_K_S  ██████████████████████ 9.71

Speed (higher TPS = better)
Q3_K_S  ████████████████████ 91.5  ← FASTEST
Q3_HIFI ██████████████████ 85.6
Q3_K_M  ██████████████████ 85.4

Size (smaller = better)
Q3_K_S  ████████████████████ 6.19 GiB  ← SMALLEST
Q3_HIFI █████████████████████ 6.59 GiB
Q3_K_M  ██████████████████████ 6.81 GiB

Recommendation

| Priority | Choose |
|---|---|
| Best quality | Q3_HIFI: 3.4% lower PPL than Q3_K_S at about 6.5% speed cost |
| Best speed/memory | Q3_K_S: fastest inference, smallest footprint |
| Maximum compatibility | Q3_K_M: standard format, no custom tensor types |

Bottom line: Q3_HIFI offers the best quality-to-size ratio. The importance-matrix guided quantization on sensitive layers pays off with measurably lower perplexity while staying smaller than Q3_K_M. Only choose Q3_K_S if you absolutely need the speed/memory savings.
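
The recommendation can also be expressed as a small selection rule: among the quantizations that fit a size budget and a minimum speed, take the one with the lowest perplexity. This is a sketch under stated assumptions; `pick_quant` and its parameters are hypothetical, and it models only the three measured metrics, not format compatibility.

```python
from typing import Optional

# Measurements copied from the summary table.
QUANTS = {
    "Q3_K_S":  {"size_gib": 6.19, "ppl": 9.7089, "tps": 91.52},
    "Q3_K_M":  {"size_gib": 6.81, "ppl": 9.5313, "tps": 85.40},
    "Q3_HIFI": {"size_gib": 6.59, "ppl": 9.3788, "tps": 85.58},
}


def pick_quant(max_size_gib: float = float("inf"),
               min_tps: float = 0.0) -> Optional[str]:
    """Return the lowest-perplexity quantization that meets both limits."""
    candidates = [
        name for name, m in QUANTS.items()
        if m["size_gib"] <= max_size_gib and m["tps"] >= min_tps
    ]
    if not candidates:
        return None  # nothing satisfies the constraints
    return min(candidates, key=lambda name: QUANTS[name]["ppl"])


print(pick_quant())                  # no constraints
print(pick_quant(max_size_gib=6.5))  # tight size budget
print(pick_quant(min_tps=90.0))      # speed floor
```

On these numbers the rule returns Q3_HIFI whenever it fits and Q3_K_S otherwise. Q3_K_M is never selected, because its advantage (a standard format with no custom tensor types) is not one of the metrics the rule considers.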