# Qwen3-14B Quantization Comparison: Q3_K_S vs Q3_K_M vs Q3_HIFI

## Summary Table

| Metric | Q3_K_S | Q3_K_M | Q3_HIFI |
|--------|--------|--------|---------|
| **File Size** | 6.19 GiB | 6.81 GiB | 6.59 GiB |
| **Bits Per Weight** | 3.60 BPW | 3.96 BPW | 3.83 BPW |
| **Perplexity (↓ better)** | 9.7089 ± 0.075 | 9.5313 ± 0.075 | **9.3788 ± 0.074** |
| **Generation Speed (TPS = tokens/s, ↑ better)** | **91.52** | 85.40 | 85.58 |
| **Prompt Eval (tok/s, ↑ better)** | 7,375 | **7,680** | 7,097 |
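
As a sanity check on the table, file size and bits per weight should imply roughly the same parameter count for all three files, since they quantize the same model. The sketch below assumes the reported size is dominated by tensor data (metadata is ignored) and uses only the numbers from the table; the function name is illustrative.

```python
GIB = 1024 ** 3  # bytes per GiB

# (file size in GiB, bits per weight), copied from the summary table
quants = {
    "Q3_K_S": (6.19, 3.60),
    "Q3_K_M": (6.81, 3.96),
    "Q3_HIFI": (6.59, 3.83),
}

def implied_params(size_gib: float, bpw: float) -> float:
    """Parameter count implied by size and BPW: params = bytes * 8 / bpw."""
    return size_gib * GIB * 8 / bpw

implied = {name: implied_params(*row) for name, row in quants.items()}
for name, n in implied.items():
    print(f"{name}: {n / 1e9:.2f}B parameters")
# Q3_K_S: 14.77B parameters
# Q3_K_M: 14.77B parameters
# Q3_HIFI: 14.78B parameters
```

All three rows agree to within 0.1%, so the size and BPW columns are internally consistent.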

---

## Detailed Analysis

### 🏆 Q3_HIFI — Best Quality
**Perplexity: 9.3788** (lowest = best)

**Pros:**
- **Best model quality** — 1.6% better perplexity than Q3_K_M, 3.4% better than Q3_K_S
- Moderate file size (6.59 GiB) — smaller than Q3_K_M
- Uses importance-matrix guided quantization on sensitive layers (53 Q3_HIFI tensors)
- Lower perplexity than Q3_K_M while using 0.13 fewer bits per weight (3.83 vs 3.96 BPW)

**Cons:**
- ~6.5% slower generation than Q3_K_S (85.58 vs 91.52 TPS)
- Larger than Q3_K_S by about 410 MiB (6.59 vs 6.19 GiB)
- Slowest prompt evaluation of the three (7,097 tok/s)
- Custom tensor type, so it needs a runtime build that supports Q3_HIFI

**Use when:** Quality matters most and you can afford a small speed penalty.
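
The percentages in this comparison depend on which value is the reference. A minimal sketch of the arithmetic, using the perplexity and TPS figures from the summary table (the helper name `pct_change` is illustrative):

```python
def pct_change(new: float, ref: float) -> float:
    """Percent change of `new` relative to `ref` (negative means lower)."""
    return (new - ref) / ref * 100

ppl = {"Q3_K_S": 9.7089, "Q3_K_M": 9.5313, "Q3_HIFI": 9.3788}
tps = {"Q3_K_S": 91.52, "Q3_K_M": 85.40, "Q3_HIFI": 85.58}

ppl_vs_k_m = pct_change(ppl["Q3_HIFI"], ppl["Q3_K_M"])
ppl_vs_k_s = pct_change(ppl["Q3_HIFI"], ppl["Q3_K_S"])
hifi_vs_k_s = pct_change(tps["Q3_HIFI"], tps["Q3_K_S"])
k_s_vs_hifi = pct_change(tps["Q3_K_S"], tps["Q3_HIFI"])

print(f"Q3_HIFI PPL vs Q3_K_M: {ppl_vs_k_m:+.1f}%")   # -1.6%
print(f"Q3_HIFI PPL vs Q3_K_S: {ppl_vs_k_s:+.1f}%")   # -3.4%
print(f"Q3_HIFI TPS vs Q3_K_S: {hifi_vs_k_s:+.1f}%")  # -6.5%
print(f"Q3_K_S TPS vs Q3_HIFI: {k_s_vs_hifi:+.1f}%")  # +6.9%
```

The same 5.94 TPS gap reads as about 6.5% slower (Q3_K_S as reference) or about 7% faster (Q3_HIFI as reference).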

---

### ⚡ Q3_K_S — Fastest & Smallest
**Speed: 91.52 TPS** (highest)

**Pros:**
- **Fastest inference**: about 7% faster generation than Q3_K_M and Q3_HIFI
- **Smallest file**: 6.19 GiB, about 635 MiB less than Q3_K_M
- Lowest VRAM usage (CUDA0: 2,843 MiB vs 3,060-3,186 MiB)
- Simplest tensor composition (mostly q3_K)

**Cons:**
- **Worst perplexity** — 9.7089 (3.5% worse than Q3_HIFI)
- The higher perplexity may show up as weaker output on complex tasks (no task-level benchmarks are reported here)

**Use when:** Speed and memory are critical, quality is secondary (e.g., quick prototyping, resource-constrained systems).
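
The file-size and VRAM savings can be recomputed from the reported figures. This sketch converts GiB differences to MiB (1 GiB = 1024 MiB) and derives the VRAM saving as a range, since the source gives the other two formats only as 3,060-3,186 MiB without saying which value belongs to which.

```python
MIB_PER_GIB = 1024

size_gib = {"Q3_K_S": 6.19, "Q3_K_M": 6.81, "Q3_HIFI": 6.59}

def saving_mib(smaller: str, larger: str) -> float:
    """File-size saving of `smaller` over `larger`, in MiB."""
    return (size_gib[larger] - size_gib[smaller]) * MIB_PER_GIB

vs_k_m = saving_mib("Q3_K_S", "Q3_K_M")
vs_hifi = saving_mib("Q3_K_S", "Q3_HIFI")
print(f"Q3_K_S saves {vs_k_m:.0f} MiB vs Q3_K_M")    # 635 MiB
print(f"Q3_K_S saves {vs_hifi:.0f} MiB vs Q3_HIFI")  # 410 MiB

# CUDA0 VRAM figures quoted above, in MiB
vram_k_s, vram_others = 2843, (3060, 3186)
vram_saving = tuple(v - vram_k_s for v in vram_others)
print(f"VRAM saving: {vram_saving[0]}-{vram_saving[1]} MiB")  # 217-343 MiB
```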

---

### ⚖️ Q3_K_M — Middle Ground
**Perplexity: 9.5313**

**Pros:**
- Better quality than Q3_K_S (1.8% lower perplexity)
- Highest prompt evaluation throughput (7,680 tok/s)
- Standard llama.cpp quantization — widest compatibility

**Cons:**
- **Largest file** — 6.81 GiB
- Lower quality than Q3_HIFI despite being larger
- No generation-speed advantage over Q3_HIFI (85.40 vs 85.58 TPS)

**Use when:** You want a standard, well-tested quantization without custom formats.

---

## Visual Comparison

```
Quality (lower PPL = better)
Q3_HIFI ████████████████████ 9.38  ← BEST
Q3_K_M  █████████████████████ 9.53
Q3_K_S  ██████████████████████ 9.71

Speed (higher TPS = better)  
Q3_K_S  ████████████████████ 91.5  ← FASTEST
Q3_HIFI ██████████████████ 85.6
Q3_K_M  ██████████████████ 85.4

Size (smaller = better)
Q3_K_S  ████████████████████ 6.19 GiB  ← SMALLEST
Q3_HIFI █████████████████████ 6.59 GiB
Q3_K_M  ██████████████████████ 6.81 GiB
```
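
Bars like these can be generated from the data. The sketch below is one possible approach, not necessarily how the chart above was produced: it scales each bar linearly from zero against the largest value, with an assumed width of 20 characters.

```python
def bar(value: float, max_value: float, width: int = 20) -> str:
    """Block-character bar whose length is proportional to value / max_value."""
    return "█" * round(value / max_value * width)

tps = {"Q3_K_S": 91.52, "Q3_HIFI": 85.58, "Q3_K_M": 85.40}
top = max(tps.values())

# Sort fastest first and render one line per format
lines = [
    f"{name:<7} {bar(value, top)} {value:.1f}"
    for name, value in sorted(tps.items(), key=lambda kv: -kv[1])
]
print("\n".join(lines))
```

With zero-based scaling the TPS bars come out at 20, 19, and 19 characters. Differences of a few percent are hard to see on such a scale, so the numeric labels carry the comparison.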

---

## Recommendation

| Priority | Choose |
|----------|--------|
| **Best quality** | **Q3_HIFI**: 3.4% lower PPL than Q3_K_S at a ~6.5% generation-speed cost |
| **Best speed/memory** | **Q3_K_S**: fastest inference, smallest footprint |
| **Maximum compatibility** | **Q3_K_M**: standard format, no custom tensor types |

**Bottom line:** Q3_HIFI has the lowest perplexity of the three while being 0.22 GiB smaller than Q3_K_M. The importance-matrix guided quantization on sensitive layers pays off with measurably lower perplexity at a smaller file size. Choose Q3_K_S only if you need its roughly 7% faster generation or its ~410 MiB smaller file, and Q3_K_M only if you need a standard llama.cpp format.
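
One way to make the size-versus-quality argument explicit is a Pareto check: a format is dominated if another is no worse on both file size and perplexity and strictly better on at least one. This is an illustrative sketch using the table values; it deliberately ignores speed and compatibility, which the recommendation table treats separately.

```python
from typing import NamedTuple

class Quant(NamedTuple):
    name: str
    size_gib: float  # lower is better
    ppl: float       # lower is better

def dominates(a: Quant, b: Quant) -> bool:
    """True if a is no worse than b on both metrics and better on at least one."""
    no_worse = a.size_gib <= b.size_gib and a.ppl <= b.ppl
    better = a.size_gib < b.size_gib or a.ppl < b.ppl
    return no_worse and better

quants = [
    Quant("Q3_K_S", 6.19, 9.7089),
    Quant("Q3_K_M", 6.81, 9.5313),
    Quant("Q3_HIFI", 6.59, 9.3788),
]

# Keep only formats that no other format dominates
pareto = [q.name for q in quants if not any(dominates(o, q) for o in quants)]
print(pareto)  # ['Q3_K_S', 'Q3_HIFI']
```

On these two metrics Q3_K_M is dominated by Q3_HIFI, which leaves Q3_K_S (smallest) and Q3_HIFI (lowest perplexity) as the only trade-off points.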