# Qwen3-14B Quantization Comparison Summary

## F16 Baseline Reference
| Metric | Value |
|--------|-------|
| **F16 Perplexity** | 9.0144 |
| **File Size** | 27.51 GiB (16.00 BPW) |

All precision loss percentages below are calculated relative to this F16 baseline.
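
The relative figures quoted throughout can be reproduced with a few lines of Python. This is a minimal sketch, assuming "precision loss" means the percentage increase in perplexity over the F16 baseline; the function name `ppl_increase_pct` is illustrative and not part of any tool.

```python
# Percentage increase in perplexity relative to the F16 baseline.
# Assumption: "precision loss" in this summary is (PPL - PPL_f16) / PPL_f16.
F16_PPL = 9.0144

def ppl_increase_pct(ppl: float, baseline: float = F16_PPL) -> float:
    """Return the perplexity increase over the baseline, in percent."""
    return (ppl - baseline) / baseline * 100.0

# Perplexities without imatrix, as reported in this summary.
results = {"Q4_K_M": 9.1338, "Q4_K_HIFI": 9.1367, "Q4_K_S": 9.1912}

for name, ppl in results.items():
    print(f"{name}: {ppl_increase_pct(ppl):+.1f}% PPL vs F16")
# Q4_K_M: +1.3% PPL vs F16
# Q4_K_HIFI: +1.4% PPL vs F16
# Q4_K_S: +2.0% PPL vs F16
```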

---

## Q4_K_HIFI (INT8 Residuals + Per-Block Scale)
**Pros:**
- 🎯 Uses intelligent outlier preservation with INT8 residuals on critical tensors
- 🔬 11 tensors use Q6_K_HIFI_RES8 format for maximum precision on sensitive weights
- ⚡ Slightly faster than Q4_K_M (73.26 vs 73.04 TPS, 0.3% faster)
- 📊 **+1.4% PPL vs F16** without imatrix, **+1.2% with imatrix**

**Cons:**
- 💾 Larger file size at 8.70 GiB (+3.8% vs Q4_K_M)
- ❌ Slightly higher perplexity than Q4_K_M (9.1367 vs 9.1338)
- 🐢 Slower than Q4_K_S (73.26 vs 76.47 TPS, 4.2% slower)

**Best for:** Experimental/research use cases where INT8 residual format is being evaluated.

## Performance Comparison (Q4_K_HIFI vs the others)

### Q4_K_M

| Metric              | Q4_K_HIFI    | Q4_K_M     | Difference                    |
|---------------------|------------|------------|-------------------------------|
| **Speed (TPS)**     | 73.26      | 73.04      | +0.22 (0.3% faster)           |
| **Perplexity**      | 9.1367     | 9.1338     | +0.003 (0.03% worse)          |
| **PPL vs F16**      | +1.4%      | +1.3%      | 0.1% more precision loss      |
| **File Size**       | 8.70 GiB   | 8.38 GiB   | +0.32 GiB (3.8% larger)       |
| **Bits Per Weight** | 5.06       | 4.87       | +0.19 (3.9% more)             |

**Pros:**
- ⚖️ Traditional "balanced" approach between speed and quality
- 📚 Well-documented, standard quantization method
- 💾 Smaller file size than Q4_K_HIFI
- 🏆 **Best quality** with lowest perplexity of 9.1338
- 📊 **+1.3% PPL vs F16** — lowest precision loss without imatrix ✅

**Cons:**
- ⚡ Marginally slower than Q4_K_HIFI (0.3%)

**Best for:** General-purpose use, with an excellent balance of quality, size, and compatibility.

**Summary:** Q4_K_M delivers marginally better quality (0.03% lower perplexity) than Q4_K_HIFI with 3.8% less storage. At 14B scale, Q4_K_M is the better choice.
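
The Bits Per Weight values in these tables are consistent with scaling the F16 figure (16.00 BPW at 27.51 GiB) by file size. The sketch below shows that arithmetic. It assumes every file stores the same number of weights and that metadata overhead is negligible; the helper name `bits_per_weight` is hypothetical.

```python
# Estimate bits per weight (BPW) from file size, using F16 as the reference.
# Assumptions: identical weight count in every file, negligible metadata.
F16_SIZE_GIB = 27.51
F16_BPW = 16.00

def bits_per_weight(size_gib: float) -> float:
    """Scale the F16 BPW by the ratio of file sizes."""
    return F16_BPW * size_gib / F16_SIZE_GIB

sizes_gib = {"Q4_K_HIFI": 8.70, "Q4_K_M": 8.38, "Q4_K_S": 7.98}

for name, size in sizes_gib.items():
    print(f"{name}: {bits_per_weight(size):.2f} BPW")
# Q4_K_HIFI: 5.06 BPW
# Q4_K_M: 4.87 BPW
# Q4_K_S: 4.64 BPW
```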

### Q4_K_S

| Metric              | Q4_K_HIFI    | Q4_K_S     | Difference                    |
|---------------------|------------|------------|-------------------------------|
| **Speed (TPS)**     | 73.26      | 76.47      | -3.21 (4.2% slower)           |
| **Perplexity**      | 9.1367     | 9.1912     | **-0.055 (0.59% better)**     |
| **PPL vs F16**      | +1.4%      | +2.0%      | 0.6% less precision loss      |
| **File Size**       | 8.70 GiB   | 7.98 GiB   | +0.72 GiB (9.0% larger)       |
| **Bits Per Weight** | 5.06       | 4.64       | +0.42 (9.1% more)             |

**Pros:**
- ⚡ **Fastest inference** at 76.47 TPS (4.4% faster than Q4_K_HIFI; equivalently, Q4_K_HIFI is 4.2% slower)
- 💾 **Smallest file size** at 7.98 GiB
- ✅ Best choice when speed and storage are critical

**Cons:**
- ❌ **Lower quality** with perplexity of 9.1912 (Q4_K_HIFI is 0.59% lower at 9.1367)
- 📊 **+2.0% PPL vs F16** — highest precision loss, but still excellent
- ⚠️ Uses minimal q6_K enhancement — impacts quality on sensitive weights

**Best for:** Extreme resource constraints, quick prototyping, or bulk processing where quality is less important.

**Summary:** Q4_K_HIFI trades a 4.2% speed reduction and 9.0% larger file size for a modest 0.59% improvement in quality over Q4_K_S.
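
Percentage differences depend on which model is taken as the reference, so "X% slower" and "X% faster" are not interchangeable. The following sketch, using the speed and size figures from the table above, illustrates the asymmetry; `pct_diff` is an illustrative helper, not part of any benchmark tool.

```python
# Relative difference of `value` with respect to `reference`, in percent.
def pct_diff(value: float, reference: float) -> float:
    return (value - reference) / reference * 100.0

hifi_tps, s_tps = 73.26, 76.47   # tokens per second
hifi_gib, s_gib = 8.70, 7.98     # file size in GiB

# Speed: the same 3.21 TPS gap reads differently from each side.
print(f"Q4_K_HIFI vs Q4_K_S speed: {pct_diff(hifi_tps, s_tps):+.1f}%")  # -4.2%
print(f"Q4_K_S vs Q4_K_HIFI speed: {pct_diff(s_tps, hifi_tps):+.1f}%")  # +4.4%

# Size: the same 0.72 GiB gap, again from each side.
print(f"Q4_K_HIFI vs Q4_K_S size: {pct_diff(hifi_gib, s_gib):+.1f}%")   # +9.0%
print(f"Q4_K_S vs Q4_K_HIFI size: {pct_diff(s_gib, hifi_gib):+.1f}%")   # -8.3%
```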

---

## Recommendation Matrix

| Priority          | Recommended Model | Rationale                                                                      |
|-------------------|-------------------|--------------------------------------------------------------------------------|
| **Quality First** | Q4_K_M            | Lowest perplexity (9.1338), good balance of size and speed                     |
| **Speed First**   | Q4_K_S            | 4.4% faster than Q4_K_HIFI, acceptable quality degradation                     |
| **Best Balance**  | Q4_K_M            | Best quality with reasonable speed and smaller size than Q4_K_HIFI             |
| **Smallest Size** | Q4_K_S            | 8.3% smaller than Q4_K_HIFI, 4.8% smaller than Q4_K_M                          |

---

## Key Insight

**At 14B scale, Q4_K_HIFI provides no quality benefit over Q4_K_M.** Relative to the other two variants, Q4_K_HIFI has:
- **0.03% higher perplexity** than Q4_K_M (9.1367 vs 9.1338), so Q4_K_M is actually better
- **0.59% lower perplexity** than Q4_K_S (9.1367 vs 9.1912)
- **0.3% higher throughput** than Q4_K_M, but **4.2% lower** than Q4_K_S

**All variants lose only 1.3-2.0% precision vs F16** (PPL 9.0144 baseline), which is outstanding retention.

The INT8 residual format (Q6_K_HIFI_RES8) does not provide meaningful quality improvements at 14B scale, so its additional storage overhead is unjustified.

💡 **Scale Effect:** At 14B scale, the model is highly robust to quantization error: perplexity differences between the three quantization methods are small (~0.03-0.6%). **Q4_K_M offers the best quality/size trade-off.**

---

## Precision Loss Summary (vs F16 Baseline: PPL 9.0144)

| Model     | PPL (no imatrix) | vs F16 | PPL (imatrix) | vs F16 |
|-----------|------------------|--------|---------------|--------|
| Q4_K_M    | 9.1338           | **+1.3%** ✅ | 9.1192 | **+1.2%** ✅ |
| Q4_K_HIFI | 9.1367           | **+1.4%** | 9.1216 | **+1.2%** |
| Q4_K_S    | 9.1912           | **+2.0%** | 9.1410 | **+1.4%** |

**Key Observations:**
- Without imatrix: Q4_K variants lose only 1.3-2.0% precision vs F16 — outstanding retention at 14B scale!
- With imatrix: Precision loss drops to just 1.2-1.4% — virtually F16 quality
- Q4_K_M achieves the **lowest precision loss** both with and without imatrix
- The 14B model shows the best quantization resilience of all model sizes tested

---

## Tensor Distribution

| Model   | q4_K | q5_K | q6_K | Q6_K_HIFI_RES8 | f32 | Total |
|---------|------|------|------|----------------|-----|-------|
| Q4_K_S  | 272  | 9    | 1    | 0              | 161 | 443   |
| Q4_K_M  | 241  | 0    | 41   | 0              | 161 | 443   |
| Q4_K_HIFI | 237  | 0    | 34   | 11             | 161 | 443   |

**Q4_K_HIFI Enhancement:** 11 critical tensors use Q6_K_HIFI_RES8 format with INT8 residuals + per-block scale for maximum precision.
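
To make the idea of "INT8 residuals + per-block scale" concrete, here is a generic sketch of residual quantization. It is not the actual Q6_K_HIFI_RES8 layout, which this summary does not specify: the block size of 32, the symmetric rounding scheme, and all function names are assumptions chosen for illustration only.

```python
import random

def quantize_block(values, bits):
    """Symmetric quantization of one block using a single per-block scale."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize_block(codes, scale):
    return [c * scale for c in codes]

def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

random.seed(0)
block = [random.gauss(0.0, 1.0) for _ in range(32)]  # hypothetical weight block

# Step 1: coarse 6-bit base quantization.
base_codes, base_scale = quantize_block(block, bits=6)
base = dequantize_block(base_codes, base_scale)

# Step 2: quantize what the base missed (the residual) to INT8,
# with its own per-block scale.
residual = [w - b for w, b in zip(block, base)]
res_codes, res_scale = quantize_block(residual, bits=8)

# Step 3: reconstruct as base + dequantized residual.
corrected = [b + r for b, r in zip(base, dequantize_block(res_codes, res_scale))]

err_base = max_abs_error(block, base)
err_corrected = max_abs_error(block, corrected)
print(f"max abs error, base only:       {err_base:.6f}")
print(f"max abs error, base + residual: {err_corrected:.6f}")
```

The residual pass lowers reconstruction error at the cost of extra stored bytes per block, which is the trade-off behind Q4_K_HIFI's larger file size.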

---

## Addendum: Impact of imatrix on the Q4_K Variants

When the models are quantized **with an importance matrix (imatrix)**, perplexity improves for all three variants: by about 0.16-0.17% for Q4_K_M and Q4_K_HIFI, and by 0.55% for Q4_K_S.

### imatrix Perplexity Improvements

| Model   | Without imatrix | With imatrix | Improvement | PPL vs F16 (imatrix) |
|---------|-----------------|--------------|-------------|----------------------|
| Q4_K_HIFI | 9.1367          | **9.1216**   | **-0.015 (0.17% better)** | **+1.2%** |
| Q4_K_M  | 9.1338          | **9.1192**   | **-0.015 (0.16% better)** | **+1.2%** ✅ |
| Q4_K_S  | 9.1912          | **9.1410**   | **-0.050 (0.55% better)** | **+1.4%** |

### Revised Comparison (All with imatrix)

| Model   | PPL (imatrix) | vs Q4_K_HIFI | vs F16 | Size |
|---------|---------------|------------|--------|------|
| Q4_K_M  | 9.1192        | **-0.002 (-0.03%)** ✅ | **+1.2%** ✅ | 8.38 GiB |
| Q4_K_HIFI | 9.1216        | baseline   | +1.2% | 8.70 GiB |
| Q4_K_S  | 9.1410        | +0.019 (+0.21%) | +1.4% | 7.98 GiB |

### Key Findings

**Q4_K_M outperforms Q4_K_HIFI both with and without imatrix:**

| Comparison | Without imatrix | With imatrix |
|------------|-----------------|--------------|
| Q4_K_HIFI vs Q4_K_M | **+0.03%** (Q4_K_M better) | **+0.03%** (Q4_K_M better) |
| Q4_K_HIFI vs Q4_K_S | **-0.59%** (Q4_K_HIFI better) | **-0.21%** (Q4_K_HIFI better) |

### Revised Recommendations (When Using imatrix)

| Priority          | Without imatrix | With imatrix |
|-------------------|-----------------|--------------|
| **Quality First** | **Q4_K_M** ✅   | **Q4_K_M** ✅ |
| **Best Balance**  | **Q4_K_M** ✅   | **Q4_K_M** ✅ |
| **Size/Speed**    | Q4_K_S          | Q4_K_S       |

### Conclusion

**Q4_K_M is the recommended quantization for Qwen3-14B:**
- Q4_K_M has the best perplexity both with and without imatrix
- Q4_K_HIFI's 3.8% size overhead provides no quality benefit
- **Q4_K_M + imatrix** offers the best balance of quality/size/speed/compatibility

**At 14B scale, the INT8 residual format does not provide measurable benefits.**

---

## Appendix (Test Environment Details)

| Component     | Specification                          |
|---------------|----------------------------------------|
| **OS**        | Ubuntu 24.04.3 LTS                     |
| **CPU**       | AMD EPYC 9254 24-Core Processor        |
| **CPU Cores** | 96 logical CPUs (2 threads/core)       |
| **RAM**       | 1.0 TiB                                |
| **GPU**       | NVIDIA L40S × 2                        |
| **VRAM**      | 46068 MiB per GPU                      |
| **CUDA**      | 12.9                                   |
| **Test Data** | wikitext-2-raw, 584 chunks             |
| **Context**   | 512 tokens                             |
| **Samples**   | 100 per speed benchmark                |
| **imatrix**   | mixed-imatrix-dataset.txt, 4697 chunks |