# Qwen3-14B Quantization Comparison Summary

## F16 Baseline Reference

| Metric | Value |
|--------|-------|
| **F16 Perplexity** | 9.0144 |
| **File Size** | 27.51 GiB (16.00 BPW) |

All precision loss percentages below are the relative perplexity (PPL) increase over this F16 baseline.

---

## Q4_K_HIFI (INT8 Residuals + Per-Block Scale)

**Pros:**
- 🎯 Preserves outliers with INT8 residuals on critical tensors
- 🔬 11 tensors use Q6_K_HIFI_RES8 format for maximum precision on sensitive weights
- ⚡ Slightly faster than Q4_K_M (73.26 vs 73.04 TPS, 0.3% faster)
- 📊 **+1.4% PPL vs F16** without imatrix, **+1.2% with imatrix**

**Cons:**
- 💾 Larger file size at 8.70 GiB (+3.8% vs Q4_K_M)
- ❌ Slightly higher perplexity than Q4_K_M (9.1367 vs 9.1338)
- 🐢 Slower than Q4_K_S (73.26 vs 76.47 TPS, 4.2% slower)

**Best for:** Experimental/research use cases where the INT8 residual format is being evaluated.

## Performance Comparison (Q4_K_HIFI vs the others)

### Q4_K_M

| Metric | Q4_K_HIFI | Q4_K_M | Difference |
|---------------------|------------|------------|-------------------------------|
| **Speed (TPS)** | 73.26 | 73.04 | +0.22 (0.3% faster) |
| **Perplexity** | 9.1367 | 9.1338 | +0.003 (0.03% worse) |
| **PPL vs F16** | +1.4% | +1.3% | 0.1% more precision loss |
| **File Size** | 8.70 GiB | 8.38 GiB | +0.32 GiB (3.8% larger) |
| **Bits Per Weight** | 5.06 | 4.87 | +0.19 (3.9% more) |

**Pros:**
- ⚖️ Traditional "balanced" approach between speed and quality
- 📚 Well-documented, standard quantization method
- 💾 Smaller file size than Q4_K_HIFI
- 🏆 **Best quality** with lowest perplexity of 9.1338
- 📊 **+1.3% PPL vs F16** (lowest precision loss without imatrix) ✅

**Cons:**
- ⚡ Marginally slower than Q4_K_HIFI (0.3%)

**Best for:** General-purpose use, with an excellent balance of quality, size, and compatibility.

**Summary:** Q4_K_M delivers marginally better quality (0.03% lower perplexity) than Q4_K_HIFI in a file that is 0.32 GiB smaller (Q4_K_HIFI is 3.8% larger). At 14B scale, Q4_K_M is the better choice.
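The percentages in these comparisons are plain relative differences, so the result depends on which value is the reference. The following is a minimal sketch of that arithmetic using the figures reported in this summary; the helper name `relative_change_pct` is introduced here for illustration only.

```python
def relative_change_pct(value: float, reference: float) -> float:
    """Percentage change of `value` relative to `reference`."""
    return (value - reference) / reference * 100.0


F16_PPL = 9.0144  # F16 baseline perplexity from this summary

# Perplexity without imatrix, as reported in this summary.
ppl = {"Q4_K_HIFI": 9.1367, "Q4_K_M": 9.1338, "Q4_K_S": 9.1912}

for name, value in ppl.items():
    print(f"{name}: {relative_change_pct(value, F16_PPL):+.1f}% PPL vs F16")

# The reference matters: 8.70 GiB is 3.8% larger than 8.38 GiB,
# while 8.38 GiB is 3.7% smaller than 8.70 GiB.
print(f"{relative_change_pct(8.70, 8.38):+.1f}%")
print(f"{relative_change_pct(8.38, 8.70):+.1f}%")
```

The same function covers the speed and bits-per-weight rows; only the choice of reference changes.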
### Q4_K_S

| Metric | Q4_K_HIFI | Q4_K_S | Difference |
|---------------------|------------|------------|-------------------------------|
| **Speed (TPS)** | 73.26 | 76.47 | -3.21 (4.2% slower) |
| **Perplexity** | 9.1367 | 9.1912 | **-0.055 (0.59% better)** |
| **PPL vs F16** | +1.4% | +2.0% | 0.6% less precision loss |
| **File Size** | 8.70 GiB | 7.98 GiB | +0.72 GiB (9.0% larger) |
| **Bits Per Weight** | 5.06 | 4.64 | +0.42 (9.1% more) |

**Pros:**
- ⚡ **Fastest inference** at 76.47 TPS (4.4% faster than Q4_K_HIFI)
- 💾 **Smallest file size** at 7.98 GiB
- ✅ Best choice when speed and storage are critical

**Cons:**
- ❌ **Lower quality** with perplexity of 9.1912 (0.6% higher than Q4_K_HIFI)
- 📊 **+2.0% PPL vs F16**: the highest precision loss of the three, though still small
- ⚠️ Uses minimal q6_K enhancement (1 tensor), which impacts quality on sensitive weights

**Best for:** Extreme resource constraints, quick prototyping, or bulk processing where quality is less important.

**Summary:** Q4_K_HIFI trades a 4.2% speed reduction and 9.0% larger file size for a modest 0.59% improvement in quality over Q4_K_S.

---

## Recommendation Matrix

| Priority | Recommended Model | Rationale |
|-------------------|-------------------|--------------------------------------------------------------------------------|
| **Quality First** | Q4_K_M | Lowest perplexity (9.1338), good balance of size and speed |
| **Speed First** | Q4_K_S | 4.4% faster than Q4_K_HIFI, acceptable quality degradation |
| **Best Balance** | Q4_K_M | Best quality with reasonable speed and smaller size than Q4_K_HIFI |
| **Smallest Size** | Q4_K_S | 8.3% smaller than Q4_K_HIFI, 4.8% smaller than Q4_K_M |

---

## Key Insight

**At 14B scale, Q4_K_HIFI provides no quality benefit over Q4_K_M.**

Q4_K_HIFI compared with the alternatives:
- **0.03% higher perplexity** than Q4_K_M (9.1367 vs 9.1338), so Q4_K_M is actually better
- **0.59% lower perplexity** than Q4_K_S (9.1367 vs 9.1912)
- **0.3% faster** than Q4_K_M, **4.2% slower** than Q4_K_S
- **All variants lose only 1.3-2.0% precision vs F16** (9.0144 baseline)

The INT8 residual format (Q6_K_HIFI_RES8) does not provide meaningful quality improvements at 14B scale. The model is highly robust to quantization error, making the additional storage overhead unjustified.

💡 **Scale Effect:** The perplexity differences between quantization methods are minimal at 14B scale (~0.03-0.6%). **Q4_K_M offers the best quality/size trade-off.**

---

## Precision Loss Summary (vs F16 Baseline: PPL 9.0144)

| Model | PPL (no imatrix) | vs F16 | PPL (imatrix) | vs F16 |
|-----------|------------------|--------|---------------|--------|
| Q4_K_M | 9.1338 | **+1.3%** ✅ | 9.1192 | **+1.2%** ✅ |
| Q4_K_HIFI | 9.1367 | **+1.4%** | 9.1216 | **+1.2%** |
| Q4_K_S | 9.1912 | **+2.0%** | 9.1410 | **+1.4%** |

**Key Observations:**
- Without imatrix: Q4_K variants lose only 1.3-2.0% precision vs F16
- With imatrix: Precision loss drops to 1.2-1.4%, close to F16 quality
- Q4_K_M achieves the **lowest precision loss** both with and without imatrix
- The 14B model shows the best quantization resilience of all model sizes tested

---

## Tensor Distribution

| Model | q4_K | q5_K | q6_K | Q6_K_HIFI_RES8 | f32 | Total |
|-----------|------|------|------|----------------|-----|-------|
| Q4_K_S | 272 | 9 | 1 | 0 | 161 | 443 |
| Q4_K_M | 241 | 0 | 41 | 0 | 161 | 443 |
| Q4_K_HIFI | 237 | 0 | 34 | 11 | 161 | 443 |

**Q4_K_HIFI Enhancement:** 11 critical tensors use Q6_K_HIFI_RES8 format with INT8 residuals + per-block scale for maximum precision.

---

## Addendum: Impact of imatrix on Q4_K_M and Q4_K_S

When Q4_K_M and Q4_K_S are quantized **with an importance matrix (imatrix)**, their perplexity improves by 0.16% and 0.55% respectively.
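The source does not say which tooling produced these files. Assuming the llama.cpp command-line tools (`llama-imatrix` and `llama-quantize`), the imatrix workflow could be sketched as below. The model file names are hypothetical, the commands are only printed rather than executed, and a custom type such as Q4_K_HIFI may require the build that defines it.

```python
import shlex


def imatrix_commands(f16_model, calibration_text, quant_type,
                     imatrix_path="imatrix.dat"):
    """Build (but do not run) the two steps of an imatrix quantization.

    Assumes llama.cpp's `llama-imatrix` and `llama-quantize` binaries;
    all file names passed in are illustrative placeholders.
    """
    out_model = f"model-{quant_type}.gguf"  # hypothetical output name
    # Step 1: collect importance statistics from calibration text.
    collect = ["llama-imatrix", "-m", f16_model,
               "-f", calibration_text, "-o", imatrix_path]
    # Step 2: quantize the F16 model, guided by the collected imatrix.
    quantize = ["llama-quantize", "--imatrix", imatrix_path,
                f16_model, out_model, quant_type]
    return [collect, quantize]


for cmd in imatrix_commands("qwen3-14b-f16.gguf",
                            "mixed-imatrix-dataset.txt", "Q4_K_M"):
    print(shlex.join(cmd))
```

Returning argument lists instead of shell strings keeps the sketch safe to pass to `subprocess.run` later without quoting problems.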
### imatrix Perplexity Improvements

| Model | Without imatrix | With imatrix | Improvement | PPL vs F16 (imatrix) |
|-----------|-----------------|--------------|-------------|----------------------|
| Q4_K_HIFI | 9.1367 | **9.1216** | **-0.015 (0.17% better)** | **+1.2%** |
| Q4_K_M | 9.1338 | **9.1192** | **-0.015 (0.16% better)** | **+1.2%** ✅ |
| Q4_K_S | 9.1912 | **9.1410** | **-0.050 (0.55% better)** | **+1.4%** |

### Revised Comparison (All with imatrix)

| Model | PPL (imatrix) | vs Q4_K_HIFI | vs F16 | Size |
|-----------|---------------|--------------|--------|------|
| Q4_K_M | 9.1192 | **-0.002 (-0.03%)** ✅ | **+1.2%** ✅ | 8.38 GiB |
| Q4_K_HIFI | 9.1216 | baseline | +1.2% | 8.70 GiB |
| Q4_K_S | 9.1410 | +0.019 (+0.21%) | +1.4% | 7.98 GiB |

### Key Findings

**Q4_K_M outperforms Q4_K_HIFI both with and without imatrix:**

| Comparison | Without imatrix | With imatrix |
|------------|-----------------|--------------|
| Q4_K_HIFI vs Q4_K_M | **+0.03%** (Q4_K_M better) | **+0.03%** (Q4_K_M better) |
| Q4_K_HIFI vs Q4_K_S | **-0.59%** (Q4_K_HIFI better) | **-0.21%** (Q4_K_HIFI better) |

### Revised Recommendations (When Using imatrix)

| Priority | Without imatrix | With imatrix |
|-------------------|-----------------|--------------|
| **Quality First** | **Q4_K_M** ✅ | **Q4_K_M** ✅ |
| **Best Balance** | **Q4_K_M** ✅ | **Q4_K_M** ✅ |
| **Size/Speed** | Q4_K_S | Q4_K_S |

### Conclusion

**Q4_K_M is the recommended quantization for Qwen3-14B:**
- Q4_K_M has the best perplexity both with and without imatrix
- Q4_K_HIFI's 3.8% size overhead provides no quality benefit
- **Q4_K_M + imatrix** offers the best balance of quality/size/speed/compatibility

**At 14B scale, the INT8 residual format does not provide measurable benefits.**

---

## Appendix (Test Environment Details)

| Component | Specification |
|---------------|----------------------------------------|
| **OS** | Ubuntu 24.04.3 LTS |
| **CPU** | AMD EPYC 9254 24-Core Processor |
| **CPU Cores** | 96 cores (2 threads/core) |
| **RAM** | 1.0 TiB |
| **GPU** | NVIDIA L40S × 2 |
| **VRAM** | 46068 MiB per GPU |
| **CUDA** | 12.9 |
| **Test Data** | wikitext-2-raw, 584 chunks |
| **Context** | 512 tokens |
| **Samples** | 100 per speed benchmark |
| **imatrix** | mixed-imatrix-dataset.txt, 4697 chunks |
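For readers less familiar with the metric behind the Test Data and Context rows, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below shows that standard definition on toy input; it is not the benchmark's evaluation code, and the input values are made up.

```python
import math


def perplexity(token_logprobs):
    """Return exp of the mean negative log-likelihood (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Toy input, not benchmark data: one 512-token chunk in which every token
# is assigned probability 1/9, so the perplexity is 9 by construction.
toy_chunk = [math.log(1.0 / 9.0)] * 512
print(round(perplexity(toy_chunk), 4))
```

Read this way, a perplexity near 9 means the model is on average about as uncertain as a uniform choice among nine tokens, which is why differences of a few thousandths between quantization types are small.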