---
license: apache-2.0
base_model: Qwen/Qwen2.5-72B-Instruct
tags:
- quantized
- gguf
- 3-bit
- qwen2
model_type: qwen2
quantized_by: aaardpark
---

# Qwen2.5-72B-Instruct: 3-bit GGUF (aaardpark)

Qwen2.5-72B-Instruct quantized to 3-bit using a new importance-weighted quantization method. At 3-bit it retains substantially more quality than round-to-nearest (RTN) or other naive quantization approaches; the tables below give the measured numbers.

## Key Results (Base Model Benchmarks)

| Metric | FP16 | This Quant (3-bit) | RTN 3-bit |
|--------|------|---------------------|-----------|
| **Perplexity** | 2.670 | **3.163** | 3.750 |
| **GSM8K** (5-shot) | 90% | **88%** | 16% |
| **MMLU avg** (5-shot) | 77.6% | **76.8%** | 73.0% |
| TruthfulQA | 58.5% | 56.9% | 56.3% |

Benchmarks were measured on the base model, Qwen2.5-72B, with lm-evaluation-harness. The quantization method is identical for the base and Instruct variants.

### vs Other Quantization Methods

| Method | Bits | PPL (72B) | GSM8K | Notes |
|--------|------|-----------|-------|-------|
| FP16 | 16 | 2.670 | 90% | Baseline |
| **This quant** | **3** | **3.163** | **88%** | **35 GB** |
| RTN 3-bit | 3 | 3.750 | 16% | Standard rounding |
| GPTQ 4-bit | 4 | 3.562\* | n/a | 25% larger file |
| RTN 4-bit | 4 | 2.790 | 88% | 45 GB |
| **This quant (4-bit)** | **4** | **2.747** | **93%** | **Effectively lossless** |

\*The GPTQ 4-bit PPL (3.562) is taken from Qwen2.5-32B as a scaled comparison; it was not measured on the 72B model. On smaller models (7B): GPTQ 3-bit PPL = 12.576, our 3-bit PPL = 6.148. GPTQ is unusable at 3-bit; ours is not.

### GGUF Perplexity (wikitext-2, llama.cpp)

| Variant | PPL |
|---------|-----|
| Base Q8_0 (exact weights) | 3.028 |
| Base Q3_K_M (this format) | 2.904 |
| Instruct Q3_K_M | 3.962 |

## Why This Quant is Different

Standard 3-bit quantization (RTN) rounds every weight to the nearest grid point, treating all weights as equally important. This destroys the precise weight values that control multi-step reasoning: GSM8K drops from 90% to 16%.

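
To make the RTN baseline concrete, here is a minimal pure-Python sketch of per-group round-to-nearest quantization. The function name `rtn_quantize`, the min/max grid, and the random test weights are assumptions for illustration; this is not the code used to produce the RTN numbers in the tables.

```python
import random


def rtn_quantize(weights, bits=3, group_size=128):
    """Quantize and dequantize a flat list of weights with per-group RTN."""
    levels = 2 ** bits - 1  # 7 steps, i.e. 8 grid points at 3-bit
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        # A constant group has zero range; fall back to a dummy scale.
        scale = (hi - lo) / levels or 1.0
        for w in group:
            # Every weight is snapped to its nearest grid point,
            # with no notion of which weights matter more.
            code = min(levels, max(0, round((w - lo) / scale)))
            out.append(lo + code * scale)
    return out


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(256)]
restored = rtn_quantize(weights)
mse = sum((a - b) ** 2 for a, b in zip(weights, restored)) / len(weights)
print(f"distinct values in first group: {len(set(restored[:128]))}")
print(f"mean squared error: {mse:.4f}")
```

Each group of 128 weights ends up with at most 8 distinct values, and the rounding error is spread evenly regardless of how much each weight contributes to the output.
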
Our method uses calibration data to identify which weights are critical for model quality, then allocates quantization precision accordingly. The result: 88% GSM8K at 3-bit, nearly matching FP16.

## Details

- **Method**: Importance-weighted per-group optimization
- **Group size**: 128
- **Quantization time**: ~20 minutes on a single GPU
- **GGUF format**: Q3_K_M (converted via llama.cpp)
- **File size**: 35 GB
- **Context**: 128K tokens

## How to Use

Works with llama.cpp, Ollama, LM Studio, or any GGUF-compatible runtime.

```bash
# llama.cpp
llama-cli -m Qwen2.5-72B-Instruct-aaardpark-Q3_K_M.gguf -ngl 99 -p "Hello!"

# Ollama
ollama run aaardpark/qwen2.5-72b-instruct
```

## Chat Template

This model uses the ChatML template:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```

## Acknowledgments

Built on [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) by Alibaba Cloud.
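
## Example: Building a ChatML Prompt

Chat-oriented runtimes usually apply the template for you, so this is only needed when sending raw text to a completion-style interface. The sketch below assembles the ChatML layout from the Chat Template section by hand; the helper name `build_chatml_prompt` and the message-dictionary shape are assumptions for illustration, not part of any runtime's API.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

The resulting string can be passed as the raw prompt; generation should stop when the model emits `<|im_end|>`.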