---
license: apache-2.0
base_model: Qwen/Qwen2.5-72B-Instruct
tags:
  - quantized
  - gguf
  - 3-bit
  - qwen2
model_type: qwen2
quantized_by: aaardpark
---

# Qwen2.5-72B-Instruct — 3-bit GGUF (aaardpark)

Qwen2.5-72B-Instruct quantized to 3-bit with a new importance-weighted quantization method. At 3-bit it retains far more quality than standard round-to-nearest (RTN) quantization: 88% vs 16% on GSM8K, and a perplexity of 3.163 vs 3.750.

## Key Results (Base Model Benchmarks)

| Metric | FP16 | This Quant (3-bit) | RTN 3-bit |
|--------|------|---------------------|-----------|
| **Perplexity** | 2.670 | **3.163** | 3.750 |
| **GSM8K** (5-shot) | 90% | **88%** | 16% |
| **MMLU avg** (5-shot) | 77.6% | **76.8%** | 73.0% |
| TruthfulQA | 58.5% | 56.9% | 56.3% |

Benchmarks were measured on the Qwen2.5-72B base model with lm-evaluation-harness. The same quantization method is applied to both the base and Instruct variants.
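
For readers comparing the perplexity rows: perplexity is the exponential of the average negative log-likelihood per token, so lower is better. The sketch below is a minimal illustration of that definition, not the lm-evaluation-harness implementation. The helper names `perplexity` and `relative_increase` are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean(log p(token))), using natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_increase(quant_ppl, baseline_ppl):
    """Fractional perplexity increase of a quantized model over its baseline."""
    return (quant_ppl - baseline_ppl) / baseline_ppl


# Perplexity values copied from the Key Results table (FP16 baseline = 2.670).
ours = relative_increase(3.163, 2.670)
rtn = relative_increase(3.750, 2.670)
print(f"this quant: +{ours:.1%} PPL, RTN 3-bit: +{rtn:.1%} PPL")
```

Read this way, the 3-bit quant raises perplexity by roughly 18% over FP16, while RTN 3-bit raises it by roughly 40%.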

### vs Other Quantization Methods

| Method | Bits | PPL (72B) | GSM8K | Notes |
|--------|------|-----------|-------|-------|
| FP16 | 16 | 2.670 | 90% | Baseline |
| **This quant** | **3** | **3.163** | **88%** | **35 GB** |
| RTN 3-bit | 3 | 3.750 | 16% | Standard rounding |
| GPTQ 4-bit | 4 | 3.562* | — | 25% larger file |
| RTN 4-bit | 4 | 2.790 | 88% | 45 GB |
| **This quant (4-bit)** | **4** | **2.747** | **93%** | **Effectively lossless** |

*The GPTQ 4-bit PPL (3.562) was measured on Qwen2.5-32B, not 72B, so it is a scaled comparison rather than a like-for-like one.

On a smaller 7B model, GPTQ 3-bit reaches PPL 12.576 while our 3-bit reaches 6.148, roughly half the perplexity. GPTQ is unusable at 3-bit; ours is not.

### GGUF Perplexity (wikitext-2, llama.cpp)

| Variant | PPL |
|---------|-----|
| Base Q8_0 (exact weights) | 3.028 |
| Base Q3_K_M (this format) | 2.904 |
| Instruct Q3_K_M | 3.962 |

## Why This Quant is Different

Standard 3-bit round-to-nearest (RTN) quantization rounds every weight to the nearest point on a uniform grid, treating all weights as equally important. This destroys the precise weight values that multi-step reasoning depends on: GSM8K drops from 90% to 16%.

Our method uses calibration data to identify which weights are critical for model quality, then allocates quantization precision accordingly. The result is 88% on GSM8K at 3-bit, close to the FP16 score of 90%.
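
To make the contrast concrete, here is a toy sketch of the general idea, not the actual method used for this quant. It compares plain RTN on one group of weights with a simple search that picks the grid range minimizing an importance-weighted squared error. The function names, the range-shrinking search, and the example importance scores are all assumptions made for illustration. A real group here would hold 128 weights, and importance would come from calibration data.

```python
def quantize_group(weights, lo, hi, bits=3):
    """Round each weight to the nearest point on a uniform grid over [lo, hi]."""
    levels = (1 << bits) - 1  # 7 steps, i.e. 8 grid points at 3-bit
    scale = (hi - lo) / levels if hi > lo else 1.0
    dequantized = []
    for w in weights:
        q = max(0, min(levels, round((w - lo) / scale)))  # clamp to the grid
        dequantized.append(lo + q * scale)
    return dequantized


def weighted_error(weights, dequantized, importance):
    """Squared reconstruction error, scaled by each weight's importance."""
    return sum(i * (w - d) ** 2 for w, d, i in zip(weights, dequantized, importance))


def rtn(weights, bits=3):
    """Plain round-to-nearest: the grid spans the full min..max range."""
    return quantize_group(weights, min(weights), max(weights), bits)


def importance_weighted(weights, importance, bits=3, steps=20):
    """Illustrative search: try narrower grids, keep the lowest weighted error."""
    lo, hi = min(weights), max(weights)
    center, half = (lo + hi) / 2, (hi - lo) / 2
    best = rtn(weights, bits)  # start from RTN so the result is never worse
    best_err = weighted_error(weights, best, importance)
    for k in range(1, steps + 1):
        shrink = 1.0 - 0.5 * k / steps  # narrow the range down to 50%
        cand = quantize_group(
            weights, center - half * shrink, center + half * shrink, bits
        )
        err = weighted_error(weights, cand, importance)
        if err < best_err:
            best, best_err = cand, err
    return best


# Hypothetical group: two large but unimportant outliers, six small important weights.
group = [0.9, -0.02, 0.01, 0.03, -0.5, 0.02, 0.0, -0.01]
scores = [0.1, 5.0, 5.0, 5.0, 0.1, 5.0, 5.0, 5.0]

print("RTN weighted error:     ", weighted_error(group, rtn(group), scores))
print("Weighted-search error:  ", weighted_error(group, importance_weighted(group, scores), scores))
```

Because the search starts from the RTN result and only accepts improvements, its weighted error can never exceed RTN's. How much it helps depends on how unevenly importance is distributed within the group.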

## Details
- **Method**: Importance-weighted per-group optimization
- **Group size**: 128
- **Quantization time**: ~20 minutes on a single GPU
- **GGUF format**: Q3_K_M (converted via llama.cpp)
- **File size**: 35 GB
- **Context**: 128K tokens
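
As a rough sanity check on the file size, the sketch below backs out the effective bits per weight from the 35 GB figure. The parameter count of 72.7 billion is an assumption taken from the upstream Qwen2.5-72B model card, and the helper names are hypothetical.

```python
def implied_bits_per_weight(file_size_gb, n_params):
    """Effective bits per weight implied by a file size (1 GB = 1e9 bytes)."""
    return file_size_gb * 1e9 * 8 / n_params


def estimated_size_gb(n_params, bits_per_weight):
    """Approximate file size in GB for a given average bits per weight."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 72.7e9  # assumed parameter count, not stated in this card

print(f"implied bits/weight: {implied_bits_per_weight(35, N_PARAMS):.2f}")
print(f"FP16 size estimate:  {estimated_size_gb(N_PARAMS, 16):.1f} GB")
```

Under that assumption the file works out to a little under 4 bits per weight rather than exactly 3. This is expected for a Q3_K_M GGUF, which stores per-group scales and likely keeps some tensors at higher precision.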

## How to Use

Works with llama.cpp, Ollama, LM Studio, or any GGUF-compatible runtime.

```bash
# llama.cpp
llama-cli -m Qwen2.5-72B-Instruct-aaardpark-Q3_K_M.gguf -ngl 99 -p "Hello!"

# Ollama
ollama run aaardpark/qwen2.5-72b-instruct
```

## Chat Template

This model uses the ChatML template:
```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```
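
If your runtime does not apply the chat template for you, the prompt can be assembled by hand. The helper below is a minimal sketch for a single-turn conversation. `build_chatml_prompt` is a hypothetical name, and the default system prompt is the one shown in the template above.

```python
def build_chatml_prompt(user_prompt, system_prompt="You are a helpful assistant."):
    """Format one system message and one user message as a ChatML prompt.

    The string ends with the opening assistant tag so the model generates
    the assistant reply next.
    """
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


print(build_chatml_prompt("Hello!"))
```

Most GGUF runtimes read the template from the model file's metadata, so manual formatting is usually only needed for raw completion endpoints.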

## Acknowledgments

Built on [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) by Alibaba Cloud.