aaardpark committed on
Commit
5152fbc
·
verified ·
1 Parent(s): d8adc14

Upload README.md with huggingface_hub

  1. README.md +10 -15
README.md CHANGED
@@ -12,7 +12,7 @@ quantized_by: aaardpark
12
 
13
  # Qwen2.5-72B-Instruct — GGUF (aaardpark)
14
 
15
- **35 GB Q3_K_M GGUF. 88% GSM8K, where standard 3-bit quantization (RTN) gets 16% at the same size.**
16
 
17
  > Looking for a smaller version? See [aaardpark/Qwen2.5-32B-Instruct-GGUF](https://huggingface.co/aaardpark/Qwen2.5-32B-Instruct-GGUF) — 15 GB, fits on a 24 GB machine.
18
 
@@ -56,12 +56,12 @@ You are a helpful assistant.<|im_end|>
56
 
57
  ### Base model evaluation (lm-evaluation-harness)
58
 
59
- | Metric | FP16 | This Quant (3-bit) | RTN 3-bit |
60
- |--------|------|---------------------|-----------|
61
- | **Perplexity** (wikitext-2) | 2.670 | **3.163** | 3.750 |
62
- | **GSM8K** (5-shot) | 90% | **88%** | 16% |
63
- | **MMLU avg** (5-shot) | 77.6% | **76.8%** | 73.0% |
64
- | TruthfulQA | 58.5% | 56.9% | 56.3% |
65
 
66
  Measured on Qwen2.5-72B (base) with lm-evaluation-harness. The quantization method is identical for base and Instruct variants.
67
 
@@ -79,20 +79,15 @@ Measured on Qwen2.5-72B (base) with lm-evaluation-harness. The quantization meth
79
  |--------|------|-----------|-------|-------|
80
  | FP16 | 16 | 2.670 | 90% | Baseline |
81
  | **This quant** | **3** | **3.163** | **88%** | **35 GB** |
82
- | RTN 3-bit | 3 | 3.750 | 16% | Standard rounding |
83
- | GPTQ 4-bit | 4 | 3.562* | — | 25% larger file |
84
  | RTN 4-bit | 4 | 2.790 | 88% | 45 GB |
85
  | **This quant (4-bit)** | **4** | **2.747** | **93%** | **Effectively lossless** |
86
 
87
- *GPTQ 4-bit PPL from Qwen2.5-32B (3.562), scaled comparison.
88
-
89
- On smaller models (7B): GPTQ 3-bit PPL = 12.576, our 3-bit PPL = 6.148. GPTQ is unusable at 3-bit; ours is not.
90
-
91
  ## Why this quant is different
92
 
93
- Standard 3-bit quantization (RTN) rounds each weight to the nearest grid point uniformly. This destroys the precise weight values that control multi-step reasoning: GSM8K drops from 90% to 16%.
94
 
95
- Our method uses calibration data to identify which weights are critical for model quality, then allocates quantization precision accordingly. Same bit budget, dramatically different quality.
96
 
97
  ## Which file should I choose?
98
 
 
12
 
13
  # Qwen2.5-72B-Instruct — GGUF (aaardpark)
14
 
15
+ **35 GB Q3_K_M GGUF. 88% GSM8K at 3-bit.**
16
 
17
  > Looking for a smaller version? See [aaardpark/Qwen2.5-32B-Instruct-GGUF](https://huggingface.co/aaardpark/Qwen2.5-32B-Instruct-GGUF) — 15 GB, fits on a 24 GB machine.
18
 
 
56
 
57
  ### Base model evaluation (lm-evaluation-harness)
58
 
59
+ | Metric | FP16 | This Quant (3-bit) |
60
+ |--------|------|--------------------|
61
+ | **Perplexity** (wikitext-2) | 2.670 | **3.163** |
62
+ | **GSM8K** (5-shot) | 90% | **88%** |
63
+ | **MMLU avg** (5-shot) | 77.6% | **76.8%** |
64
+ | TruthfulQA | 58.5% | 56.9% |
65
 
66
  Measured on Qwen2.5-72B (base) with lm-evaluation-harness. The quantization method is identical for base and Instruct variants.
67
 
 
79
  |--------|------|-----------|-------|-------|
80
  | FP16 | 16 | 2.670 | 90% | Baseline |
81
  | **This quant** | **3** | **3.163** | **88%** | **35 GB** |
82
+ | RTN 3-bit | 3 | 3.750 | | Standard rounding |
 
83
  | RTN 4-bit | 4 | 2.790 | 88% | 45 GB |
84
  | **This quant (4-bit)** | **4** | **2.747** | **93%** | **Effectively lossless** |
85
 
 
 
 
 
86
  ## Why this quant is different
87
 
88
+ Standard 3-bit quantization (RTN) rounds each weight to the nearest grid point uniformly. Our method uses calibration data to identify which weights are critical for model quality, then allocates quantization precision accordingly. Same bit budget, better weight choices.
89
 
90
+ The result: 88% GSM8K and 76.8% MMLU at 3-bit, within 2 points of FP16 on both benchmarks.
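
For readers unfamiliar with the RTN baseline named above, here is a minimal sketch of round-to-nearest onto a uniform 3-bit grid. It is illustrative only. The per-group min/max grid, the function name `rtn_quantize`, and the sample weights are assumptions, not details from this README. It does not show the calibration-based method used for this quant or the layout of the GGUF Q3_K_M format.

```python
def rtn_quantize(group, bits=3):
    """Round each weight in one group to the nearest point of a uniform grid.

    Hypothetical helper: the grid spans [min(group), max(group)] with
    2**bits evenly spaced points, which is one common RTN formulation.
    """
    levels = 2 ** bits - 1                    # 7 steps -> 8 grid points at 3-bit
    lo, hi = min(group), max(group)
    scale = (hi - lo) / levels or 1.0         # avoid division by zero for a constant group
    # Integer codes in [0, levels]; every weight is treated identically.
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in group]
    # Reconstructed (dequantized) weights as the model would see them.
    dequant = [lo + c * scale for c in codes]
    return codes, dequant


# Made-up example weights, not taken from the model.
weights = [-0.42, -0.10, 0.03, 0.07, 0.15, 0.31, 0.58, 0.90]
codes, approx = rtn_quantize(weights)
max_err = max(abs(w - a) for w, a in zip(weights, approx))
print(codes)
print(f"max reconstruction error: {max_err:.4f}")
```

Each weight lands at most half a grid step from its original value, no matter how much that weight matters to the output. A calibration-based method spends the same bit budget unevenly instead.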
91
 
92
  ## Which file should I choose?
93