aaardpark committed on
Commit d8adc14 · verified · 1 Parent(s): a6d7cf0

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +69 -41
README.md CHANGED
@@ -10,22 +10,70 @@ model_type: qwen2
10
  quantized_by: aaardpark
11
  ---
12
 
13
- # Qwen2.5-72B-Instruct — 3-bit GGUF (aaardpark)
14
 
15
- Qwen2.5-72B-Instruct quantized to 3-bit using a new importance-weighted quantization method. Produces significantly better quality at 3-bit than standard RTN or naive quantization approaches.
16
 
17
- ## Key Results (Base Model Benchmarks)
18
 
19
  | Metric | FP16 | This Quant (3-bit) | RTN 3-bit |
20
  |--------|------|---------------------|-----------|
21
- | **Perplexity** | 2.670 | **3.163** | 3.750 |
22
  | **GSM8K** (5-shot) | 90% | **88%** | 16% |
23
  | **MMLU avg** (5-shot) | 77.6% | **76.8%** | 73.0% |
24
  | TruthfulQA | 58.5% | 56.9% | 56.3% |
25
 
26
- Benchmarks measured on Qwen2.5-72B (base) with lm-evaluation-harness. The quantization method is identical for both base and Instruct variants.
27
 
28
- ### vs Other Quantization Methods
29
 
30
  | Method | Bits | PPL (72B) | GSM8K | Notes |
31
  |--------|------|-----------|-------|-------|
@@ -40,51 +88,31 @@ Benchmarks measured on Qwen2.5-72B (base) with lm-evaluation-harness. The quanti
40
 
41
  On smaller models (7B): GPTQ 3-bit PPL = 12.576, our 3-bit PPL = 6.148. GPTQ is unusable at 3-bit; ours is not.
42
 
43
- ### GGUF Perplexity (wikitext-2, llama.cpp)
44
-
45
- | Variant | PPL |
46
- |---------|-----|
47
- | Base Q8_0 (exact weights) | 3.028 |
48
- | Base Q3_K_M (this format) | 2.904 |
49
- | Instruct Q3_K_M | 3.962 |
50
-
51
- ## Why This Quant is Different
52
 
53
  Standard 3-bit quantization (RTN) rounds each weight to the nearest grid point uniformly. This destroys the precise weight values that control multi-step reasoning — GSM8K drops from 90% to 16%.
54
 
55
- Our method uses calibration data to identify which weights are critical for model quality, then allocates quantization precision accordingly. The result: 88% GSM8K at 3-bit, nearly matching FP16.
56
 
57
- ## Details
58
- - **Method**: Importance-weighted per-group optimization
59
- - **Group size**: 128
60
- - **Quantization time**: ~20 minutes on a single GPU
61
- - **GGUF format**: Q3_K_M (converted via llama.cpp)
62
- - **File size**: 35 GB
63
- - **Context**: 128K tokens
64
 
65
- ## How to Use
66
 
67
- Works with llama.cpp, Ollama, LM Studio, or any GGUF-compatible runtime.
68
 
69
- ```bash
70
- # llama.cpp
71
- llama-cli -m Qwen2.5-72B-Instruct-aaardpark-Q3_K_M.gguf -ngl 99 -p "Hello!"
72
 
73
- # Ollama
74
- ollama run aaardpark/qwen2.5-72b-instruct
75
- ```
76
 
77
- ## Chat Template
78
 
79
- This model uses the ChatML template:
80
- ```
81
- <|im_start|>system
82
- You are a helpful assistant.<|im_end|>
83
- <|im_start|>user
84
- {prompt}<|im_end|>
85
- <|im_start|>assistant
86
- ```
87
 
88
  ## Acknowledgments
89
 
90
- Built on [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) by Alibaba Cloud.
 
10
  quantized_by: aaardpark
11
  ---
12
 
13
+ # Qwen2.5-72B-Instruct — GGUF (aaardpark)
14
 
15
+ **35 GB Q3_K_M GGUF. 88% GSM8K, where standard 3-bit quantization (RTN) gets 16% at the same size.**
16
 
17
+ > Looking for a smaller version? See [aaardpark/Qwen2.5-32B-Instruct-GGUF](https://huggingface.co/aaardpark/Qwen2.5-32B-Instruct-GGUF) — 15 GB, fits on a 24 GB machine.
18
+
19
+ ## Quick stats
20
+
21
+ | File | Size | BPW | Min RAM | Speed (M5 Max, Metal) |
22
+ |---|---|---|---|---|
23
+ | `Qwen2.5-72B-Instruct-aaardpark-Q3_K_M.gguf` | 35 GB | 3.9 | 48 GB | ~5 tok/s |
24
+
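The 35 GB file size in the table above follows roughly from the bits-per-weight figure. A minimal sketch of that arithmetic, assuming a parameter count of about 72.7 billion for Qwen2.5-72B (that count is an assumption here, not stated in this README) and ignoring GGUF metadata overhead; `gguf_size_gb` is a hypothetical helper name:

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk size in decimal gigabytes."""
    total_bits = n_params * bits_per_weight
    return total_bits / 8 / 1e9  # bits -> bytes -> GB


# Assumption: ~72.7e9 parameters. BPW = 3.9 is taken from the table above.
N_PARAMS = 72.7e9
size_gb = gguf_size_gb(N_PARAMS, bits_per_weight=3.9)
print(f"Estimated file size: {size_gb:.1f} GB")  # prints "Estimated file size: 35.4 GB"
```

The estimate covers the weights only. The KV cache and runtime buffers come on top of it, which is why the minimum RAM figure is higher than the file size.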
25
+ ## How to use
26
+
27
+ ### Download
28
+
29
+ ```bash
30
+ huggingface-cli download aaardpark/Qwen2.5-72B-Instruct-GGUF \
31
+ Qwen2.5-72B-Instruct-aaardpark-Q3_K_M.gguf --local-dir .
32
+ ```
33
+
34
+ ### Run
35
+
36
+ **llama.cpp:**
37
+ ```bash
38
+ llama-cli -m Qwen2.5-72B-Instruct-aaardpark-Q3_K_M.gguf -ngl 99 -p "Hello!"
39
+ ```
40
+
41
+ **LM Studio:** Search for `aaardpark/Qwen2.5-72B-Instruct-GGUF` in the model browser.
42
+
43
+ ## Prompt format
44
+
45
+ This model uses the ChatML template:
46
+
47
+ ```
48
+ <|im_start|>system
49
+ You are a helpful assistant.<|im_end|>
50
+ <|im_start|>user
51
+ {prompt}<|im_end|>
52
+ <|im_start|>assistant
53
+ ```
54
+
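For runtimes that take a raw prompt string rather than applying the chat template themselves, the template above can be filled in programmatically. A minimal sketch; `build_chatml_prompt` is a hypothetical helper, not part of any library, and it assumes a single system message followed by a single user turn:

```python
def build_chatml_prompt(user_prompt: str,
                        system_prompt: str = "You are a helpful assistant.") -> str:
    """Fill the ChatML template shown above for one system + one user turn."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"  # left open so the model writes the reply
    )


prompt = build_chatml_prompt("Hello!")
print(prompt)
```

The final `<|im_start|>assistant` line is deliberately left without a closing `<|im_end|>`, since the model is expected to generate the assistant turn and emit that token itself.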
55
+ ## Benchmarks
56
+
57
+ ### Base model evaluation (lm-evaluation-harness)
58
 
59
  | Metric | FP16 | This Quant (3-bit) | RTN 3-bit |
60
  |--------|------|---------------------|-----------|
61
+ | **Perplexity** (wikitext-2) | 2.670 | **3.163** | 3.750 |
62
  | **GSM8K** (5-shot) | 90% | **88%** | 16% |
63
  | **MMLU avg** (5-shot) | 77.6% | **76.8%** | 73.0% |
64
  | TruthfulQA | 58.5% | 56.9% | 56.3% |
65
 
66
+ Measured on Qwen2.5-72B (base) with lm-evaluation-harness. The quantization method is identical for base and Instruct variants.
67
 
68
+ ### GGUF perplexity (wikitext-2, llama.cpp)
69
+
70
+ | Variant | PPL |
71
+ |---------|-----|
72
+ | Base Q8_0 (exact weights) | 3.028 |
73
+ | Base Q3_K_M (this format) | 2.904 |
74
+ | Instruct Q3_K_M | 3.962 |
75
+
76
+ ### vs other quantization methods
77
 
78
  | Method | Bits | PPL (72B) | GSM8K | Notes |
79
  |--------|------|-----------|-------|-------|
 
88
 
89
  On smaller models (7B): GPTQ 3-bit PPL = 12.576, our 3-bit PPL = 6.148. GPTQ is unusable at 3-bit; ours is not.
90
 
91
+ ## Why this quant is different
92
 
93
  Standard 3-bit quantization (RTN) rounds each weight to the nearest grid point uniformly. This destroys the precise weight values that control multi-step reasoning — GSM8K drops from 90% to 16%.
94
 
95
+ Our method uses calibration data to identify which weights are critical for model quality, then allocates quantization precision accordingly. At the same bit budget, GSM8K recovers from 16% to 88%, within two points of FP16.
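To make the RTN baseline concrete, here is a minimal pure-Python sketch of group-wise round-to-nearest at 3 bits with a group size of 128. It illustrates the standard baseline only and is not the method used for this quant. The min/max grid, the random `importance` scores, and the weighted error metric are all assumptions chosen for illustration; they only show that RTN picks its grid without looking at importance at all.

```python
import random


def rtn_quantize_group(group, bits=3):
    """Round each weight to the nearest point of a uniform min/max grid."""
    levels = 2 ** bits - 1  # 7 steps, so 8 grid points at 3 bits
    lo, hi = min(group), max(group)
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((w - lo) / scale) * scale for w in group]


def rtn_quantize(weights, bits=3, group_size=128):
    """Apply RTN independently to each consecutive group of weights."""
    out = []
    for start in range(0, len(weights), group_size):
        out.extend(rtn_quantize_group(weights[start:start + group_size], bits))
    return out


def weighted_sq_error(original, dequantized, importance):
    """Squared error with a per-weight importance score (hypothetical metric)."""
    return sum(s * (w - q) ** 2
               for w, q, s in zip(original, dequantized, importance))


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]
importance = [random.random() for _ in weights]  # hypothetical scores in [0, 1)

dequantized = rtn_quantize(weights)
plain_error = weighted_sq_error(weights, dequantized, [1.0] * len(weights))
weighted_error = weighted_sq_error(weights, dequantized, importance)
print(f"plain squared error:    {plain_error:.6f}")
print(f"weighted squared error: {weighted_error:.6f}")
```

An importance-aware scheme would choose the grid or the rounding so as to reduce something like the weighted error rather than the plain one. How this quant actually does that is not described beyond the paragraph above.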
96
 
97
+ ## Which file should I choose?
98
 
99
+ This file is 35 GB. Realistic RAM requirements:
100
 
101
+ - **≥64 GB RAM**: comfortable, full 128K context window
102
+ - **48 GB RAM**: works with 16K-32K context
103
+ - **32 GB RAM**: tight, short context only — consider the [32B variant](https://huggingface.co/aaardpark/Qwen2.5-32B-Instruct-GGUF) instead
104
+ - **<32 GB RAM**: use the 32B variant (15 GB)
105
 
106
+ On Apple Silicon with Metal offload (`-ngl 99`), expect ~5 tok/s on M5 Max. NVIDIA GPUs need ~40 GB VRAM for full offload.
107
 
108
+ ## Method
109
 
110
+ Importance-weighted per-group optimization. Calibration data identifies which weights are critical for model quality, then quantization precision is allocated accordingly. ~20 minutes per quant on a single GPU. Output is standard Q3_K_M GGUF format — no custom kernels required.
111
 
112
+ - **Group size**: 128
113
+ - **GGUF format**: Q3_K_M (via llama.cpp)
114
+ - **Context**: 128K tokens
115
 
116
  ## Acknowledgments
117
 
118
+ Built on [Qwen/Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) by Alibaba Cloud.