license: apache-2.0
base_model: Qwen/Qwen2.5-72B-Instruct
tags:
  - quantized
  - gguf
  - 3-bit
  - qwen2
model_type: qwen2
quantized_by: aaardpark

Qwen2.5-72B-Instruct — 3-bit GGUF (aaardpark)

Qwen2.5-72B-Instruct quantized to 3 bits with a new importance-weighted quantization method. On the base model it reaches a perplexity of 3.163 and 88% on GSM8K, compared with 3.750 and 16% for standard 3-bit round-to-nearest (RTN) quantization.

Key Results (Base Model Benchmarks)

| Metric | FP16 | This Quant (3-bit) | RTN 3-bit |
|---|---|---|---|
| Perplexity | 2.670 | 3.163 | 3.750 |
| GSM8K (5-shot) | 90% | 88% | 16% |
| MMLU avg (5-shot) | 77.6% | 76.8% | 73.0% |
| TruthfulQA | 58.5% | 56.9% | 56.3% |

Benchmarks measured on Qwen2.5-72B (base) with lm-evaluation-harness. The quantization method is identical for both base and Instruct variants.
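
To make the perplexity gaps easier to compare, the relative increase over FP16 can be computed directly from the reported numbers. The sketch below uses only the perplexity values from the table above; the helper name `relative_increase` is made up for illustration.

```python
# Perplexity values copied from the Key Results table (Qwen2.5-72B base).
PPL = {"fp16": 2.670, "this_quant_3bit": 3.163, "rtn_3bit": 3.750}


def relative_increase(quantized: float, baseline: float) -> float:
    """Return the fractional increase of `quantized` over `baseline`."""
    return (quantized - baseline) / baseline


for name in ("this_quant_3bit", "rtn_3bit"):
    pct = 100 * relative_increase(PPL[name], PPL["fp16"])
    print(f"{name}: +{pct:.1f}% perplexity vs FP16")
```

With the table's numbers this comes to roughly +18.5% for this quant and +40.4% for RTN 3-bit.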

vs Other Quantization Methods

| Method | Bits | PPL (72B) | GSM8K | Notes |
|---|---|---|---|---|
| FP16 | 16 | 2.670 | 90% | Baseline |
| This quant | 3 | 3.163 | 88% | 35 GB |
| RTN 3-bit | 3 | 3.750 | 16% | Standard rounding |
| GPTQ 4-bit | 4 | 3.562* | | 25% larger file |
| RTN 4-bit | 4 | 2.790 | 88% | 45 GB |
| This quant (4-bit) | 4 | 2.747 | 93% | Effectively lossless |

*The GPTQ 4-bit PPL (3.562) was measured on Qwen2.5-32B, not on the 72B model, so treat that row as a scaled comparison rather than a like-for-like one.

On smaller models (7B), GPTQ 3-bit reaches a PPL of 12.576, about twice our 3-bit PPL of 6.148. GPTQ is unusable at 3-bit; ours is not.

GGUF Perplexity (wikitext-2, llama.cpp)

| Variant | PPL |
|---|---|
| Base Q8_0 (exact weights) | 3.028 |
| Base Q3_K_M (this format) | 2.904 |
| Instruct Q3_K_M | 3.962 |

Why This Quant is Different

Standard 3-bit round-to-nearest (RTN) quantization rounds every weight to the nearest grid point and treats all weights as equally important. This destroys the precise weight values that control multi-step reasoning: GSM8K drops from 90% to 16%.

Our method uses calibration data to identify which weights are critical for model quality, then allocates quantization precision accordingly. The result: 88% GSM8K at 3-bit, nearly matching FP16.
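
The model card does not publish the algorithm, so the following is only a generic sketch of the idea, not the method used for this quant. It compares plain RTN on one group of 128 weights with a simple clip-range search that minimizes an importance-weighted squared error. The importance values, the search over clip ranges, and all function names are assumptions made for illustration.

```python
import random

LEVELS = 2 ** 3  # 3 bits -> 8 grid points per group


def quantize_group(w, lo, hi):
    """Round each weight to the nearest of LEVELS evenly spaced points in [lo, hi]."""
    scale = (hi - lo) / (LEVELS - 1) or 1.0  # fall back to 1.0 for constant groups
    out = []
    for x in w:
        q = round((x - lo) / scale)
        q = min(max(q, 0), LEVELS - 1)  # clamp to the 3-bit range
        out.append(lo + q * scale)
    return out


def weighted_error(w, w_hat, importance):
    """Sum of squared errors, each scaled by that weight's importance."""
    return sum(h * (a - b) ** 2 for a, b, h in zip(w, w_hat, importance))


def rtn(w):
    """Plain round-to-nearest over the full min/max range of the group."""
    return quantize_group(w, min(w), max(w))


def importance_weighted(w, importance, steps=20):
    """Try progressively tighter clip ranges; keep the lowest weighted error."""
    lo, hi = min(w), max(w)
    mid, half = (lo + hi) / 2, (hi - lo) / 2
    best = rtn(w)
    best_err = weighted_error(w, best, importance)
    for i in range(1, steps):
        shrink = 1.0 - 0.5 * i / steps  # range shrinks from 100% toward 50%
        cand = quantize_group(w, mid - half * shrink, mid + half * shrink)
        err = weighted_error(w, cand, importance)
        if err < best_err:
            best, best_err = cand, err
    return best


random.seed(0)
group = [random.gauss(0.0, 0.02) for _ in range(128)]    # one group of 128 weights
importance = [random.random() ** 4 for _ in range(128)]  # hypothetical calibration stats

err_rtn = weighted_error(group, rtn(group), importance)
err_iw = weighted_error(group, importance_weighted(group, importance), importance)
print(f"RTN weighted error:                 {err_rtn:.3e}")
print(f"importance-weighted weighted error: {err_iw:.3e}")
```

Because the search starts from the RTN result and only accepts improvements, its weighted error can never be worse than RTN's on the same group. In a real pipeline the importance values would come from calibration data rather than a random generator.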

Details

  • Method: Importance-weighted per-group optimization
  • Group size: 128
  • Quantization time: ~20 minutes on a single GPU
  • GGUF format: Q3_K_M (converted via llama.cpp)
  • File size: 35 GB
  • Context: 128K tokens
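
The file size listed above implies an effective bit rate somewhat above a nominal 3 bits per weight. The following back-of-the-envelope check assumes a nominal 72 billion parameters and decimal gigabytes, neither of which is stated in the model card. The 16-bit per-group scale is a hypothetical figure used only to show how group size turns into per-weight overhead.

```python
PARAMS = 72e9       # assumption: nominal parameter count implied by "72B"
FILE_BYTES = 35e9   # assumption: "35 GB" read as decimal gigabytes
GROUP_SIZE = 128    # from the Details list

bits_per_weight = FILE_BYTES * 8 / PARAMS
# Hypothetical overhead: one 16-bit scale stored per group of weights.
scale_overhead = 16 / GROUP_SIZE

print(f"effective bits per weight: {bits_per_weight:.2f}")            # 3.89
print(f"16-bit scale per group adds: {scale_overhead:.3f} bits/weight")  # 0.125
```

A figure above 3 is expected: GGUF K-quant formats such as Q3_K_M store per-block scales and keep some tensors at higher precision.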

How to Use

Works with llama.cpp, Ollama, LM Studio, or any GGUF-compatible runtime.

# llama.cpp
llama-cli -m Qwen2.5-72B-Instruct-aaardpark-Q3_K_M.gguf -ngl 99 -p "Hello!"

# Ollama
ollama run aaardpark/qwen2.5-72b-instruct

Chat Template

This model uses the ChatML template:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
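
For runtimes that expect a raw prompt string, the template above can be assembled with a few lines of Python. This is a minimal single-turn sketch; the function name `build_chatml_prompt` is made up for illustration.

```python
def build_chatml_prompt(user_prompt: str,
                        system_prompt: str = "You are a helpful assistant.") -> str:
    """Format one system + user turn in ChatML and open the assistant turn."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


print(build_chatml_prompt("Hello!"))
```

Chat-oriented runtimes typically apply the template stored in the GGUF metadata themselves, so manual formatting like this is mainly useful for raw completion endpoints.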

Acknowledgments

Built on Qwen2.5-72B-Instruct by Alibaba Cloud.