Instructions to use aaardpark/Qwen2.5-72B-Instruct-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use aaardpark/Qwen2.5-72B-Instruct-GGUF with llama-cpp-python:
    # !pip install llama-cpp-python
    from llama_cpp import Llama

    llm = Llama.from_pretrained(
        repo_id="aaardpark/Qwen2.5-72B-Instruct-GGUF",
        filename="Qwen2.5-72B-Instruct-aaardpark-Q3_K_M.gguf",
    )

    # messages must be a list of role/content dicts, not a bare string
    llm.create_chat_completion(
        messages=[{"role": "user", "content": "Hello!"}]
    )
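`create_chat_completion` takes `messages` as a list of role/content dictionaries and returns an OpenAI-style response dictionary. The sketch below shows one way to build that list and pull the reply text out of the response. The helper names `build_messages` and `first_reply` are hypothetical, and `sample_response` is a hand-written stand-in for the response shape, not real model output.

```python
def build_messages(user_text, system_text="You are a helpful assistant."):
    """Return a chat history in the list-of-dicts shape used by chat completion APIs."""
    return [
        {"role": "system", "content": system_text},
        {"role": "user", "content": user_text},
    ]


def first_reply(response):
    """Extract the assistant text from an OpenAI-style chat completion dict."""
    return response["choices"][0]["message"]["content"]


# Hand-written stand-in, only to illustrate the expected response shape.
sample_response = {
    "choices": [{"message": {"role": "assistant", "content": "Hi there!"}}]
}
print(first_reply(sample_response))
```

With a loaded model, the same helpers would be used as `first_reply(llm.create_chat_completion(messages=build_messages("Hello!")))`.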
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use aaardpark/Qwen2.5-72B-Instruct-GGUF with llama.cpp:
Install (macOS, Linux)
    curl -LsSf https://llama.app/install.sh | sh
    # Start a local OpenAI-compatible server with a web UI:
    llama serve -hf aaardpark/Qwen2.5-72B-Instruct-GGUF:Q3_K_M
    # Run inference directly in the terminal:
    llama cli -hf aaardpark/Qwen2.5-72B-Instruct-GGUF:Q3_K_M
Install from WinGet (Windows)
    winget install llama.cpp
    # Start a local OpenAI-compatible server with a web UI:
    llama serve -hf aaardpark/Qwen2.5-72B-Instruct-GGUF:Q3_K_M
    # Run inference directly in the terminal:
    llama cli -hf aaardpark/Qwen2.5-72B-Instruct-GGUF:Q3_K_M
Use pre-built binary
    # Download pre-built binary from:
    # https://github.com/ggerganov/llama.cpp/releases
    # Start a local OpenAI-compatible server with a web UI:
    ./llama-server -hf aaardpark/Qwen2.5-72B-Instruct-GGUF:Q3_K_M
    # Run inference directly in the terminal:
    ./llama-cli -hf aaardpark/Qwen2.5-72B-Instruct-GGUF:Q3_K_M
Build from source code
    git clone https://github.com/ggerganov/llama.cpp.git
    cd llama.cpp
    cmake -B build
    cmake --build build -j --target llama-server llama-cli
    # Start a local OpenAI-compatible server with a web UI:
    ./build/bin/llama-server -hf aaardpark/Qwen2.5-72B-Instruct-GGUF:Q3_K_M
    # Run inference directly in the terminal:
    ./build/bin/llama-cli -hf aaardpark/Qwen2.5-72B-Instruct-GGUF:Q3_K_M
Use Docker
docker model run hf.co/aaardpark/Qwen2.5-72B-Instruct-GGUF:Q3_K_M
- LM Studio
- Jan
- Ollama
How to use aaardpark/Qwen2.5-72B-Instruct-GGUF with Ollama:
ollama run hf.co/aaardpark/Qwen2.5-72B-Instruct-GGUF:Q3_K_M
- Unsloth Studio
How to use aaardpark/Qwen2.5-72B-Instruct-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
    curl -fsSL https://unsloth.ai/install.sh | sh
    # Run unsloth studio
    unsloth studio -H 0.0.0.0 -p 8888
    # Then open http://localhost:8888 in your browser
    # Search for aaardpark/Qwen2.5-72B-Instruct-GGUF to start chatting
Install Unsloth Studio (Windows)
    irm https://unsloth.ai/install.ps1 | iex
    # Run unsloth studio
    unsloth studio -H 0.0.0.0 -p 8888
    # Then open http://localhost:8888 in your browser
    # Search for aaardpark/Qwen2.5-72B-Instruct-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
    # No setup required
    # Open https://huggingface.co/spaces/unsloth/studio in your browser
    # Search for aaardpark/Qwen2.5-72B-Instruct-GGUF to start chatting
- Pi
How to use aaardpark/Qwen2.5-72B-Instruct-GGUF with Pi:
Start the llama.cpp server
    # Install llama.cpp:
    brew install llama.cpp
    # Start a local OpenAI-compatible server:
    llama serve -hf aaardpark/Qwen2.5-72B-Instruct-GGUF:Q3_K_M
Configure the model in Pi
    # Install Pi:
    npm install -g @mariozechner/pi-coding-agent
    # Add to ~/.pi/agent/models.json:
    {
      "providers": {
        "llama-cpp": {
          "baseUrl": "http://localhost:8080/v1",
          "api": "openai-completions",
          "apiKey": "none",
          "models": [
            { "id": "aaardpark/Qwen2.5-72B-Instruct-GGUF:Q3_K_M" }
          ]
        }
      }
    }
Run Pi
    # Start Pi in your project directory:
    pi
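Any OpenAI-style client can talk to the same local server that Pi is pointed at. The following is a minimal standard-library sketch, assuming the server is listening on `http://localhost:8080/v1` as in the Pi configuration and exposes the usual `/chat/completions` route. The function names `build_chat_request` and `ask` are hypothetical.

```python
import json
import urllib.request

# Assumption: same address and model id as in the Pi configuration.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "aaardpark/Qwen2.5-72B-Instruct-GGUF:Q3_K_M"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a POST request for a single-turn chat completion."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request and return the reply text; needs the local server to be running."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example call, once the server is up:
# print(ask("Hello!"))
```

Keeping request construction separate from sending makes it possible to inspect the URL and payload without a running server.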
- Hermes Agent
How to use aaardpark/Qwen2.5-72B-Instruct-GGUF with Hermes Agent:
Start the llama.cpp server
    # Install llama.cpp:
    brew install llama.cpp
    # Start a local OpenAI-compatible server:
    llama serve -hf aaardpark/Qwen2.5-72B-Instruct-GGUF:Q3_K_M
Configure Hermes
    # Install Hermes:
    curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
    hermes setup
    # Point Hermes at the local server:
    hermes config set model.provider custom
    hermes config set model.base_url http://127.0.0.1:8080/v1
    hermes config set model.default aaardpark/Qwen2.5-72B-Instruct-GGUF:Q3_K_M
Run Hermes
hermes
- Atomic Chat
- Docker Model Runner
How to use aaardpark/Qwen2.5-72B-Instruct-GGUF with Docker Model Runner:
docker model run hf.co/aaardpark/Qwen2.5-72B-Instruct-GGUF:Q3_K_M
- Lemonade
How to use aaardpark/Qwen2.5-72B-Instruct-GGUF with Lemonade:
Pull the model
    # Download Lemonade from https://lemonade-server.ai/
    lemonade pull aaardpark/Qwen2.5-72B-Instruct-GGUF:Q3_K_M
Run and chat with the model
lemonade run user.Qwen2.5-72B-Instruct-GGUF-Q3_K_M
List all available models
lemonade list
Upload README.md with huggingface_hub
README.md CHANGED
@@ -10,22 +10,70 @@ model_type: qwen2
 quantized_by: aaardpark
 ---

-# Qwen2.5-72B-Instruct —

-

-

 | Metric | FP16 | This Quant (3-bit) | RTN 3-bit |
 |--------|------|---------------------|-----------|
-| **Perplexity** | 2.670 | **3.163** | 3.750 |
 | **GSM8K** (5-shot) | 90% | **88%** | 16% |
 | **MMLU avg** (5-shot) | 77.6% | **76.8%** | 73.0% |
 | TruthfulQA | 58.5% | 56.9% | 56.3% |

-

-###

 | Method | Bits | PPL (72B) | GSM8K | Notes |
 |--------|------|-----------|-------|-------|
@@ -40,51 +88,31 @@ Benchmarks measured on Qwen2.5-72B (base) with lm-evaluation-harness. The quanti

 On smaller models (7B): GPTQ 3-bit PPL = 12.576, our 3-bit PPL = 6.148. GPTQ is unusable at 3-bit; ours is not.

-##
-
-| Variant | PPL |
-|---------|-----|
-| Base Q8_0 (exact weights) | 3.028 |
-| Base Q3_K_M (this format) | 2.904 |
-| Instruct Q3_K_M | 3.962 |
-
-## Why This Quant is Different

 Standard 3-bit quantization (RTN) rounds each weight to the nearest grid point uniformly. This destroys the precise weight values that control multi-step reasoning — GSM8K drops from 90% to 16%.

-Our method uses calibration data to identify which weights are critical for model quality, then allocates quantization precision accordingly.

-##
-- **Method**: Importance-weighted per-group optimization
-- **Group size**: 128
-- **Quantization time**: ~20 minutes on a single GPU
-- **GGUF format**: Q3_K_M (converted via llama.cpp)
-- **File size**: 35 GB
-- **Context**: 128K tokens

-

-

-``
-# llama.cpp
-llama-cli -m Qwen2.5-72B-Instruct-aaardpark-Q3_K_M.gguf -ngl 99 -p "Hello!"

-#
-ollama run aaardpark/qwen2.5-72b-instruct
-```

-

-
-
-
-You are a helpful assistant.<|im_end|>
-<|im_start|>user
-{prompt}<|im_end|>
-<|im_start|>assistant
-```

 ## Acknowledgments

-Built on [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) by Alibaba Cloud.
@@ -10,22 +10,70 @@ model_type: qwen2
 quantized_by: aaardpark
 ---

+# Qwen2.5-72B-Instruct — GGUF (aaardpark)

+**35 GB Q3_K_M GGUF. 88% GSM8K, where standard 3-bit quantization (RTN) gets 16% at the same size.**

+> Looking for a smaller version? See [aaardpark/Qwen2.5-32B-Instruct-GGUF](https://huggingface.co/aaardpark/Qwen2.5-32B-Instruct-GGUF) — 15 GB, fits on a 24 GB machine.
+
+## Quick stats
+
+| File | Size | BPW | Min RAM | Speed (M5 Max, Metal) |
+|---|---|---|---|---|
+| `Qwen2.5-72B-Instruct-aaardpark-Q3_K_M.gguf` | 35 GB | 3.9 | 48 GB | ~5 tok/s |
+
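The BPW (bits per weight) column can be roughly sanity-checked from the file size. This sketch assumes about 72.7 billion parameters for Qwen2.5-72B and reads "35 GB" as decimal gigabytes; neither figure is stated in the README, so treat the result as an estimate.

```python
FILE_SIZE_GB = 35      # from the Quick stats table
PARAM_COUNT = 72.7e9   # assumption: approximate parameter count of Qwen2.5-72B


def bits_per_weight(file_size_gb, param_count):
    """Average number of bits stored per parameter, treating 1 GB as 10**9 bytes."""
    return file_size_gb * 1e9 * 8 / param_count


print(round(bits_per_weight(FILE_SIZE_GB, PARAM_COUNT), 2))  # 3.85
```

The estimate lands near the 3.9 listed in the table. The exact figure depends on the rounded file size and on whether GB means 10^9 or 2^30 bytes.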
+## How to use
+
+### Download
+
+```bash
+huggingface-cli download aaardpark/Qwen2.5-72B-Instruct-GGUF \
+  Qwen2.5-72B-Instruct-aaardpark-Q3_K_M.gguf --local-dir .
+```
+
+### Run
+
+**llama.cpp:**
+```bash
+llama-cli -m Qwen2.5-72B-Instruct-aaardpark-Q3_K_M.gguf -ngl 99 -p "Hello!"
+```
+
+**LM Studio:** Search for `aaardpark/Qwen2.5-72B-Instruct-GGUF` in the model browser.
+
+## Prompt format
+
+This model uses the ChatML template:
+
+```
+<|im_start|>system
+You are a helpful assistant.<|im_end|>
+<|im_start|>user
+{prompt}<|im_end|>
+<|im_start|>assistant
+```
+
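For runtimes that take a raw prompt string instead of a message list, the ChatML template can be filled in by hand. This is a minimal sketch; `format_chatml` is a hypothetical helper, and it assumes each turn ends with a newline after `<|im_end|>`, as the template's line layout suggests.

```python
def format_chatml(messages):
    """Render role/content dicts as a ChatML prompt ending with an open assistant turn."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    # Leave the assistant turn open so the model writes the reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = format_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

Chat-aware front ends normally apply the template stored in the GGUF metadata, so manual formatting like this is only needed for plain completion calls.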
+## Benchmarks
+
+### Base model evaluation (lm-evaluation-harness)

 | Metric | FP16 | This Quant (3-bit) | RTN 3-bit |
 |--------|------|---------------------|-----------|
+| **Perplexity** (wikitext-2) | 2.670 | **3.163** | 3.750 |
 | **GSM8K** (5-shot) | 90% | **88%** | 16% |
 | **MMLU avg** (5-shot) | 77.6% | **76.8%** | 73.0% |
 | TruthfulQA | 58.5% | 56.9% | 56.3% |

+Measured on Qwen2.5-72B (base) with lm-evaluation-harness. The quantization method is identical for base and Instruct variants.

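To put the table in relative terms, the short calculation below expresses each 3-bit variant as a change from the FP16 baseline. The numbers are copied from the table; the dictionary keys and the function name `relative_increase` are illustrative choices, not part of the README.

```python
# Values copied from the benchmark table.
PERPLEXITY = {"fp16": 2.670, "this_quant": 3.163, "rtn": 3.750}
GSM8K = {"fp16": 90, "this_quant": 88, "rtn": 16}


def relative_increase(value, baseline):
    """Percentage increase of value over baseline."""
    return (value - baseline) / baseline * 100


for name in ("this_quant", "rtn"):
    ppl_up = relative_increase(PERPLEXITY[name], PERPLEXITY["fp16"])
    gsm_drop = GSM8K["fp16"] - GSM8K[name]
    print(f"{name}: perplexity +{ppl_up:.1f}%, GSM8K -{gsm_drop} points")
```

On these figures the quant raises perplexity by about 18.5% and loses 2 GSM8K points, while RTN raises perplexity by about 40.4% and loses 74 points.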
+### GGUF perplexity (wikitext-2, llama.cpp)
+
+| Variant | PPL |
+|---------|-----|
+| Base Q8_0 (exact weights) | 3.028 |
+| Base Q3_K_M (this format) | 2.904 |
+| Instruct Q3_K_M | 3.962 |
+
+### vs other quantization methods

 | Method | Bits | PPL (72B) | GSM8K | Notes |
 |--------|------|-----------|-------|-------|
@@ -40,51 +88,31 @@

 On smaller models (7B): GPTQ 3-bit PPL = 12.576, our 3-bit PPL = 6.148. GPTQ is unusable at 3-bit; ours is not.

+## Why this quant is different

 Standard 3-bit quantization (RTN) rounds each weight to the nearest grid point uniformly. This destroys the precise weight values that control multi-step reasoning — GSM8K drops from 90% to 16%.

+Our method uses calibration data to identify which weights are critical for model quality, then allocates quantization precision accordingly. Same bit budget, dramatically different quality.

+## Which file should I choose?

+This file is 35 GB. Realistic RAM requirements:

+- **≥64 GB RAM**: comfortable, full 128K context window
+- **48 GB RAM**: works with 16K-32K context
+- **32 GB RAM**: tight, short context only — consider the [32B variant](https://huggingface.co/aaardpark/Qwen2.5-32B-Instruct-GGUF) instead
+- **<32 GB RAM**: use the 32B variant (15 GB)

+On Apple Silicon with Metal offload (`-ngl 99`), expect ~5 tok/s on M5 Max. NVIDIA GPUs need ~40 GB VRAM for full offload.

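The RAM guidance above can be written as a small lookup. This is only a sketch: `recommend` is a hypothetical helper, and treating the listed values as lower bounds (so that 56 GB falls in the 48 GB tier) is an assumption about how to read the list.

```python
def recommend(ram_gb):
    """Map available RAM in GB to the README's guidance, using each tier as a lower bound."""
    if ram_gb >= 64:
        return "72B Q3_K_M, full 128K context"
    if ram_gb >= 48:
        return "72B Q3_K_M, 16K-32K context"
    if ram_gb >= 32:
        return "72B Q3_K_M with short context only, or the 32B variant"
    return "32B variant (15 GB)"


for ram in (96, 48, 32, 24):
    print(ram, "GB ->", recommend(ram))
```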
+## Method

+Importance-weighted per-group optimization. Calibration data identifies which weights are critical for model quality, then quantization precision is allocated accordingly. ~20 minutes per quant on a single GPU. Output is standard Q3_K_M GGUF format — no custom kernels required.

+- **Group size**: 128
+- **GGUF format**: Q3_K_M (via llama.cpp)
+- **Context**: 128K tokens

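The README does not spell out the algorithm, so the following is not the author's method. It is a generic toy sketch of the contrast the section draws: plain round-to-nearest (RTN) on one group of 128 weights versus a grid chosen to minimize an importance-weighted error. The clipping search is a swapped-in, commonly used technique, and the importance values are random stand-ins for statistics that would come from calibration data.

```python
import random

BITS = 3
LEVELS = 2 ** BITS   # 8 grid points per group
GROUP_SIZE = 128     # group size listed above


def quantize_group(weights, lo, hi):
    """Round each weight to the nearest of LEVELS evenly spaced points in [lo, hi]."""
    step = (hi - lo) / (LEVELS - 1) or 1.0
    out = []
    for w in weights:
        idx = round((w - lo) / step)
        idx = min(max(idx, 0), LEVELS - 1)  # clamp values outside the grid
        out.append(lo + idx * step)
    return out


def weighted_error(weights, quantized, importance):
    """Sum of squared errors, each scaled by that weight's importance."""
    return sum(i * (w - q) ** 2 for w, q, i in zip(weights, quantized, importance))


def rtn(weights):
    """Plain round-to-nearest on the min/max grid; ignores importance."""
    return quantize_group(weights, min(weights), max(weights))


def importance_weighted(weights, importance,
                        shrinks=(1.0, 0.95, 0.9, 0.85, 0.8, 0.75, 0.7)):
    """Try progressively clipped grids; keep the lowest importance-weighted error."""
    lo, hi = min(weights), max(weights)
    best_err, best_q = None, None
    for s in shrinks:
        margin = (1.0 - s) * (hi - lo) / 2  # s = 1.0 reproduces the RTN grid exactly
        q = quantize_group(weights, lo + margin, hi - margin)
        err = weighted_error(weights, q, importance)
        if best_err is None or err < best_err:
            best_err, best_q = err, q
    return best_q


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(GROUP_SIZE)]
# Hypothetical importance scores; skewed so that a few weights matter most.
importance = [random.random() ** 4 for _ in range(GROUP_SIZE)]

q_rtn = rtn(weights)
q_iw = importance_weighted(weights, importance)
err_rtn = weighted_error(weights, q_rtn, importance)
err_iw = weighted_error(weights, q_iw, importance)
print(f"RTN weighted error:       {err_rtn:.4f}")
print(f"Optimized weighted error: {err_iw:.4f}")
```

Because the unclipped grid is one of the candidates, the optimized error can never exceed the RTN error in this sketch. Both variants still emit at most 8 distinct values per group, which is the sense in which the bit budget is unchanged.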
 ## Acknowledgments

+Built on [Qwen/Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) by Alibaba Cloud.