Qwen2.5-32B-Instruct · GGUF Q4_K_M
Quantized, converted, and evaluated by PBH Applied Systems, LLC: Applied AI/ML Consulting · LLM Optimization & Deployment · Quantized AI Infrastructure
🔬 This repository is part of a production-oriented evaluation series. Every model published under pbhappliedsystems has been independently evaluated using quant_eval v7.21, a proprietary behavioral evaluation harness developed by PBH Applied Systems. Scores measure real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning families, not perplexity or benchmark leaderboard proxies.
⚠️ Single-runner evaluation. No F16 baseline was evaluated for this model. The F16 GGUF (65.5 GB) was produced and its artifact hash is recorded, but it exceeds the VRAM capacity of the evaluation hardware (NVIDIA RTX 4090, 24 GB). All behavioral data in this card comes from the Q4_K_M quantized_llama_cpp runner only. A separate F16 provenance card is published at pbhappliedsystems/qwen-2.5-32B-instruct-gguf-F16.
Try This Model in the Live AI Agent Demo
Launch the PBH Applied Systems AI Agent Demo →
This model is part of the PBH Applied Systems live AI Agent Demo, where visitors can test evaluated quantized open-weight models across production-style agent workflows: reasoning and analysis, document intelligence, and code automation.
The demo uses quant_eval results to show how model selection changes by task. A model that performs well for long-context document analysis may not be the best choice for hard multi-step planning, strict tool-use workflows, or production code generation. Each deployed model is evaluated for practical agent behavior, including coherence, instruction following, reasoning, task completion, structured output reliability, tool-use behavior, and quantization impact.
For this repository, the Q4_K_M variant represents the deployment-focused model: smaller, faster, and more cost-efficient than the F16 baseline. The evaluation results below explain where this quantized model preserves useful behavior, where quantization introduces risk, and what guardrails are recommended before production deployment.
The purpose of the demo is simple: let prospects test the same kind of evaluated quantized models that PBH Applied Systems deploys for real agentic AI systems.
Model Description
This repository contains the 4-bit quantized (Q4_K_M) GGUF of Qwen/Qwen2.5-32B-Instruct, a 32-billion parameter instruction-tuned model from Alibaba Cloud. At 19.9 GB, this is the largest evaluated model in the PBH Applied Systems series that can be run on a single 24 GB GPU.
Key Characteristics
- Parameters: 32B
- Format: GGUF Q4_K_M
- File size: 19.9 GB
- SHA256: 6f810a332a884410aa65cc1b5a128a8603f083b36465acfbbf67a08f50a4d3e3
- Minimum VRAM (GPU inference): ~24 GB
- Recommended GPU tier: RTX 4090 24 GB · A10G 24 GB · A100 40 GB
- Context window: 32,768 tokens
- Inference speed (eval hardware): avg 9.282 sec/case on RTX 4090
- License: Apache 2.0
- F16 equivalent: 65.5 GB; exceeds single-GPU 24 GB VRAM and requires multi-GPU or a large-memory server
PBH Applied Systems Evaluation: quant_eval v7.21
Evaluation conducted by PBH Applied Systems, LLC using quant_eval v7.21
Run ID: 20260221_144732 · Fixtures: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c...) · Seed: 42
Hardware: NVIDIA RTX 4090 · Runner: quantized_llama_cpp (Q4_K_M only) · Total rows: 42
Per-Family Pass Rates (Q4_K_M)
| Family | N | Pass Rate | Avg Secs | Bucket Score | Notes |
|---|---|---|---|---|---|
| json_multistep | 5 | 0.600 | 30.950 | 1.800 | Two failures; see below |
| stateful_followup | 2 | 1.000 | 4.485 | 2.000 | Both turns exact match |
| toolcall_only | 2 | 0.000 | 4.525 | 1.000 | Wrong wrapper keys; see below |
| mixed_brief_json | 2 | 1.000 | 4.365 | 2.000 | Clean ANSWER + JSON |
| toolcall | 2 | 1.000 | 5.615 | 0.000 | Stage-1 passes; EOS on final answer |
| json | 4 | n/a | 10.540 | 10.000 | All pass |
| fuzz | 20 | n/a | 7.562 | 10.000 | All 20 pass |
| mcq | 5 | n/a | 0.744 | 1.000 | 5/5 perfect |
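The per-family timings can be cross-checked against the headline latency figure. The following is a minimal sketch that recomputes the case-weighted mean of the Avg Secs column; the variable names are illustrative and are not part of quant_eval.

```python
# (family, N, avg_secs) copied from the per-family table above.
FAMILY_ROWS = [
    ("json_multistep", 5, 30.950),
    ("stateful_followup", 2, 4.485),
    ("toolcall_only", 2, 4.525),
    ("mixed_brief_json", 2, 4.365),
    ("toolcall", 2, 5.615),
    ("json", 4, 10.540),
    ("fuzz", 20, 7.562),
    ("mcq", 5, 0.744),
]

# Weight each family's average by its case count, then divide by total cases.
total_cases = sum(n for _, n, _ in FAMILY_ROWS)
total_secs = sum(n * secs for _, n, secs in FAMILY_ROWS)
weighted_avg = total_secs / total_cases

print(total_cases)             # 42
print(round(weighted_avg, 3))  # 9.282
```

Both values agree with the card: 42 total rows and an average of 9.282 sec/case. The json_multistep family accounts for a large share of the total time despite having only five cases.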
Key Findings
Finding 1: 32B Underperforms 7B and 14B-1M on json_multistep
This is the most unexpected result in the PBH Applied Systems evaluated series. A larger parameter count does not guarantee better performance on structured multi-step planning tasks under Q4_K_M quantization.
| Model | json_multistep Pass Rate | Failing Cases |
|---|---|---|
| Qwen2.5-3B Q4_K_M | 0.200 | ms_easy_01, ms_easy_02, ms_med_01, ms_med_02 |
| Qwen2.5-7B Q4_K_M | 0.800 | ms_easy_02 only |
| Qwen2.5-14B-1M Q4_K_M | 0.800 | ms_easy_02 only |
| Qwen2.5-32B Q4_K_M | 0.600 | ms_easy_02 + ms_hard_01 |
The 7B and 14B-1M both pass ms_hard_01 cleanly (oracle_equiv_ok=1, checks_consistent_ok=1). The 32B fails it with both checks_consistent_ok=0 and oracle_equiv_ok=0. The model produces an internally inconsistent reasoning chain on the hard case, a failure mode the smaller models do not exhibit on this fixture.
Case-level breakdown:
| Case | Difficulty | Result | Secs | Failure |
|---|---|---|---|---|
| ms_easy_01 | Easy | ✅ | 24.44 | none |
| ms_easy_02 | Easy | ❌ | 28.39 | oracle_equiv_ok=0 only |
| ms_med_01 | Medium | ✅ | 33.92 | none |
| ms_med_02 | Medium | ✅ | 34.07 | none |
| ms_hard_01 | Hard | ❌ | 33.93 | checks_consistent_ok=0, oracle_equiv_ok=0 |
ms_easy_02 fails with checks_consistent_ok=1, oracle_equiv_ok=0: the model reasons consistently but arrives at the wrong final plan. This is the same precision-invariant fixture failure observed across 7B and 14B-1M. ms_hard_01 is the additional failure, with checks_consistent_ok=0 and oracle_equiv_ok=0: the model's intermediate reasoning steps are inconsistent and the final plan is wrong.
What this means for production: When evaluating models purely by parameter count for multi-step planning tasks, the 7B or 14B-1M Q4_K_M variants may outperform the 32B on this class of structured reasoning under quantization. The evaluation data is the source of truth; parameter count is a proxy.
Finding 2: MCQ (Perfect 5/5)
| Case | Result | Raw |
|---|---|---|
| mcq_01 | ✅ | B |
| mcq_02 | ✅ | B |
| mcq_03 | ✅ | C |
| mcq_04 | ✅ | B |
| mcq_05 | ✅ | B |
No A-bias, clean single-letter extraction. Matches Qwen2.5-14B-1M's perfect MCQ result.
Finding 3: toolcall (Correct Arithmetic, EOS Contamination)
Both toolcall cases pass stage-1 (valid tool dispatch) but produce final_mismatch from EOS tokens, the standard Qwen Q4_K_M series pattern:
| Case | Raw | Expected |
|---|---|---|
| tool_01 | `{"tool_name": "add", "args": {"a": 2, "b": 3}}<\|im_end\|> 5<\|im_end\|>` | `5` |
| tool_02 | `{"tool_name": "add", "args": {"a": 10, "b": -4}}<\|im_end\|> 6<\|im_end\|>` | `6` |
Strip <|im_end|> before downstream processing. Arithmetic is correct.
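One way to apply this in a pipeline is to split the raw output on the EOS marker and treat the two remaining segments as the tool call and the final answer. This is a minimal sketch under that assumption; `split_toolcall_output` is a hypothetical helper name, not a quant_eval or llama.cpp function.

```python
import json

EOS = "<|im_end|>"

def split_toolcall_output(raw: str) -> tuple[dict, str]:
    """Split a two-stage toolcall output into (tool_call, final_answer)."""
    # Drop the EOS markers and any empty segments they leave behind.
    parts = [p.strip() for p in raw.split(EOS) if p.strip()]
    if len(parts) != 2:
        raise ValueError(f"expected tool call and final answer, got {len(parts)} segment(s)")
    tool_call = json.loads(parts[0])  # stage 1: the JSON tool dispatch
    return tool_call, parts[1]        # stage 2: the final answer as text

raw = '{"tool_name": "add", "args": {"a": 10, "b": -4}}<|im_end|> 6<|im_end|>'
tool_call, answer = split_toolcall_output(raw)
print(tool_call["args"])  # {'a': 10, 'b': -4}
print(answer)             # 6
```

Raising on an unexpected segment count keeps malformed outputs from being passed downstream silently.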
Finding 4: toolcall_only (Closest to Correct Schema in the Series)
All previous Qwen Q4_K_M models produce "numbers", "operands", or "input" as argument containers, so both the container key and the argument structure or names are wrong. The 32B model produces a different wrong schema, but one that is structurally closer to correct:
| Model | toolonly_01 Raw | Key Names Correct? |
|---|---|---|
| Qwen2.5-3B Q4_K_M | `{"tool": "add", "operands": [5, 10]}` | ❌ operands, array |
| Qwen2.5-7B Q4_K_M | `{"tool": "add", "numbers": [5, 10]}` | ❌ numbers, array |
| Qwen2.5-14B-1M Q4_K_M | `{"tool": "add", "input": {"x": 5, "y": 10}}` | ❌ input, x/y |
| Qwen2.5-32B Q4_K_M | `{"tool": "add", "params": {"a": 5, "b": 10}}` | ⚠️ params, a/b correct |
The 32B model is the only one in the series to combine the correct argument names ("a": 5, "b": 10) with an object container without explicit key-name enforcement. The failure is only in the outer wrapper: "tool" instead of "tool_name", and "params" instead of "args". A minimal system prompt specifying the correct outer key names should resolve this.
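As a complement to prompt-level enforcement, the wrapper keys can also be normalized after generation. The sketch below is one possible approach, not something the evaluation harness does; the alias tuples and `normalize_tool_call` are hypothetical and only cover the wrapper keys observed for the 32B model in the table above.

```python
# Hypothetical alias lists: expected key first, then the variant seen at 32B.
NAME_ALIASES = ("tool_name", "tool")
ARGS_ALIASES = ("args", "params")

def normalize_tool_call(call: dict) -> dict:
    """Rename known wrapper-key aliases to the expected tool_name/args schema."""
    name_key = next((k for k in NAME_ALIASES if k in call), None)
    args_key = next((k for k in ARGS_ALIASES if k in call), None)
    if name_key is None or args_key is None:
        raise ValueError(f"unrecognized tool-call wrapper keys: {sorted(call)}")
    args = call[args_key]
    if not isinstance(args, dict):
        raise ValueError("argument container must be a JSON object")
    return {"tool_name": call[name_key], "args": args}

observed = {"tool": "add", "params": {"a": 5, "b": 10}}
print(normalize_tool_call(observed))
# {'tool_name': 'add', 'args': {'a': 5, 'b': 10}}
```

The array-style outputs from the smaller models are deliberately rejected here, since mapping positional values onto named arguments would require guessing.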
Finding 5: Stateful and Hybrid (Clean at 32B)
stateful_followup (1.000) and mixed_brief_json (1.000) both pass cleanly at expected timing:
| Family | Case | Raw |
|---|---|---|
| stateful | state_01 | `{"counter": 2}<\|im_end\|> {"counter": 5}<\|im_end\|>` |
| stateful | state_02 | `{"items": ["a", "b"]}<\|im_end\|> {"items": ["a", "b", "c"]}<\|im_end\|>` |
| mixed | mixed_01 | `ANSWER: 13 {"a": 4, "b": 9, "sum": 13}<\|im_end\|>` |
| mixed | mixed_02 | `ANSWER: 6 {"a": -2, "b": 8, "sum": 6}<\|im_end\|>` |
EOS tokens are present but extraction works correctly in both families.
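For illustration, a hybrid ANSWER + JSON output of the kind shown above could be extracted as follows. This is a sketch of one possible extractor, not the harness's own parser; `parse_mixed` and the regular expression are assumptions.

```python
import json
import re

# An ANSWER line followed by a single JSON object; whitespace or a newline may separate them.
MIXED_RE = re.compile(r"^ANSWER:\s*(?P<answer>.+?)\s*(?P<json>\{.*\})\s*$", re.DOTALL)

def parse_mixed(raw: str) -> tuple[str, dict]:
    """Return (answer_text, json_payload) from a mixed_brief_json style output."""
    text = raw.replace("<|im_end|>", "").strip()
    match = MIXED_RE.match(text)
    if match is None:
        raise ValueError(f"no ANSWER line followed by a JSON object: {text!r}")
    return match.group("answer"), json.loads(match.group("json"))

answer, payload = parse_mixed('ANSWER: 6 {"a": -2, "b": 8, "sum": 6}<|im_end|>')
print(answer)   # 6
print(payload)  # {'a': -2, 'b': 8, 'sum': 6}
```

A caller could additionally compare `int(answer)` with `payload["sum"]` to confirm that the brief answer and the JSON block agree.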
Signal-Level Diagnostics (Q4_K_M)
json_multistep
| Signal | Rate | Notes |
|---|---|---|
| schema_ok | 1.000 | Perfect |
| checks_consistent_ok | 0.800 | ms_hard_01 fails |
| stop_semantics_ok | 1.000 | Perfect |
| oracle_equiv_ok | 0.600 | ms_easy_02 + ms_hard_01 |
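The signal rates in this table follow directly from the case-level results. The sketch below reconstructs them, assuming that a case passes only when all four signals are 1; that rule is inferred from the reported numbers rather than stated by the harness.

```python
SIGNALS = ("schema_ok", "checks_consistent_ok", "stop_semantics_ok", "oracle_equiv_ok")

# Per-case signal values, in SIGNALS order, reconstructed from the findings above.
CASES = {
    "ms_easy_01": (1, 1, 1, 1),
    "ms_easy_02": (1, 1, 1, 0),
    "ms_med_01": (1, 1, 1, 1),
    "ms_med_02": (1, 1, 1, 1),
    "ms_hard_01": (1, 0, 1, 0),
}

# Rate per signal: fraction of cases where that signal is 1.
signal_rates = {
    name: sum(values[i] for values in CASES.values()) / len(CASES)
    for i, name in enumerate(SIGNALS)
}

# Assumption: a case counts as a pass only if every signal is 1.
pass_rate = sum(all(values) for values in CASES.values()) / len(CASES)

print(signal_rates)
print(pass_rate)  # 0.6
```

Under that assumption the family pass rate equals the oracle_equiv_ok rate, because every case that fails another signal also fails oracle equivalence.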
stateful_followup
| Signal | Rate |
|---|---|
| turn1_parse_ok | 1.000 |
| turn2_parse_ok | 1.000 |
| turn1_exact_match | 1.000 |
| turn2_exact_match | 1.000 |
toolcall_only
| Signal | Rate | Notes |
|---|---|---|
| tool_name_ok | 1.000 | "add" recognized |
| args_ok | 0.000 | "params" instead of "args" |
mixed_brief_json
| Signal | Rate |
|---|---|
| answer_line_ok | 1.000 |
| json_parse_ok | 1.000 |
| schema_ok | 1.000 |
Series Context: Where 32B Fits
| Model | json_multistep | stateful | mixed | MCQ | VRAM (Q4_K_M) |
|---|---|---|---|---|---|
| Qwen2.5-3B | 0.200 | 1.000 | 1.000 | 3/5 | ~4 GB |
| Qwen2.5-7B | 0.800 | 1.000 | 1.000 | 4/5 | ~6 GB |
| Qwen2.5-14B-1M | 0.800 | 1.000 | 1.000 | 5/5 | ~12 GB |
| Qwen2.5-32B | 0.600 | 1.000 | 1.000 | 5/5 | ~24 GB |
The 32B model is the best choice when the primary workload benefits from larger parameter capacity for tasks outside the evaluated battery: generation quality, nuanced reasoning, long-form outputs, or language variety. On the specific evaluated families, the 14B-1M is a stronger choice for structured multi-step planning. Both are strong for stateful, hybrid JSON, and MCQ tasks.
Recommended Use Cases
✅ Deploy with Confidence (Q4_K_M)
- Stateful multi-turn agents: 1.000 at both turns, clean JSON state with strippable EOS.
- Hybrid brief + JSON responses: mixed_brief_json 1.000 in 4.37 sec/case.
- Structured JSON outputs (single-step): json and fuzz both bucket=10.000.
- MCQ and single-choice extraction: 5/5 perfect.
- Complex generation tasks: 32B parameter capacity for nuanced, long-form, or multilingual outputs.
⚠️ Use with Guardrails (Q4_K_M)
- Multi-step planning: 0.600 pass rate. ms_easy_02 and ms_hard_01 fail. Add oracle validation. Consider Qwen2.5-7B or 14B-1M if planning reliability is the primary requirement.
- Scaffolded tool-calling: toolcall stage-1 passes; strip EOS from final answer.
- Bare tool-call dispatch: toolcall_only fails on outer wrapper keys ("tool" vs "tool_name", "params" vs "args"). Specify exact key names in system prompt; the argument names ("a", "b") are already correct at 32B.
Hardware Requirements
| Configuration | VRAM Required | Notes |
|---|---|---|
| Q4_K_M (this repo) | ~24 GB | 19.9 GB model + KV cache |
| Q4_K_M · full context | ~28 GB | Exceeds 24 GB cards such as RTX 4090 and A10G; requires A100 40 GB or larger |
| F16 (provenance only) | ~80 GB+ | Multi-GPU or large memory server |
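The gap between the default and full-context rows is mostly KV cache. The following back-of-the-envelope sketch shows where a figure near 28 GB can come from. The architecture values (64 layers, 8 KV heads, head dimension 128) and the fp16 cache are assumptions taken from the upstream Qwen2.5-32B configuration, not from this card, and compute buffers are ignored.

```python
# Assumed architecture values for Qwen2.5-32B (not stated in this card).
N_LAYERS = 64
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # assumes an unquantized fp16 KV cache
MODEL_GB = 19.9     # Q4_K_M file size from this card

def kv_cache_gb(n_ctx: int) -> float:
    """Approximate KV cache size in GB (10^9 bytes) for n_ctx tokens."""
    # Factor of 2 covers the separate K and V tensors in every layer.
    bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
    return bytes_per_token * n_ctx / 1e9

for n_ctx in (8192, 32768):
    total = MODEL_GB + kv_cache_gb(n_ctx)
    print(f"n_ctx={n_ctx}: KV cache {kv_cache_gb(n_ctx):.1f} GB, weights + cache {total:.1f} GB")
```

Under these assumptions the 8,192-token context used in the usage examples stays within a 24 GB card, while the full 32,768-token window does not.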
Usage
Installation
pip install llama-cpp-python huggingface_hub
For GPU acceleration (CUDA):
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
Python: llama-cpp-python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Note: 19.9 GB download; requires ~24 GB VRAM for full GPU offload
model_path = hf_hub_download(
    repo_id="pbhappliedsystems/qwen-2.5-32B-instruct-gguf-Q4-K-M",
    filename="qwen-2.5-32B-instruct-gguf-Q4-K-M.gguf",
)

llm = Llama(
    model_path=model_path,
    n_ctx=8192,        # context window used for this session
    n_gpu_layers=-1,   # offload all layers to the GPU
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a precise assistant. Follow instructions exactly and return structured outputs when requested.",
        },
        {
            "role": "user",
            "content": "Analyze the following and return a JSON object with keys: summary, risk_level, action_items.",
        },
    ],
    temperature=0.7,
    max_tokens=1024,
)
print(response["choices"][0]["message"]["content"])
For tool-calling with EOS stripping:
import re

# Reuses the `llm` instance created in the previous example.
def call_tool_with_cleanup(prompt: str) -> dict:
    """
    Tool dispatch with EOS stripping.
    quant_eval v7.21: toolcall stage-1 pass=1.000; final_mismatch due to <|im_end|> suffix.
    Arithmetic results are correct; strip EOS before downstream processing.
    """
    response = llm.create_chat_completion(
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a tool-calling assistant. Output the tool call as JSON, "
                    "then on the next line output only the numeric result.\n"
                    'Tool call format: {"tool_name": "<n>", "args": {"a": <n>, "b": <n>}}'
                ),
            },
            {"role": "user", "content": prompt},
        ],
        temperature=0.0,
        max_tokens=128,
    )
    raw = response["choices"][0]["message"]["content"]
    # Remove every EOS marker, then trim surrounding whitespace.
    clean = re.sub(r'<\|im_end\|>', '', raw).strip()
    return {"clean": clean, "raw": raw}

result = call_tool_with_cleanup("Use the add tool to compute 10 minus 4.")
print(result["clean"])
For bare tool-call dispatch, only the outer wrapper keys need correction at 32B:
import json
import re

# Reuses the `llm` instance created in the first Python example.
def call_tool_bare(prompt: str, retries: int = 3) -> dict:
    """
    Bare tool dispatch with schema correction.
    quant_eval v7.21: 32B uses 'params' wrapper and correct 'a'/'b' arg names.
    Only the outer keys need enforcement; arg names are already correct.
    """
    raw = ""
    for _ in range(retries):
        response = llm.create_chat_completion(
            messages=[
                {
                    "role": "system",
                    "content": (
                        'Respond ONLY with a JSON object using EXACTLY these keys:\n'
                        '{"tool_name": "add", "args": {"a": <integer>, "b": <integer>}}\n'
                        'No other text, no markdown.'
                    ),
                },
                {"role": "user", "content": prompt},
            ],
            temperature=0.0,
            max_tokens=64,
        )
        raw = re.sub(r'<\|im_end\|>', '', response["choices"][0]["message"]["content"]).strip()
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: try again
        if not isinstance(parsed, dict):
            continue  # valid JSON but not an object
        # Explicit checks instead of assert, which is skipped under `python -O`.
        args = parsed.get("args")
        if "tool_name" in parsed and isinstance(args, dict) and {"a", "b"} <= args.keys():
            return parsed
    raise ValueError(f"Tool call failed after {retries} attempts. Raw: {raw}")

result = call_tool_bare("Add 5 and 10.")
CLI: llama-cli
llama-cli \
--model qwen-2.5-32B-instruct-gguf-Q4-K-M.gguf \
--chat-template chatml \
--system-prompt "You are a precise assistant. Follow instructions exactly." \
--prompt "Return a JSON object with keys: summary, risk_level, action_items." \
--n-predict 1024 \
--ctx-size 8192 \
--n-gpu-layers -1 \
--temp 0.7
For server deployment:
llama-server \
--model qwen-2.5-32B-instruct-gguf-Q4-K-M.gguf \
--chat-template chatml \
--ctx-size 8192 \
--n-gpu-layers -1 \
--port 8080 \
--host 0.0.0.0
Query via the OpenAI-compatible API:
import re

from openai import OpenAI

# The local llama-server does not check the API key, but the client requires a value.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-required")
response = client.chat.completions.create(
    model="qwen-2.5-32B-instruct-gguf-Q4-K-M",
    messages=[{"role": "user", "content": "Your prompt here"}],
    temperature=0.7,
)
# Strip any trailing EOS marker before using the text downstream.
clean = re.sub(r'<\|im_end\|>', '', response.choices[0].message.content).strip()
print(clean)
Evaluation Artifacts
The full per-case evaluation CSV (comparison_results_v7_21_Qwen2.5_32B_Instruct_20260221_144732.csv) and rollup.json are published in this repository for independent verification.
Artifact Provenance
| Artifact | Format | Size | SHA256 | Evaluated |
|---|---|---|---|---|
| qwen-2.5-32B-instruct-gguf-Q4-K-M.gguf | GGUF Q4_K_M | 19.9 GB | 6f810a332a884410aa65cc1b5a128a8603f083b36465acfbbf67a08f50a4d3e3 | ✅ Yes |
| F16 (companion repo) | GGUF F16 | 65.5 GB | 02e264f0273624b39b0650f8c0583c6d04c320c777780ca5be839999912adf3c | ❌ VRAM constraint |
Both artifacts were produced from Qwen/Qwen2.5-32B-Instruct using a custom-built llama.cpp conversion and quantization pipeline developed by PBH Applied Systems. The F16 GGUF was produced and its provenance is recorded, but behavioral evaluation was not performed due to VRAM constraints.
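A downloaded file can be checked against the recorded hash before use. This is a minimal sketch using only the Python standard library; `sha256_of_file` and `verify_artifact` are illustrative helper names, and the expected value is the Q4_K_M hash from the table above.

```python
import hashlib
from pathlib import Path

# SHA256 of the Q4_K_M artifact as listed in the provenance table.
EXPECTED_SHA256 = "6f810a332a884410aa65cc1b5a128a8603f083b36465acfbbf67a08f50a4d3e3"

def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so a ~20 GB GGUF is never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_artifact(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True if the file's SHA256 matches the expected hex digest."""
    return sha256_of_file(path) == expected.lower()
```

For example, `verify_artifact(Path(model_path))` could be called on the path returned by `hf_hub_download` before the model is loaded.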
Evaluation Methodology
quant_eval v7.21: proprietary behavioral evaluation harness, PBH Applied Systems.
Fixture set: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c079371b02a94f3c149eb78a6adc03dc16ff6833b964fbf4174f0)
| Family | Description | Pass Signals |
|---|---|---|
| fuzz | Property-based regression; structured placement correctness | schema_ok, constraints_ok |
| json | Single-step structured JSON with constraint rules | schema_ok, constraints_ok |
| json_multistep | Multi-step planning with self-check and oracle verification | schema_ok, checks_consistent_ok, stop_semantics_ok, oracle_equiv_ok |
| mcq | Multiple-choice extraction | choice_ok |
| stateful_followup | Two-turn state tracking; turn-2 correct given turn-1 | turn1_parse_ok, turn2_parse_ok, turn1_exact_match, turn2_exact_match |
| mixed_brief_json | Hybrid: natural language answer + valid JSON block | answer_line_ok, json_parse_ok, schema_ok |
| toolcall | Tool call embedded in response; parse + schema validation | stage1_tool_parse_ok, stage1_tool_schema_ok |
| toolcall_only | Bare schema-only tool call; strict tool name + args check | tool_name_ok, args_ok |
Evaluation hardware: NVIDIA RTX 4090 (24 GB) · Evaluation date: February 21, 2026 · Seed: 42
🔬 About quant_eval & This Evaluation Series
quant_eval is a proprietary behavioral evaluation harness developed by PBH Applied Systems, LLC. It measures real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning, not perplexity or leaderboard proxies. Every model published under pbhappliedsystems has been independently evaluated using quant_eval before being recommended for any production role.
See it in action: Live AI Agent Demo → The demo runs production-style agent workflows powered by open-weight models selected through the quant_eval evaluation pipeline.
Need a deployment recommendation? Not sure which quantization level is right for your hardware, latency target, or agent type? → pbhappliedsystems.com
Evaluated and published by PBH Applied Systems, LLC · patrick@pbhappliedsystems.com
About PBH Applied Systems
PBH Applied Systems, LLC is an Oklahoma City-based applied machine learning and AI systems company specializing in production-grade model evaluation, quantization pipelines, agentic AI infrastructure, and scalable AI-driven application development.
Patrick Hill, M.S., Founder · Data Scientist · AI/ML Engineer · Author of Applied Machine Learning: Concepts, Tools, and Case Studies (required reading, UAT CSC 373)
Core Service Areas: LLM Optimization & Deployment · AI Evaluation Frameworks · Agentic AI Infrastructure · Scalable AI Application Development · ML Pipeline Design & Analytics · Model & Agent Cataloging
Work With PBH Applied Systems
The 32B result is a concrete example of why systematic evaluation matters: if you're selecting a model purely on parameter count to maximize structured planning reliability, this evaluation shows the 7B and 14B-1M Q4_K_M variants are the stronger choices at significantly lower hardware cost. The 32B earns its place for generation quality and complex tasks outside this battery, but the evaluation tells you exactly where and where not to depend on it.
Book a Scoping Call · Request an Evaluation Report (from $2,500)
Connect
- Website: pbhappliedsystems.com
- Email: patrick@pbhappliedsystems.com
- YouTube
License
This GGUF repository inherits the license of the base model:
Apache 2.0: Qwen/Qwen2.5-32B-Instruct
The quant_eval evaluation methodology, fixture set, and scoring framework are proprietary to PBH Applied Systems, LLC and are not included in this repository.
GGUF conversion, quantization, and behavioral evaluation performed by PBH Applied Systems, LLC · quant_eval v7.21 · Run ID: 20260221_144732