Qwen2.5-14B-Instruct-1M · GGUF Q4_K_M
Quantized, converted, and evaluated by PBH Applied Systems, LLC: Applied AI/ML Consulting · LLM Optimization & Deployment · Quantized AI Infrastructure
This repository is part of a production-oriented evaluation series. Every model published under pbhappliedsystems has been independently evaluated using quant_eval v7.21, a proprietary behavioral evaluation harness developed by PBH Applied Systems. Scores measure real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning families, not perplexity or benchmark leaderboard proxies.
Top scores in the evaluated series. Qwen2.5-14B-Instruct-1M Q4_K_M achieves the highest reasoning score (0.9907) and highest instruction-following score (0.9902) across all 8 models evaluated. It is also the only model in the series to achieve perfect MCQ extraction (5/5) and full toolcall accuracy (bucket_score=11) at both F16 and Q4_K_M precision levels.
Try This Model in the Live AI Agent Demo
Launch the PBH Applied Systems AI Agent Demo →
This model is part of the PBH Applied Systems live AI Agent Demo, where visitors can test evaluated quantized open-weight models across production-style agent workflows: reasoning and analysis, document intelligence, and code automation.
The demo uses quant_eval results to show how model selection changes by task. A model that performs well for long-context document analysis may not be the best choice for hard multi-step planning, strict tool-use workflows, or production code generation. Each deployed model is evaluated for practical agent behavior, including coherence, instruction following, reasoning, task completion, structured output reliability, tool-use behavior, and quantization impact.
For this repository, the Q4_K_M variant represents the deployment-focused model: smaller, faster, and more cost-efficient than the F16 baseline. The evaluation results below explain where this quantized model preserves useful behavior, where quantization introduces risk, and what guardrails are recommended before production deployment.
The purpose of the demo is simple: let prospects test the same kind of evaluated quantized models that PBH Applied Systems deploys for real agentic AI systems.
Model Description
This repository contains the 4-bit quantized (Q4_K_M) GGUF of Qwen/Qwen2.5-14B-Instruct-1M, a 14-billion parameter instruction-tuned model from Alibaba Cloud featuring a 1,000,000-token (1M) context window, the largest context window in the PBH Applied Systems evaluated series by a factor of more than 30×.
The full-precision F16 baseline is published separately at pbhappliedsystems/qwen-2.5-14B-instruct-1m-gguf-F16.
Key Characteristics
- Parameters: 14B
- Format: GGUF Q4_K_M
- File size: 8.99 GB
- SHA256: 5ad529ff2b1b192f31c8a638fe8756a0c628904e2ded797c11f9194216976973
- Context window: 1,000,000 tokens
- Minimum VRAM (GPU inference): ~12 GB (short context); scales with context length
- Recommended GPU tier: A10G 24 GB · RTX 4090 · T4 16 GB (short context)
- Q4_K_M avg inference time (eval hardware): 2.683 sec/case on RTX 4090
- License: Apache 2.0
Context window and VRAM: The 1M context window requires substantial KV cache VRAM at full utilization. At ~8K tokens the model fits the ~12 GB baseline. At 128K tokens expect ~26 GB total. At 1M tokens expect ~80+ GB and multi-GPU or CPU offload configurations. For most production deployments, set n_ctx to the actual context length needed, not the maximum.
PBH Applied Systems Evaluation: quant_eval v7.21
Evaluation conducted by PBH Applied Systems, LLC using quant_eval v7.21
Run ID: 20260210_235131 · Fixtures: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c...) · Seed: 42
Hardware: NVIDIA RTX 4090 · Total rows evaluated: 84 (42 F16 · 42 Q4_K_M)
Aggregate Scores (Q4_K_M)
Scores are normalized to [0.0, 1.0]. Higher is better.
| Dimension | Score | Series Rank |
|---|---|---|
| Task Completion | 0.6857 | Mid-tier |
| Reasoning | 0.9907 | #1 in series |
| Coherence | 0.9259 | Top tier |
| Instruction Following | 0.9902 | #1 in series |
| Avg inference time | 2.683 sec/case | n/a |
Per-Family Pass Rates
A defining characteristic of this evaluation: the F16 and Q4_K_M runners produce identical pass rates across every single family. This is the only model in the evaluated series where precision level has zero measurable impact on structured behavioral outcomes.
F16 Baseline (full_weight_transformers)
| Family | N | Pass Rate | Avg Secs | Bucket Score | Notes |
|---|---|---|---|---|---|
| json_multistep | 5 | 0.800 | 133.21 | 2.200 | ms_easy_02 only failure |
| stateful_followup | 2 | 1.000 | 13.75 | 2.000 | Both turns exact match |
| toolcall_only | 2 | 0.000 | 15.36 | 1.000 | Wrong schema vocabulary |
| mixed_brief_json | 2 | 1.000 | 19.40 | 2.000 | Clean ANSWER + JSON |
| toolcall | 2 | 1.000 | 25.95 | 11.000 | Perfect, clean final answer |
| json | 4 | n/a | 143.87 | 10.000 | json_01 = 376.99s outlier |
| fuzz | 20 | n/a | 49.21 | 10.000 | All 20 pass |
| mcq | 5 | n/a | 0.73 | 1.000 | 5/5 perfect |
Q4_K_M (quantized_llama_cpp)
| Family | N | Pass Rate | Δ vs F16 | Avg Secs | Bucket Score | Notes |
|---|---|---|---|---|---|---|
| json_multistep | 5 | 0.800 | 0.000 | 7.26 | 2.200 | Same single failure |
| stateful_followup | 2 | 1.000 | 0.000 | 1.09 | 2.000 | Clean JSON state |
| toolcall_only | 2 | 0.000 | 0.000 | 1.08 | 1.000 | Wrong schema vocabulary |
| mixed_brief_json | 2 | 1.000 | 0.000 | 1.01 | 2.000 | Clean ANSWER + JSON |
| toolcall | 2 | 1.000 | 0.000 | 1.64 | 11.000 | Perfect, clean final answer |
| json | 4 | n/a | n/a | 3.34 | 10.000 | All pass |
| fuzz | 20 | n/a | n/a | 2.63 | 10.000 | All 20 pass |
| mcq | 5 | n/a | n/a | 0.17 | 1.000 | 5/5 perfect |
Key Findings
Finding 1: Perfect Quantization Parity - Zero Degradation Across All Families
This is the only model in the PBH Applied Systems evaluated series where quantization produces zero measurable behavioral change. Every pass rate, every bucket score, every signal rate is identical between F16 and Q4_K_M:
| Family | F16 Pass Rate | Q4_K_M Pass Rate | Degradation |
|---|---|---|---|
| json_multistep | 0.800 | 0.800 | None |
| stateful_followup | 1.000 | 1.000 | None |
| toolcall_only | 0.000 | 0.000 | None |
| mixed_brief_json | 1.000 | 1.000 | None |
| toolcall | 1.000 | 1.000 | None |
| fuzz bucket | 10.000 | 10.000 | None |
| MCQ bucket | 1.000 | 1.000 | None |
What this means for deployment: The Q4_K_M variant is a fully faithful quantization for behavioral purposes. There is no capability tradeoff at this precision level for this model, only a hardware and speed benefit (21.1× faster, ~3× less VRAM).
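The parity claim above reduces to a per-family subtraction. A minimal sketch, using only the pass rates reported in this card; the dictionary names and the `degradation` helper are illustrative and are not part of quant_eval:

```python
# Pass rates copied from the per-family tables in this card.
F16_PASS = {
    "json_multistep": 0.800,
    "stateful_followup": 1.000,
    "toolcall_only": 0.000,
    "mixed_brief_json": 1.000,
    "toolcall": 1.000,
}
Q4_PASS = {
    "json_multistep": 0.800,
    "stateful_followup": 1.000,
    "toolcall_only": 0.000,
    "mixed_brief_json": 1.000,
    "toolcall": 1.000,
}


def degradation(baseline: dict, quantized: dict, tol: float = 1e-9) -> dict:
    """Return {family: baseline - quantized} for families whose rates differ."""
    return {
        family: round(baseline[family] - quantized[family], 3)
        for family in baseline
        if abs(baseline[family] - quantized[family]) > tol
    }


changed = degradation(F16_PASS, Q4_PASS)
print(changed or "no degradation in any family")
```

With the reported numbers the result is an empty dictionary. A positive value for a family would indicate that the quantized runner passed fewer cases than the baseline.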
Finding 2: toolcall - Bucket=11 at Both Precision Levels
toolcall achieves the maximum possible bucket score (11.000) at both runners. This means the tool call JSON is correctly formed, schema-valid, and the final answer is correct and cleanly extracted.
| Case | F16 Raw | Q4_K_M Raw | Expected |
|---|---|---|---|
| tool_01 | {...add(2,3)...} 5 | {...add(2,3)...} 5 | 5 ✓ |
| tool_02 | {...add(10,-4)...} 6 | {...add(10,-4)...} 6 | 6 ✓ |
No EOS token contamination. No role token prefix. No final_mismatch. This is the only model in the evaluated series where toolcall achieves a clean pass with correct final answers at both precision levels. Every other model in the series has either final_mismatch from EOS contamination (Qwen Q4_K_M series), role-token contamination (Qwen F16 series), or no answer emitted (Mistral series).
Finding 3: MCQ - Perfect 5/5 at Both Precision Levels
| Case | F16 | Q4_K_M |
|---|---|---|
| mcq_01 | ✓ B | ✓ B |
| mcq_02 | ✓ B | ✓ B |
| mcq_03 | ✓ C | ✓ C |
| mcq_04 | ✓ B | ✓ B |
| mcq_05 | ✓ B | ✓ B |
Perfect MCQ performance at both precision levels. No A-bias, no empty output, no invalid choices. The same MCQ fixture that causes failures across other models in the series (including mcq_02 which exhibits systematic A-bias in smaller models) is answered correctly here.
Finding 4: json_multistep - Single Precision-Invariant Failure
Only ms_easy_02 fails, identically on both runners, with oracle_equiv_ok=0, checks_consistent_ok=1. This is the same fixture that also fails for Qwen2.5-7B at both precision levels, suggesting a model-agnostic characteristic of this specific test case rather than a 14B capability gap.
| Case | F16 | Q4_K_M | Secs (F16) | Secs (Q4) |
|---|---|---|---|---|
| ms_easy_01 | Pass | Pass | 108.93 | 5.63 |
| ms_easy_02 | Fail | Fail | 125.48 | 6.78 |
| ms_med_01 | Pass | Pass | 149.31 | 8.05 |
| ms_med_02 | Pass | Pass | 144.80 | 7.86 |
| ms_hard_01 | Pass | Pass | 137.51 | 7.98 |
Finding 5: toolcall_only - Schema Vocabulary Differs by Runner
Both runners fail toolcall_only on args_ok, but with distinct wrong schemas:
| Runner | toolonly_01 raw | toolonly_02 raw |
|---|---|---|
| F16 | {"tool": "add", "left": 5, "right": 10} | {"tool": "add", "left": 25, "right": 75} |
| Q4_K_M | {"tool": "add", "input": {"x": 5, "y": 10}} | {"tool": "add", "input": {"numbers": [25, 75]}} |
The F16 model uses "left"/"right" as argument keys. The Q4_K_M model uses a nested "input" object with varying key names. Both use "tool" instead of "tool_name" as the outer key. Tool name recognition is perfect (1.000) at both runners: the model identifies "add" correctly. The arg schema vocabulary is the failure point in both cases. Providing the exact expected schema in the system prompt would resolve this.
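Where the prompt cannot be changed, the observed outputs could also be mapped onto the expected shape after generation. The sketch below is an illustration under assumptions: the target schema `{"tool_name": ..., "args": {"a": ..., "b": ...}}` is taken from the usage examples in this card, `normalize_tool_call` is a hypothetical helper, operands are assigned to `a` and `b` by position, and only the four variants listed in the table are handled.

```python
import json


def normalize_tool_call(raw: str) -> dict:
    """Map the observed two-operand tool-call variants onto tool_name/args."""
    obj = json.loads(raw)

    # Both runners emitted "tool" where the fixture expects "tool_name".
    name = obj.get("tool_name", obj.get("tool"))
    if name is None:
        raise ValueError("no tool name found")

    # Arguments may sit under "args", under "input", or flat at the top level.
    if isinstance(obj.get("args"), dict):
        source = obj["args"]
    elif isinstance(obj.get("input"), dict):
        source = obj["input"]
    else:
        source = {k: v for k, v in obj.items() if k not in ("tool", "tool_name")}

    # Either two scalar operands, or one key holding a list of operands.
    values = list(source.values())
    if len(values) == 1 and isinstance(values[0], list):
        values = list(values[0])
    if len(values) != 2:
        raise ValueError(f"expected two operands, got {values!r}")

    return {"tool_name": name, "args": {"a": values[0], "b": values[1]}}


print(normalize_tool_call('{"tool": "add", "input": {"numbers": [25, 75]}}'))
```

Positional mapping is harmless for a commutative tool such as add, but it would need explicit key handling for tools where argument order matters.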
Finding 6: F16 json_01 - 376-Second Outlier
json_01 at F16 takes 376.99 seconds, nearly 6× longer than the other three json cases (63 to 70 s). This is the most extreme single-case timing outlier in the evaluated series. The output is correct (bucket=10) and the model produces valid JSON placement decisions. The outlier reflects the 1M context window's capacity for extensive generation on certain inputs: the model appears to generate far more internal content before settling on the brief JSON answer. The Q4_K_M runner takes 3.38s on the same case.
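The "nearly 6×" figure can be cross-checked from the F16 table alone. The per-case times of the other three json cases are not listed individually, so this sketch backs their mean out of the reported family average; the variable names are illustrative.

```python
# Figures reported in this card for the F16 json family.
avg_json_f16_secs = 143.87   # mean over all 4 json cases
json_01_secs = 376.99        # the outlier case
n_cases = 4

# Mean of the remaining three cases, derived from the family average.
others_mean = (avg_json_f16_secs * n_cases - json_01_secs) / (n_cases - 1)
ratio = json_01_secs / others_mean

print(f"other cases: {others_mean:.2f} s on average, outlier ratio: {ratio:.1f}x")
```

The derived mean falls inside the 63 to 70 s range quoted above, and the ratio comes out at roughly 5.7.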
Signal-Level Diagnostics (Q4_K_M = F16, Identical)
json_multistep
| Signal | Rate | Notes |
|---|---|---|
| schema_ok | 1.000 | Perfect at both |
| checks_consistent_ok | 1.000 | Perfect at both |
| stop_semantics_ok | 1.000 | Perfect at both |
| oracle_equiv_ok | 0.800 | ms_easy_02 only, precision-invariant |
stateful_followup
| Signal | Rate |
|---|---|
| turn1_parse_ok | 1.000 |
| turn2_parse_ok | 1.000 |
| turn1_exact_match | 1.000 |
| turn2_exact_match | 1.000 |
toolcall_only
| Signal | Rate |
|---|---|
| tool_name_ok | 1.000 |
| args_ok | 0.000 |
mixed_brief_json
| Signal | Rate |
|---|---|
| answer_line_ok | 1.000 |
| json_parse_ok | 1.000 |
| schema_ok | 1.000 |
Recommended Use Cases
Deploy with Confidence (Q4_K_M)
- Full-document and long-context processing: The defining deployment advantage. 1M token context enables entire codebases, contracts, books, and conversation histories in a single context window. No other model in the evaluated series approaches this capability.
- Stateful multi-turn agents: Perfect 1.000 at Q4_K_M in 1.09 sec/case. Clean JSON state output.
- Multi-step planning with external validation: 0.800 pass rate with perfect internal consistency. Use with oracle validation for production reliability.
- Hybrid brief + JSON outputs: mixed_brief_json 1.000 at 1.01 sec/case.
- Tool-calling with response scaffolding and final answer: toolcall at bucket=11; tool dispatch valid, final answer correct, no cleanup required. The cleanest toolcall result in the series.
- MCQ and single-choice extraction: Perfect 5/5. No A-bias, no empty output.
- Structured JSON outputs (single-step): json and fuzz both bucket=10.000.
Use with Guardrails (Q4_K_M)
- Easy-difficulty multi-step planning: ms_easy_02 fails at both precision levels. Add oracle validation.
- Bare tool-call dispatch: toolcall_only fails on args schema vocabulary. Provide exact key names in system prompt.
- Long-context inference on constrained hardware: At 128K+ token context, VRAM requirements scale significantly beyond the ~12 GB baseline. Plan for multi-GPU or CPU offload configurations.
Context Window vs. VRAM Guide (Q4_K_M)
| Context Length | Approx. KV Cache | Total VRAM Needed | Recommended Hardware |
|---|---|---|---|
| 8K tokens | ~0.5 GB | ~12 GB | T4 16 GB · RTX 3080 |
| 32K tokens | ~2 GB | ~14 GB | T4 16 GB · A10G |
| 64K tokens | ~4 GB | ~16 GB | A10G 24 GB · RTX 4090 |
| 128K tokens | ~8 GB | ~20 GB | A10G 24 GB · RTX 4090 |
| 256K tokens | ~16 GB | ~28 GB | A100 40 GB · 2× A10G |
| 512K tokens | ~32 GB | ~44 GB | A100 80 GB · multi-GPU |
| 1M tokens | ~64 GB | ~76 GB | Multi-GPU · CPU offload |
Set n_ctx to your actual working context length, not the model maximum.
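The table rows follow a roughly linear pattern: about 0.5 GB of KV cache per 8K tokens on top of a base of about 12 GB. A minimal sketch that interpolates those figures for other context lengths follows. Both constants are read off this table rather than measured, `estimate_vram_gb` is a hypothetical helper, and real usage will vary with KV cache precision, batch size, and runtime overhead.

```python
# Constants inferred from the table above (approximations, not measurements).
KV_GB_PER_TOKEN = 0.5 / 8192   # ~0.5 GB of KV cache at 8K tokens
BASE_GB = 12.0                 # totals are ~12 GB plus KV cache from 32K upward


def estimate_vram_gb(n_ctx: int) -> tuple[float, float]:
    """Return (approx. KV cache GB, approx. total VRAM GB) for a context length."""
    kv_gb = n_ctx * KV_GB_PER_TOKEN
    return kv_gb, BASE_GB + kv_gb


for n_ctx in (32 * 1024, 128 * 1024, 1024 * 1024):
    kv_gb, total_gb = estimate_vram_gb(n_ctx)
    print(f"n_ctx={n_ctx}: KV ~{kv_gb:.1f} GB, total ~{total_gb:.1f} GB")
```

An estimate like this is only useful for picking a hardware tier before loading the model. Measured memory use on the target runtime should take precedence.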
Hardware Requirements
| Configuration | VRAM Required | Notes |
|---|---|---|
| Q4_K_M · 8K context | ~12 GB | T4 16 GB · RTX 3080 |
| Q4_K_M · 128K context | ~20 GB | A10G 24 GB · RTX 4090 |
| Q4_K_M · 1M context | ~76 GB | Multi-GPU / CPU offload |
| F16 (companion repo) · 8K context | ~32 GB | A100 40 GB · RTX 4090 |
Usage
Installation
pip install llama-cpp-python huggingface_hub
For GPU acceleration (CUDA):
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
Python - llama-cpp-python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="pbhappliedsystems/qwen-2.5-14B-instruct-1m-gguf-Q4-K-M",
    filename="qwen-2.5-14B-instruct-1m-gguf-Q4-K-M.gguf",
)

llm = Llama(
    model_path=model_path,
    n_ctx=32768,  # Set to actual working context; supports up to 1M
    n_gpu_layers=-1,
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a precise assistant. Follow instructions exactly and return structured outputs when requested.",
        },
        {
            "role": "user",
            "content": "Analyze the following document and return a JSON object with keys: summary, key_entities, risk_level, action_items.",
        },
    ],
    temperature=0.7,
    max_tokens=1024,
)
print(response["choices"][0]["message"]["content"])
For long-document processing (leveraging the 1M context window):
# Process a large document; adjust n_ctx to the actual document token count
with open("large_document.txt", "r") as f:
    document = f.read()

# Estimate token count (~4 chars per token)
estimated_tokens = len(document) // 4
context_size = min(max(estimated_tokens + 2048, 8192), 1048576)

llm_long = Llama(
    model_path=model_path,
    n_ctx=context_size,
    n_gpu_layers=-1,
    verbose=True,  # Monitor memory during large context loads
)

response = llm_long.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a document analysis expert."},
        {"role": "user", "content": f"Summarize the following document and extract all action items:\n\n{document}"},
    ],
    temperature=0.7,
    max_tokens=2048,
)
print(response["choices"][0]["message"]["content"])
For tool-calling (clean results; no EOS stripping required at Q4_K_M):
# quant_eval v7.21: toolcall bucket=11 at both runners; clean final answer, no cleanup needed
response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": (
                "You are a tool-calling assistant. Output the tool call as JSON, "
                'then on the next line output only the numeric result.\n'
                'Tool call format: {"tool_name": "<n>", "args": {"a": <n>, "b": <n>}}'
            ),
        },
        {"role": "user", "content": "Use the add tool to compute 10 minus 4."},
    ],
    temperature=0.7,
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
# No stripping required; output is clean
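To consume such a response programmatically, the JSON tool call and the trailing numeric answer still have to be separated. A minimal sketch, assuming the output is a single JSON object followed by a number, as the system prompt above requests; `split_tool_response` is a hypothetical helper and not part of llama-cpp-python:

```python
import json


def split_tool_response(text: str) -> tuple[dict, float]:
    """Split '<json tool call> <number>' into the parsed call and the answer."""
    stripped = text.strip()
    start = stripped.index("{")  # raises ValueError if no JSON object is present

    # raw_decode parses one JSON value and reports the index where it ended,
    # so whatever follows that index is the final answer.
    call, end = json.JSONDecoder().raw_decode(stripped, start)
    answer = float(stripped[end:].strip())
    return call, answer


sample = '{"tool_name": "add", "args": {"a": 10, "b": -4}}\n6'
call, answer = split_tool_response(sample)
print(call["tool_name"], call["args"], answer)
```

The function raises `ValueError` when the JSON object or the numeric answer is missing, which makes malformed responses easy to route to a retry.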
For bare tool-call dispatch with explicit schema enforcement:
import json

# Assumes `llm` is the Llama instance created in the example above.
def call_tool_bare(prompt: str, retries: int = 3) -> dict:
    """
    Bare tool dispatch with explicit schema.
    quant_eval v7.21: tool_name_ok=1.000, args_ok=0.000; model uses 'left'/'right' or nested 'input'.
    Explicit schema in system prompt resolves the vocabulary issue.
    """
    for attempt in range(retries):
        response = llm.create_chat_completion(
            messages=[
                {
                    "role": "system",
                    "content": (
                        'Respond ONLY with a JSON object using EXACTLY these keys:\n'
                        '{"tool_name": "add", "args": {"a": <integer>, "b": <integer>}}\n'
                        'No other text, no markdown, no explanation.'
                    ),
                },
                {"role": "user", "content": prompt},
            ],
            temperature=0.0,
            max_tokens=64,
        )
        raw = response["choices"][0]["message"]["content"].strip()
        try:
            parsed = json.loads(raw)
            assert "tool_name" in parsed and "args" in parsed
            assert "a" in parsed["args"] and "b" in parsed["args"]
            return parsed
        except (json.JSONDecodeError, AssertionError, KeyError, TypeError):
            # TypeError covers valid JSON that is not an object (e.g. a bare number)
            if attempt == retries - 1:
                raise ValueError(f"Tool call failed after {retries} attempts. Raw: {raw}")

result = call_tool_bare("Add 5 and 10.")
CLI - llama-cli
llama-cli \
  --model qwen-2.5-14B-instruct-1m-gguf-Q4-K-M.gguf \
  --chat-template chatml \
  --system-prompt "You are a precise assistant. Follow instructions exactly." \
  --prompt "Analyze the following and return a JSON object with keys: summary, risk_level, action_items." \
  --n-predict 1024 \
  --ctx-size 32768 \
  --n-gpu-layers -1 \
  --temp 0.15
For server deployment:
llama-server \
  --model qwen-2.5-14B-instruct-1m-gguf-Q4-K-M.gguf \
  --chat-template chatml \
  --ctx-size 32768 \
  --n-gpu-layers -1 \
  --port 8080 \
  --host 0.0.0.0
Query via the OpenAI-compatible API:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-required")
response = client.chat.completions.create(
    model="qwen-2.5-14B-instruct-1m-gguf-Q4-K-M",
    messages=[{"role": "user", "content": "Your prompt here"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
Evaluation Artifacts
The full per-case evaluation CSV (comparison_results_v7_21_Qwen2.5_14B_Instruct_1M_20260210_235131.csv) and rollup.json are published in this repository for independent verification.
Artifact Provenance
| Artifact | Format | Size | SHA256 |
|---|---|---|---|
| qwen-2.5-14B-instruct-1m-gguf-Q4-K-M.gguf | GGUF Q4_K_M | 8.99 GB | 5ad529ff2b1b192f31c8a638fe8756a0c628904e2ded797c11f9194216976973 |
| F16 (companion repo) | GGUF F16 | 29.5 GB | de08ea9c41234ef83b7aacf07f9ebc3cbaa20ca8aeb5f6417758a8798660aaa9 |
Both artifacts were produced from Qwen/Qwen2.5-14B-Instruct-1M using a custom-built llama.cpp conversion and quantization pipeline developed by PBH Applied Systems.
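A downloaded file can be checked against the published hash with the Python standard library. A minimal sketch; the expected digest is the Q4_K_M value from the table above, while the function names are illustrative:

```python
import hashlib

# SHA256 of the Q4_K_M artifact, as listed in the provenance table.
EXPECTED_SHA256 = "5ad529ff2b1b192f31c8a638fe8756a0c628904e2ded797c11f9194216976973"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so a multi-GB GGUF never sits fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's SHA256 matches the expected hex digest."""
    return sha256_of_file(path) == expected.lower()


# Usage, with the path returned by hf_hub_download:
# print(verify_artifact(model_path))
```

Hashing the 8.99 GB file takes a noticeable amount of time, so it is usually done once after download rather than on every load.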
Evaluation Methodology
quant_eval v7.21: proprietary behavioral evaluation harness, PBH Applied Systems.
Fixture set: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c079371b02a94f3c149eb78a6adc03dc16ff6833b964fbf4174f0)
| Family | Description | Pass Signals |
|---|---|---|
| fuzz | Property-based regression; structured placement correctness | schema_ok, constraints_ok |
| json | Single-step structured JSON with constraint rules | schema_ok, constraints_ok |
| json_multistep | Multi-step planning with self-check and oracle verification | schema_ok, checks_consistent_ok, stop_semantics_ok, oracle_equiv_ok |
| mcq | Multiple-choice extraction | choice_ok |
| stateful_followup | Two-turn state tracking; turn-2 correct given turn-1 | turn1/2_parse_ok, turn1/2_exact_match |
| mixed_brief_json | Hybrid: natural language answer + valid JSON block | answer_line_ok, json_parse_ok, schema_ok |
| toolcall | Tool call embedded in response; parse + schema validation | stage1_tool_parse_ok, stage1_tool_schema_ok |
| toolcall_only | Bare schema-only tool call; strict tool name + args check | tool_name_ok, args_ok |
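This card does not spell out how the signals combine into a pass. One plausible reading, offered purely as an assumption, is that a case passes only when every pass signal for its family equals 1. The sketch below applies that rule to the json_multistep signals reported in this card (only ms_easy_02 has oracle_equiv_ok=0); `case_passes` and `pass_rate` are illustrative names, not quant_eval functions.

```python
# Pass signals per family, copied from the methodology table (two families shown).
PASS_SIGNALS = {
    "json_multistep": [
        "schema_ok", "checks_consistent_ok", "stop_semantics_ok", "oracle_equiv_ok",
    ],
    "toolcall_only": ["tool_name_ok", "args_ok"],
}


def case_passes(family: str, signals: dict) -> bool:
    """Assumed rule: a case passes only if every listed signal is 1."""
    return all(signals.get(name, 0) == 1 for name in PASS_SIGNALS[family])


def pass_rate(family: str, cases: dict) -> float:
    """Fraction of cases in `cases` that pass under the assumed rule."""
    return sum(case_passes(family, s) for s in cases.values()) / len(cases)


# Reconstructed from the card: all signals are 1 except ms_easy_02's oracle check.
all_ok = {name: 1 for name in PASS_SIGNALS["json_multistep"]}
multistep_cases = {
    case: dict(all_ok)
    for case in ("ms_easy_01", "ms_easy_02", "ms_med_01", "ms_med_02", "ms_hard_01")
}
multistep_cases["ms_easy_02"]["oracle_equiv_ok"] = 0

print(pass_rate("json_multistep", multistep_cases))
```

Under this reading the sketch reproduces the reported 0.800 for json_multistep, and the toolcall_only signals (tool_name_ok=1, args_ok=0) give the reported 0.000.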
Evaluation hardware: NVIDIA RTX 4090 · Evaluation date: February 10, 2026 · Seed: 42
About quant_eval & This Evaluation Series
quant_eval is a proprietary behavioral evaluation harness developed by PBH Applied Systems, LLC. It measures real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning, not perplexity or leaderboard proxies. Every model published under pbhappliedsystems has been independently evaluated using quant_eval before being recommended for any production role.
See it in action: Live AI Agent Demo. The demo runs production-style agent workflows powered by open-weight models selected through the quant_eval evaluation pipeline.
Need a deployment recommendation? Not sure which quantization level is right for your hardware, latency target, or agent type? → pbhappliedsystems.com
Evaluated and published by PBH Applied Systems, LLC · patrick@pbhappliedsystems.com
About PBH Applied Systems
PBH Applied Systems, LLC is an Oklahoma City-based applied machine learning and AI systems company specializing in production-grade model evaluation, quantization pipelines, agentic AI infrastructure, and scalable AI-driven application development.
Patrick Hill, M.S.: Founder · Data Scientist · AI/ML Engineer · Author of Applied Machine Learning: Concepts, Tools, and Case Studies (required reading, UAT CSC 373)
Core Service Areas: LLM Optimization & Deployment · AI Evaluation Frameworks · Agentic AI Infrastructure · Scalable AI Application Development · ML Pipeline Design & Analytics · Model & Agent Cataloging
Work With PBH Applied Systems
Qwen2.5-14B-Instruct-1M Q4_K_M is the highest-scoring model in the evaluated series across two dimensions, the only model with zero quantization degradation across all behavioral families, and the only model with clean toolcall final answers at both precision levels. That combination (top scores, no quantization penalty, 1M context, and 21× speedup at Q4_K_M) is what systematic evaluation documents. It's not visible from a leaderboard score or a casual test.
Book a Scoping Call · Request an Evaluation Report, from $2,500
Connect
- pbhappliedsystems.com
- patrick@pbhappliedsystems.com
- YouTube
License
This GGUF repository inherits the license of the base model:
Apache 2.0 (Qwen/Qwen2.5-14B-Instruct-1M)
The quant_eval evaluation methodology, fixture set, and scoring framework are proprietary to PBH Applied Systems, LLC and are not included in this repository.
GGUF conversion, quantization, and behavioral evaluation performed by PBH Applied Systems, LLC · quant_eval v7.21 · Run ID: 20260210_235131