Qwen2.5-14B-Instruct-1M · GGUF F16
Converted and evaluated by PBH Applied Systems, LLC — Applied AI/ML Consulting · LLM Optimization & Deployment · Quantized AI Infrastructure
🔬 This repository is part of a production-oriented evaluation series. Every model published under pbhappliedsystems has been independently evaluated using quant_eval v7.21, a proprietary behavioral evaluation harness developed by PBH Applied Systems.
📌 This is the full-precision F16 baseline repository. The evaluated Q4_K_M deployment variant is published at pbhappliedsystems/qwen-2.5-14B-instruct-1m-gguf-Q4-K-M. That card documents the complete cross-series comparisons, context window VRAM guide, and deployment recommendations. The Q4_K_M variant is the recommended choice for all deployments: it achieves identical behavioral results at 21.1× faster inference.
Try the Live AI Agent Demo
Launch the PBH Applied Systems AI Agent Demo →
This model is part of the PBH Applied Systems evaluated model series that supports the live AI Agent Demo. The demo lets visitors interact with production-style agent workflows powered by open-weight language models evaluated through PBH Applied Systems' quant_eval framework.
The F16 model serves a different role than the Q4_K_M deployment variant. F16 is the full-precision baseline used to measure what the model can do before quantization. quant_eval then compares the quantized model against this baseline to identify which capabilities are preserved, which degrade, and which tasks require guardrails or a higher-precision deployment.
This comparison is central to the demo. It helps determine which model belongs in which agent role:
- Reasoning models are selected for planning, analysis, and auditable decision workflows.
- Document models are selected for long-context extraction, summarization, and structured Q&A.
- Code models are selected for task completion, structured output, API scaffolding, and automation workflows.
- Quantized variants are selected when they preserve enough behavior to reduce cost, latency, and GPU requirements.
- F16 variants remain important when maximum fidelity, cleaner tool execution, or reduced quantization risk matters more than speed or cost.
The live demo shows the deployment side of that process. The F16 card documents the reference behavior. The Q4_K_M card shows what changes after compression. Together, they explain how PBH Applied Systems uses quant_eval to choose the correct LLM for the correct agent type instead of guessing from model size or leaderboard reputation.
Model Description
This repository contains the full-precision F16 GGUF of Qwen/Qwen2.5-14B-Instruct-1M, a 14-billion parameter instruction-tuned model from Alibaba Cloud featuring a 1,000,000-token context window.
In the PBH Applied Systems evaluation pipeline, this F16 run (20260210_215029) operated in cache-generation mode (skip_quant=true), producing the full_weight_cache.json used as the reference baseline for the subsequent Q4_K_M comparison run (20260210_235131). The evaluation results here are the source of the F16 baseline data shown in the Q4_K_M card — timing profiles and raw outputs are identical across both runs, confirming clean cache reuse and full run integrity.
Key Characteristics
- Parameters: 14B
- Format: GGUF F16 (full precision)
- File size: 29.5 GB
- SHA256: de08ea9c41234ef83b7aacf07f9ebc3cbaa20ca8aeb5f6417758a8798660aaa9
- Context window: 1,000,000 tokens
- Minimum VRAM (GPU inference): ~32 GB (short context) — scales with context length
- Recommended GPU tier: A100 40 GB · 2× RTX 4090
- Inference speed (eval hardware): avg 56.623 sec/case on RTX 4090
- License: Apache 2.0
On inference speed: The F16 model averages 56.6 sec/case — nearly one minute per structured inference task on an RTX 4090. The json_01 case takes 376.99 seconds (over 6 minutes). For this model, the Q4_K_M variant (2.683 sec/case average) is the operationally viable choice on all but the highest-VRAM multi-GPU setups. Both produce identical behavioral results.
PBH Applied Systems Evaluation — quant_eval v7.21
Evaluation conducted by PBH Applied Systems, LLC using quant_eval v7.21
Run ID: 20260210_215029 · Fixtures: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c...) · Seed: 42
Hardware: NVIDIA RTX 4090 · Runner: full_weight_transformers (F16 only) · Total rows: 42
Per-Family Pass Rates — F16 (full_weight_transformers)
| Family | N | Pass Rate | Avg Secs | Bucket Score | Q4_K_M Parity |
|---|---|---|---|---|---|
| json_multistep | 5 | 0.800 | 133.21 | 2.200 | ✅ Identical |
| stateful_followup | 2 | 1.000 | 13.75 | 2.000 | ✅ Identical |
| toolcall_only | 2 | 0.000 | 15.36 | 1.000 | ✅ Identical |
| mixed_brief_json | 2 | 1.000 | 19.40 | 2.000 | ✅ Identical |
| toolcall | 2 | 1.000 | 25.95 | 11.000 | ✅ Identical |
| json | 4 | n/a | 143.87 | 10.000 | ✅ Identical |
| fuzz | 20 | n/a | 49.21 | 10.000 | ✅ Identical |
| mcq | 5 | n/a | 0.73 | 1.000 | ✅ Identical |
Every family result is identical between F16 and Q4_K_M. This model is the only one in the evaluated series with zero measurable quantization degradation across all behavioral families.
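As a quick consistency check, the per-family rows above can be recombined into the run-level totals. The sketch below hard-codes the N and Avg Secs columns from the table (the names `FAMILIES` and `summarize` are illustrative and not part of quant_eval) and recomputes the row count and the case-weighted mean latency. Because the per-family averages are rounded, the result should only approximately match the reported 42 rows and 56.623 sec/case.

```python
# (family, N, average seconds per case) copied from the per-family table.
FAMILIES = [
    ("json_multistep", 5, 133.21),
    ("stateful_followup", 2, 13.75),
    ("toolcall_only", 2, 15.36),
    ("mixed_brief_json", 2, 19.40),
    ("toolcall", 2, 25.95),
    ("json", 4, 143.87),
    ("fuzz", 20, 49.21),
    ("mcq", 5, 0.73),
]


def summarize(rows):
    """Return (total cases, case-weighted mean seconds per case)."""
    total = sum(n for _, n, _ in rows)
    weighted_mean = sum(n * secs for _, n, secs in rows) / total
    return total, weighted_mean


total_cases, mean_secs = summarize(FAMILIES)
print(f"cases={total_cases}, weighted mean={mean_secs:.2f} sec/case")
```

The weighted mean is dominated by the fuzz, json, and json_multistep families, which together account for 29 of the 42 cases and nearly all of the wall-clock time.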
F16-Specific Observations
toolcall — bucket=11, Clean Final Answers at F16
Both toolcall cases pass with the maximum bucket score at F16 — no role-token contamination, no EOS tokens, no missing answers:
| Case | Raw Output | Expected | Result |
|---|---|---|---|
| tool_01 | {...add(2,3)...} 5 | 5 | ✅ bucket=11 |
| tool_02 | {...add(10,-4)...} 6 | 6 | ✅ bucket=11 |
This matches the Q4_K_M runner exactly. Qwen2.5-14B-Instruct-1M at F16 does not exhibit the role-token contamination (PARTICULAR: annotation, garbled prefixes) documented in the Qwen2.5-7B F16 evaluation, nor the EOS contamination of the smaller Qwen Q4_K_M variants. Clean output requires no post-processing at this precision level.
toolcall_only — "left"/"right" Schema Vocabulary at F16
Both toolcall_only cases use "left"/"right" as argument keys at F16:
| Case | Raw Output |
|---|---|
| toolonly_01 | {"tool": "add", "left": 5, "right": 10} |
| toolonly_02 | {"tool": "add", "left": 25, "right": 75} |
In contrast, Q4_K_M uses a nested "input" object. Both runners fail args_ok, but with different wrong schemas. Specifying explicit key names in the system prompt resolves this at both precision levels.
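The fix recommended above is prompt-side (explicit key names). As a complementary, purely illustrative safeguard, a caller could also normalize the flat "left"/"right" shape after the fact. The sketch below is an assumption, not part of quant_eval: `normalize_add_call` is a hypothetical helper, the target shape is the `tool_name`/`args` format used in the Usage section, and the Q4_K_M nested "input" shape is not handled because its exact layout is not shown in this card.

```python
import json


def normalize_add_call(raw: str) -> dict:
    """Map the flat {"tool", "left", "right"} shape onto
    {"tool_name": ..., "args": {"a": ..., "b": ...}}."""
    call = json.loads(raw)
    if not isinstance(call, dict):
        raise ValueError(f"Expected a JSON object, got: {raw!r}")
    if "tool_name" in call and "args" in call:
        return call  # already in the target shape
    if {"tool", "left", "right"} <= call.keys():
        return {
            "tool_name": call["tool"],
            "args": {"a": call["left"], "b": call["right"]},
        }
    raise ValueError(f"Unrecognized tool-call shape: {raw!r}")


print(normalize_add_call('{"tool": "add", "left": 5, "right": 10}'))
```

Raising on unknown shapes rather than guessing keeps a silent schema drift from turning into a wrong tool invocation.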
MCQ — Perfect 5/5 at F16
All five MCQ cases return a clean single-character answer in ~0.73 seconds:
mcq_01: B | mcq_02: B | mcq_03: C | mcq_04: B | mcq_05: B
No empty output, no invalid choices, no A-bias. At F16 MCQ is the fastest family in this run — the 1M context window doesn't affect short-response tasks.
json_01 — 376.99-Second Outlier
json_01 at F16 takes 376.99 seconds while json_02 through json_04 run 63–70 seconds each. The output is correct (bucket=10). This extreme variance is a characteristic of the 1M context window at full precision — certain inputs trigger substantially longer generation sequences before the model settles on its brief JSON output. The Q4_K_M runner takes 3.38 seconds on the same case, confirming this is a precision + context-window interaction, not a fixture complexity issue.
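The figures in this paragraph can be cross-checked against the per-family table. The sketch below uses only numbers stated in this card (the json family average of 143.87 seconds over 4 cases, the 376.99-second json_01 time, and the 3.38-second Q4_K_M time); the variable names are illustrative. It derives the implied mean of the three remaining json cases and the per-case slowdown on json_01.

```python
family_avg_secs = 143.87  # json family, Avg Secs column (N = 4)
n_cases = 4
json_01_f16_secs = 376.99
json_01_q4_secs = 3.38

# Mean of json_02..json_04 implied by the family average.
rest_mean = (family_avg_secs * n_cases - json_01_f16_secs) / (n_cases - 1)

# How much slower F16 is than Q4_K_M on this single case.
slowdown = json_01_f16_secs / json_01_q4_secs

print(f"implied mean of json_02..04: {rest_mean:.2f} s")
print(f"json_01 F16 vs Q4_K_M: {slowdown:.1f}x slower")
```

The implied mean falls inside the stated 63–70 second range, and the json_01 slowdown is several times larger than the run-wide 21.1× ratio, which is what makes this case an outlier.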
stateful_followup — No Turn-2 Contamination
Unlike the Qwen2.5-7B F16 evaluation (which showed PARTICULAR: annotation hallucinations on turn-2), this model produces clean JSON state at both turns with no appended text:
| Case | Turn 1 | Turn 2 |
|---|---|---|
| state_01 | {"counter": 2} | {"counter": 5} |
| state_02 | {"items": ["a", "b"]} | {"items": ["a", "b", "c"]} |
The F16 Transformers runner handles this model's stateful outputs cleanly.
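One way to express "clean JSON state with no appended text" as a check is to require that the entire reply parses as a single JSON object. The sketch below is a minimal illustration under that assumption; `is_clean_json_state` is a hypothetical helper, not the quant_eval scoring logic, and the contaminated string in the example is invented for demonstration.

```python
import json


def is_clean_json_state(text: str) -> bool:
    """True when the whole reply is one JSON object and nothing else."""
    try:
        return isinstance(json.loads(text.strip()), dict)
    except json.JSONDecodeError:
        # Trailing prose after the object makes json.loads fail ("Extra data").
        return False


print(is_clean_json_state('{"counter": 5}'))                  # clean reply
print(is_clean_json_state('{"counter": 5} PARTICULAR: note'))  # appended text
```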
F16 vs. Q4_K_M — Deployment Decision
| Dimension | F16 (this repo) | Q4_K_M |
|---|---|---|
| VRAM (8K context) | ~32 GB | ~12 GB |
| VRAM (128K context) | ~48 GB | ~20 GB |
| Avg inference time | 56.623 sec/case | 2.683 sec/case |
| Speed ratio | 1.0× (baseline) | 21.1× faster |
| All family pass rates | Same | Same |
| Toolcall final answer | Clean (bucket=11) | Clean (bucket=11) |
| MCQ | 5/5 | 5/5 |
| Behavioral difference | None | None |
For every practical deployment scenario, Q4_K_M is the correct choice. It achieves the same results about 21× faster at roughly one third of the VRAM (8K context). F16 is appropriate only when: (1) you have 32+ GB GPU VRAM available, (2) you require full-weight provenance for compliance or reproducibility auditing, or (3) you need the F16 baseline cache for a subsequent comparison evaluation run.
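The ratios quoted in this section follow directly from the table. A short worked calculation, using only the figures above (variable names are illustrative):

```python
f16_secs, q4_secs = 56.623, 2.683  # average sec/case from the table

speedup = f16_secs / q4_secs       # Q4_K_M speed relative to F16
vram_ratio_8k = 32 / 12            # F16 vs Q4_K_M VRAM at 8K context
vram_ratio_128k = 48 / 20          # F16 vs Q4_K_M VRAM at 128K context

print(f"speedup: {speedup:.1f}x")
print(f"VRAM ratio at 8K: {vram_ratio_8k:.1f}x, at 128K: {vram_ratio_128k:.1f}x")
```

The VRAM advantage narrows at longer contexts because the KV cache, which grows with context length, makes up a larger share of the total at 128K than at 8K.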
Hardware Requirements
| Configuration | VRAM Required | Notes |
|---|---|---|
| F16 (this repo) · 8K context | ~32 GB | 29.5 GB model + KV cache |
| F16 · 32K context | ~36 GB | Minimum A100 40 GB |
| F16 · 128K context | ~48 GB | A100 80 GB or multi-GPU |
| Q4_K_M (companion repo) · 8K | ~12 GB | 8.99 GB model + KV cache |
| Q4_K_M · 128K context | ~20 GB | A10G 24 GB · RTX 4090 |
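To make the table easier to apply, the sketch below filters the listed configurations by available VRAM. It is an illustration only: `CONFIGS` and `configs_that_fit` are hypothetical names, the figures are the approximate values from the table, and no headroom for other processes is included.

```python
# (configuration, approximate VRAM required in GB) from the table above.
CONFIGS = [
    ("F16 · 8K context", 32),
    ("F16 · 32K context", 36),
    ("F16 · 128K context", 48),
    ("Q4_K_M · 8K context", 12),
    ("Q4_K_M · 128K context", 20),
]


def configs_that_fit(vram_gb: float) -> list[str]:
    """Return the configurations whose stated requirement fits in vram_gb."""
    return [name for name, required in CONFIGS if required <= vram_gb]


for budget in (24, 40, 80):
    print(budget, "GB:", configs_that_fit(budget))
```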
Usage
Installation
pip install llama-cpp-python huggingface_hub
For GPU acceleration (CUDA):
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
Python — llama-cpp-python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Note: 29.5 GB download; ensure sufficient disk space and ~32 GB VRAM
model_path = hf_hub_download(
    repo_id="pbhappliedsystems/qwen-2.5-14B-instruct-1m-gguf-F16",
    filename="qwen-2.5-14B-instruct-1m-gguf-F16.gguf",
)

llm = Llama(
    model_path=model_path,
    n_ctx=32768,      # Set to actual working context; supports up to 1M
    n_gpu_layers=-1,  # Offload all layers to the GPU
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a precise assistant. Follow instructions exactly.",
        },
        {
            "role": "user",
            "content": "Analyze the following and return a JSON object with keys: summary, risk_level, action_items.",
        },
    ],
    temperature=0.7,
    max_tokens=1024,
)
print(response["choices"][0]["message"]["content"])
For tool-calling (no EOS stripping required — output is clean at F16):
# quant_eval v7.21: toolcall bucket=11, clean final answer, no post-processing needed
response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": (
                "You are a tool-calling assistant. Output the tool call as JSON, "
                "then on the next line output only the numeric result.\n"
                'Tool call format: {"tool_name": "<n>", "args": {"a": <n>, "b": <n>}}'
            ),
        },
        {"role": "user", "content": "Use the add tool to compute 10 minus 4."},
    ],
    temperature=0.7,
    max_tokens=128,
)
# No stripping required at F16; output is clean
print(response["choices"][0]["message"]["content"])
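The reply in this format is a JSON tool call followed by the numeric result. A minimal sketch of how a caller might separate the two is shown below, assuming the JSON object comes first and everything after it is the number. `split_tool_output` is a hypothetical helper; it relies on the standard library's `json.JSONDecoder.raw_decode`, which parses one JSON value from the start of a string and reports where it ended.

```python
import json


def split_tool_output(text: str):
    """Split '<JSON tool call> <numeric result>' into (call_dict, result)."""
    text = text.strip()
    # raw_decode returns the parsed value and the index just past it.
    call, end = json.JSONDecoder().raw_decode(text)
    remainder = text[end:].strip()
    # float() raises ValueError if the numeric result is missing or garbled.
    return call, float(remainder)


sample = '{"tool_name": "add", "args": {"a": 10, "b": -4}}\n6'
call, result = split_tool_output(sample)
print(call["tool_name"], call["args"], result)
```

Letting both parse steps raise on malformed input means a contaminated reply is surfaced as an error instead of being silently accepted.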
For bare tool-call dispatch with schema enforcement:
import json


def call_tool_bare(llm, prompt: str, retries: int = 3) -> dict:
    """
    Explicit schema enforcement for toolcall_only.
    quant_eval v7.21: F16 uses 'left'/'right' keys without schema guidance.
    System prompt specifying exact keys resolves the vocabulary mismatch.
    """
    raw = ""
    for _ in range(retries):
        response = llm.create_chat_completion(
            messages=[
                {
                    "role": "system",
                    "content": (
                        'Respond ONLY with a JSON object using EXACTLY these keys:\n'
                        '{"tool_name": "add", "args": {"a": <integer>, "b": <integer>}}\n'
                        'No other text, no markdown.'
                    ),
                },
                {"role": "user", "content": prompt},
            ],
            temperature=0.0,
            max_tokens=64,
        )
        raw = response["choices"][0]["message"]["content"].strip()
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Not valid JSON; try again
        # Explicit checks instead of assert, which is skipped under `python -O`
        if (
            isinstance(parsed, dict)
            and "tool_name" in parsed
            and isinstance(parsed.get("args"), dict)
            and {"a", "b"} <= parsed["args"].keys()
        ):
            return parsed
    # Raised once all attempts are exhausted (including retries <= 0)
    raise ValueError(f"Tool call failed after {retries} attempts. Raw: {raw}")
CLI — llama-cli
llama-cli \
--model qwen-2.5-14B-instruct-1m-gguf-F16.gguf \
--chat-template qwen2 \
--system-prompt "You are a precise assistant. Follow instructions exactly." \
--prompt "Return a JSON object with keys: summary, risk_level, action_items." \
--n-predict 1024 \
--ctx-size 32768 \
--n-gpu-layers -1 \
--temp 0.7
Artifact Provenance
| Artifact | Format | Size | SHA256 |
|---|---|---|---|
| qwen-2.5-14B-instruct-1m-gguf-F16.gguf | GGUF F16 | 29.5 GB | de08ea9c41234ef83b7aacf07f9ebc3cbaa20ca8aeb5f6417758a8798660aaa9 |
| Q4_K_M (companion repo) | GGUF Q4_K_M | 8.99 GB | 5ad529ff2b1b192f31c8a638fe8756a0c628904e2ded797c11f9194216976973 |
The F16 GGUF was converted from Qwen/Qwen2.5-14B-Instruct-1M using a custom-built llama.cpp conversion pipeline developed by PBH Applied Systems.
Two-pass architecture: This F16 run (20260210_215029) operated in cache-generation mode (skip_quant=true). The resulting full_weight_cache.json was used as the reference baseline for the Q4_K_M comparison run (20260210_235131). Timing identity between this run and the F16 baseline entries in the comparison run confirms clean cache reuse and run integrity.
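Since the provenance table publishes a SHA256 for each artifact, a downloaded file can be checked against it before use. The sketch below is one way to do that with the standard library, reading in chunks so the 29.5 GB file is never held in memory at once. The helper name `sha256_of_file` and the local file path in the comment are assumptions.

```python
import hashlib

# SHA256 of the F16 GGUF as listed in the provenance table.
EXPECTED_SHA256 = "de08ea9c41234ef83b7aacf07f9ebc3cbaa20ca8aeb5f6417758a8798660aaa9"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 of a file, streaming it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example (path is wherever the GGUF was downloaded):
# ok = sha256_of_file("qwen-2.5-14B-instruct-1m-gguf-F16.gguf") == EXPECTED_SHA256
```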
Evaluation Methodology
quant_eval v7.21 — proprietary behavioral evaluation harness, PBH Applied Systems.
Fixture set: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c079371b02a94f3c149eb78a6adc03dc16ff6833b964fbf4174f0)
Evaluation hardware: NVIDIA RTX 4090 · F16 evaluation date: February 10, 2026 · Seed: 42
🔬 About quant_eval & This Evaluation Series
quant_eval is a proprietary behavioral evaluation harness developed by PBH Applied Systems, LLC. It measures real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning — not perplexity or leaderboard proxies. Every model published under pbhappliedsystems has been independently evaluated using quant_eval before being recommended for any production role.
See it in action: Live AI Agent Demo → The demo runs production-style agent workflows powered by open-weight models selected through the quant_eval evaluation pipeline.
Need a deployment recommendation? Not sure which quantization level is right for your hardware, latency target, or agent type? → pbhappliedsystems.com
Evaluated and published by PBH Applied Systems, LLC · patrick@pbhappliedsystems.com
About PBH Applied Systems
PBH Applied Systems, LLC is an Oklahoma City–based applied machine learning and AI systems company specializing in production-grade model evaluation, quantization pipelines, agentic AI infrastructure, and scalable AI-driven application development.
Patrick Hill, M.S. — Founder · Data Scientist · AI/ML Engineer · Author of Applied Machine Learning: Concepts, Tools, and Case Studies (required reading, UAT CSC 373)
📞 Work With PBH Applied Systems
The F16 baseline for this model exists because proper quantization evaluation requires it — you cannot measure what Q4_K_M preserves or degrades without a verified full-precision reference. For this model, the answer is: zero degradation. That finding only becomes a deployable fact when both runs exist and can be compared against the same fixture set.
👉 Book a Scoping Call · 👉 Request an Evaluation Report — from $2,500
Connect
| 🌐 | pbhappliedsystems.com |
| 📧 | patrick@pbhappliedsystems.com |
| ▶️ | YouTube |
License
This GGUF repository inherits the license of the base model:
Apache 2.0 — Qwen/Qwen2.5-14B-Instruct-1M
The quant_eval evaluation methodology, fixture set, and scoring framework are proprietary to PBH Applied Systems, LLC and are not included in this repository.
GGUF conversion and behavioral evaluation performed by PBH Applied Systems, LLC · quant_eval v7.21 · F16 Run ID: 20260210_215029