license: apache-2.0
tags:
  - gguf
  - qwen
  - qwen3
  - qwen3-14b
  - qwen3-14b-gguf
  - llama.cpp
  - quantized
  - text-generation
  - reasoning
  - agent
  - multilingual
  - imatrix
  - q3_hifi
  - q4_hifi
  - q5_hifi
base_model: Qwen/Qwen3-14B
author: geoffmunn
pipeline_tag: text-generation
language:
  - en
  - zh
  - es
  - fr
  - de
  - ru
  - ar
  - ja
  - ko
  - hi

Qwen3-14B-f16-GGUF

This is a GGUF-quantized version of the Qwen/Qwen3-14B language model, a 14-billion-parameter LLM suited to deep reasoning, research-grade accuracy, and autonomous workflows. Converted for use with llama.cpp, LM Studio, OpenWebUI, GPT4All, and more.

Why Use a 14B Model?

The Qwen3-14B model delivers serious intelligence in a locally runnable package, offering near-flagship performance while remaining feasible to run on a single high-end consumer GPU or a well-equipped CPU setup. It’s the optimal choice when you need strong reasoning, robust code generation, and deep language understanding—without relying on the cloud or massive infrastructure.

Highlights:

State-of-the-art performance among open 14B-class models, excelling in reasoning, math, coding, and multilingual tasks
Efficient inference with quantization: runs on a 24 GB GPU (e.g., RTX 4090) or even CPU with quantized GGUF/AWQ variants (~12–14 GB RAM usage)
Strong contextual handling: supports long inputs and complex multi-step workflows, ideal for agentic or RAG-based systems
Fully open and commercially usable, giving you full control over deployment and customization

It’s ideal for:

Self-hosted AI assistants that understand nuance, remember context, and generate high-quality responses
On-prem development environments needing local code completion, documentation, or debugging
Private RAG or enterprise applications requiring accuracy, reliability, and data sovereignty
Researchers and developers seeking a powerful, open-weight alternative to closed 10B–20B models

Choose Qwen3-14B when you’ve outgrown 7B–8B models but still want to run efficiently offline—balancing capability, control, and cost without sacrificing quality.

Qwen3 14B Quantization Guide: Cross-Bit Summary & Recommendations

Executive Summary

At 14B scale, quantization quality is exceptional across all bit widths—models are inherently resilient to compression, with even Q3_K achieving near-lossless fidelity (+2.5% loss with imatrix). All variants deliver production-ready quality, making 14B the "sweet spot" where aggressive quantization meets robust model architecture. The choice depends entirely on your constraints:

Quantization	Best Variant (+ imatrix)	Quality vs F16	File Size	Speed	Memory
Q5_K	Q5_K_M + imatrix	+0.59% (best)	9.55 GiB	63.81 TPS	10,021 MiB
Q4_K	Q4_K_M + imatrix	+1.2%	8.38 GiB	72.89 TPS	8,581 MiB
Q3_K	Q3_K_HIFI + imatrix	+2.5%	7.93 GiB	63.93 TPS	8,120 MiB

💡 Critical insight: 14B models quantize superbly—even Q3_K_HIFI + imatrix achieves only +2.5% precision loss, making 3-bit quantization viable for production use. imatrix provides modest but valuable gains, though Q4_K_HIFI is uniquely harmed by imatrix (+0.6% degradation).
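
The percentages in this guide are relative changes against a baseline (usually F16). As a minimal sketch of that arithmetic, the snippet below recomputes three figures from numbers quoted in this guide. The helper name `percent_change` is illustrative only and is not part of any tool used to produce the measurements.

```python
def percent_change(value: float, baseline: float) -> float:
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# F16 baseline figures quoted in this guide.
F16_TPS = 25.73
F16_MEM_MIB = 28_170

# Q5_K_M + imatrix: speed gain and memory reduction versus F16.
q5_speed_gain = percent_change(63.81, F16_TPS)
q5_mem_reduction = -percent_change(10_021, F16_MEM_MIB)

# Q4_K_HIFI: perplexity change when imatrix is applied (9.0847 -> 9.1393).
hifi_ppl_change = percent_change(9.1393, 9.0847)

print(f"Q5_K_M speed vs F16: +{q5_speed_gain:.0f}%")
print(f"Q5_K_M memory reduction vs F16: {q5_mem_reduction:.1f}%")
print(f"Q4_K_HIFI PPL change with imatrix: {hifi_ppl_change:+.1f}%")
```

For perplexity, a positive change means worse quality; for TPS, a positive change means faster.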

Bit-Width Recommendations by Use Case

✅ Quality-Critical Applications

→ Q5_K_M + imatrix

Best perplexity at 9.0680 PPL (+0.59% vs F16) — near-lossless fidelity
64.4% memory reduction (10,021 MiB vs 28,170 MiB)
148% faster than F16 (63.81 TPS vs 25.73 TPS)
Standard llama.cpp compatibility — no custom builds needed
⚠️ Avoid Q5_K_HIFI — provides no measurable advantage over Q5_K_M (+0.02% worse with imatrix) while requiring custom build and 2.3% more memory

⚖️ Best Overall Balance (Recommended Default)

→ Q4_K_M + imatrix

Excellent +1.2% precision loss vs F16 (PPL 9.1247)
Strong 72.89 TPS speed (+183% vs F16)
Compact 8.38 GiB file size (69.5% smaller than F16)
Standard llama.cpp compatibility — universal toolchain support
Ideal for most development and production scenarios

🚀 Maximum Speed / Minimum Size

→ Q3_K_S + imatrix

Fastest variant at 91.32 TPS (+255% vs F16)
Smallest footprint at 6.19 GiB (77.5% memory reduction)
Acceptable +6.5% precision loss with imatrix (unusable at +7.7% without)
⚠️ Never use Q3_K_S without imatrix — quality degrades severely

📱 Extreme Memory Constraints (< 8 GiB)

→ Q3_K_S + imatrix

Absolute smallest runtime at 6,339 MiB
Smallest runtime option under an 8 GiB budget; Q3_K_M + imatrix (6,973 MiB) also fits when quality matters more
+6.5% quality loss acceptable for non-critical tasks

💎 Near-Lossless 3-Bit Option

→ Q3_K_HIFI + imatrix

Surprisingly good quality at +2.5% loss — production-ready for Q3
71.2% memory reduction (8,120 MiB)
Unique value: When you need Q3 size/speed but can't accept Q3_K_S quality
⚠️ 23% slower than Q3_K_M — significant speed trade-off

Critical Warnings for 14B Scale

⚠️ Q4_K_HIFI + imatrix is counterproductive — imatrix degrades quality by +0.6% (9.0847 → 9.1393 PPL). This is unique to 14B scale.

Without imatrix: Q4_K_HIFI is best Q4 quality (+0.8% vs F16)
With imatrix: Q4_K_M is best Q4 quality (+1.2% vs F16)
Never use imatrix with Q4_K_HIFI at 14B

⚠️ Q5_K_HIFI provides zero advantage at 14B:

Quality is worse than Q5_K_M with imatrix (+0.61% vs +0.59%)
Costs +467 MiB memory (+4.8% overhead) and requires custom build
Skip it entirely — Q5_K_M is strictly superior for production use

⚠️ All Q3_K variants are production-ready — even Q3_K_S with imatrix (+6.5% loss) remains usable, a dramatic improvement over smaller scales where Q3 often fails.

Q3_K_HIFI without imatrix: +2.6% loss (excellent)
Q3_K_M with imatrix: +2.9% loss (excellent)
This is the smallest scale where Q3 quantization is reliably viable

⚠️ imatrix impact is minimal at 14B — Unlike smaller models where imatrix recovers 60–78% of lost precision, at 14B the gains are modest (0.1–2.6%):

Q5_K variants: +1.1–1.3% improvement
Q4_K_M: +0.1% improvement (negligible)
Q4_K_S: +0.5% improvement
Q3_K_HIFI: +0.1% improvement (effectively no change; already near-perfect)

Memory Budget Guide

Available VRAM	Recommended Variant	Expected Quality	Why
< 6.5 GiB	Q3_K_S + imatrix	PPL 9.60, +6.5% loss	Only option that fits; quality acceptable for non-critical tasks
6.5 – 8.2 GiB	Q3_K_M + imatrix	PPL 9.28, +2.9% loss ✅	Best Q3 balance; production-ready quality
8.2 – 10.1 GiB	Q4_K_M + imatrix	PPL 9.12, +1.2% loss ✅	Best overall balance; standard compatibility
10.1 – 12.0 GiB	Q5_K_M + imatrix	PPL 9.07, +0.59% loss ✅	Near-lossless quality; best precision available
> 12.0 GiB	Q5_K_M + imatrix or F16	PPL 9.07 or 9.01	F16 only if absolute precision required
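
The table above can be read as a simple lookup. The sketch below encodes it in Python; the function name `recommend_variant` is hypothetical, and treating each upper bound as exclusive is an assumption, since the table does not say which row a boundary value belongs to.

```python
# (exclusive upper bound in GiB, recommended variant), taken from the table above.
BUDGET_TABLE = [
    (6.5, "Q3_K_S + imatrix"),
    (8.2, "Q3_K_M + imatrix"),
    (10.1, "Q4_K_M + imatrix"),
    (12.0, "Q5_K_M + imatrix"),
]


def recommend_variant(vram_gib: float) -> str:
    """Return the variant the Memory Budget Guide lists for this much VRAM."""
    for upper_bound, variant in BUDGET_TABLE:
        if vram_gib < upper_bound:
            return variant
    # 12.0 GiB and above: the table lists Q5_K_M + imatrix or F16.
    return "Q5_K_M + imatrix or F16"


for budget in (6.0, 8.0, 9.0, 11.0, 24.0):
    print(f"{budget:>5.1f} GiB -> {recommend_variant(budget)}")
```

The lookup does not check that the model actually fits: Q3_K_S + imatrix still needs about 6,339 MiB at runtime, so very small budgets are out of scope for this sketch.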

Cross-Bit Performance Comparison

Priority	Q3_K Best	Q4_K Best	Q5_K Best	Winner
Quality (with imat)	Q3_K_HIFI (+2.5%)	Q4_K_M (+1.2%)	Q5_K_M (+0.59%) ✅	Q5_K_M
Quality (no imat)	Q3_K_HIFI (+2.6%)	Q4_K_HIFI (+0.8%) ✅	Q5_K_S (+1.84%)	Q4_K_HIFI
Speed	Q3_K_S (91.32 TPS) ✅	Q4_K_S (76.34 TPS)	Q5_K_S (65.40 TPS)	Q3_K_S
Smallest Size	Q3_K_S (6.19 GiB) ✅	Q4_K_S (7.98 GiB)	Q5_K_S (9.33 GiB)	Q3_K_S
Best Balance	Q3_K_M + imat	Q4_K_M + imat ✅	Q5_K_M + imat	Q4_K_M

✅ = Recommended for general use
⚠️ = Context-dependent (see warnings above)

Scale-Specific Insights: Why 14B Quantizes So Well

Model redundancy threshold: 14B represents the inflection point where parameter count provides sufficient redundancy that quantization errors average out rather than accumulating. Below 8B, quality degrades more rapidly; above 14B, gains plateau.
Q3_K viability threshold: 14B is the smallest scale where Q3_K_HIFI achieves truly production-ready quality (+2.5% with imatrix). At 8B, Q3_K_HIFI is +3.5%; at 4B, +5.9%; at 1.7B, +3.4% but with much higher baseline PPL.
imatrix diminishing returns: At 14B, imatrix effectiveness plateaus — Q3_K_HIFI improves by only 0.1%, Q4_K_M by 0.1%, Q5_K variants by 1.1–1.3%. This contrasts sharply with 0.6B (40–48% recovery) and 1.7B (60–78% recovery).
Q4_K_HIFI paradox: Unlike at 8B (where imatrix helps Q4_K_HIFI by -1.1%) or 32B (where it helps by -0.7%), at 14B imatrix harms Q4_K_HIFI (+0.6%). This demonstrates non-linear scale effects in quantization behavior.
Q5_K_HIFI irrelevance: At 14B, residual quantization provides no measurable benefit — the model's inherent robustness makes the extra precision unnecessary. This changes at 32B where Q5_K_HIFI + imatrix achieves F16-equivalence.

Decision Flowchart

Need best quality?
├─ Yes → Q5_K_M + imatrix (+0.59% loss)
└─ No → Need smallest size or highest speed?
     ├─ Yes → Memory < 8 GiB? 
     │        ├─ Yes → Q3_K_S + imatrix (6,339 MiB, +6.5% loss)
     │        └─ No  → Q4_K_S + imatrix (8,172 MiB, +1.4% loss, 76.34 TPS)
     └─ No  → Q4_K_M + imatrix (best balance, +1.2% loss, standard build)
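
For readers who prefer code to diagrams, the flowchart can be sketched as a small Python function. The function and parameter names are hypothetical; the branches and the 8 GiB threshold come directly from the flowchart.

```python
def choose_quant(need_best_quality: bool,
                 need_small_or_fast: bool,
                 memory_gib: float) -> str:
    """Walk the decision flowchart and return the suggested variant."""
    if need_best_quality:
        return "Q5_K_M + imatrix"      # +0.59% loss
    if need_small_or_fast:
        if memory_gib < 8:
            return "Q3_K_S + imatrix"  # 6,339 MiB, +6.5% loss
        return "Q4_K_S + imatrix"      # 8,172 MiB, +1.4% loss, 76.34 TPS
    return "Q4_K_M + imatrix"          # best balance, +1.2% loss


# Example: quality is not critical, size matters, and 6 GiB is available.
print(choose_quant(need_best_quality=False,
                   need_small_or_fast=True,
                   memory_gib=6.0))
```

The quality question is asked first, so `memory_gib` is only consulted on the size/speed branch, just as in the flowchart.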

Practical Deployment Recommendations

For Most Users

→ Q4_K_M + imatrix
Delivers excellent quality (+1.2% vs F16), strong speed (72.89 TPS), compact size (8.38 GiB), and universal llama.cpp compatibility. The safe, practical choice for 90% of deployments.

For Quality-Critical Work

→ Q5_K_M + imatrix
Achieves near-lossless quantization (+0.59% vs F16) with 64% memory reduction and 2.5× speedup. Standard compatibility makes it preferable to Q5_K_HIFI, which offers no advantage.

For Edge/Mobile Deployment

→ Q3_K_M + imatrix
Best Q3 balance (+2.9% vs F16) in a small footprint (6,973 MiB). Production-ready even without imatrix (+5.7% loss), which is valuable in environments where imatrix generation isn't feasible.

For High-Throughput Serving

→ Q3_K_S + imatrix
Fastest variant (91.32 TPS, +255% vs F16) with acceptable quality (+6.5% loss). Ideal when every TPS matters and marginal quality differences are acceptable.

For Research on Quantization Limits

→ Q3_K_HIFI + imatrix
Demonstrates that 3-bit quantization can achieve near-lossless quality (+2.5% loss) on sufficiently large models. Valuable for characterizing the lower bounds of viable quantization.

Bottom Line Recommendations

Scenario	Recommended Variant	Rationale
Default / General Purpose	Q4_K_M + imatrix	Best balance of quality, speed, size, and compatibility
Maximum Quality	Q5_K_M + imatrix	Near-lossless (+0.59% vs F16) with standard toolchain
Minimum Size	Q3_K_S + imatrix	Smallest footprint (6.19 GiB) with acceptable quality
Maximum Speed	Q3_K_S + imatrix	Fastest (91.32 TPS) at 3.5× F16 speed
No imatrix available	Q4_K_HIFI (no imat)	Best quality without imatrix (+0.8% vs F16)
Extreme constraints	Q3_K_S + imatrix	Only if memory < 8 GiB; +6.5% loss acceptable

⚠️ Golden rules for 14B:

Never use imatrix with Q4_K_HIFI — it degrades quality
Skip Q5_K_HIFI entirely — no advantage over Q5_K_M
All three bit widths are viable — choose based on constraints, not quality cliffs
Q3_K is production-ready — the first scale where 3-bit quantization reliably works

✅ 14B is the quantization resilience milestone: Large enough for robustness across all bit widths, small enough for dramatic efficiency gains. This scale demonstrates that intelligent quantization can deliver near-F16 quality at 1/3 the memory with 2.5–3.5× speed — a compelling value proposition for nearly all deployments.

Non-technical model analysis and rankings

NOTE: This analysis does not include the HIFI models.

There are two good candidates: Qwen3-14B-f16:Q3_K_S and Qwen3-14B-f16:Q5_K_S. These cover the full range of temperatures and are good at all question types.

Another good option would be Qwen3-14B-f16:Q3_K_M, with good finishes across the temperature range.

Qwen3-14B-f16:Q2_K got very good results and would have been a 1st or 2nd place candidate, but it was the only model to fail the 'hello' question, which it should have passed.

You can read the results here: Qwen3-14b-analysis.md

If you find this useful, please give the project a ❤️ like.

Non-HIFI recommendation table based on output

Level	Speed	Size	Recommendation
Q2_K	⚡ Fastest	5.75 GB	An excellent option but it failed the 'hello' test. Use with caution.
🥇 Q3_K_S	⚡ Fast	6.66 GB	🥇 Best overall model. Two first places and two 3rd places. Excellent results across the full temperature range.
🥉 Q3_K_M	⚡ Fast	7.32 GB	🥉 A good option - it came 1st and 3rd, covering both ends of the temperature range.
Q4_K_S	🚀 Fast	8.57 GB	Not recommended, two 2nd places in low temperature questions with no other appearances.
Q4_K_M	🚀 Fast	9.00 GB	Not recommended. A single 3rd place with no other appearances.
🥈 Q5_K_S	🐢 Medium	10.3 GB	🥈 A very good second place option. A top 3 finisher across the full temperature range.
Q5_K_M	🐢 Medium	10.5 GB	Not recommended. A single 3rd place with no other appearances.
Q6_K	🐌 Slow	12.1 GB	Not recommended. No top 3 finishes at all.
Q8_0	🐌 Slow	15.7 GB	Not recommended. A single 2nd place with no other appearances.

Build notes

All of these models were built using these commands:

mkdir build
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON -DGGML_AVX=ON -DGGML_AVX2=ON -DGGML_CUDA=ON -DGGML_VULKAN=OFF -DLLAMA_CURL=OFF
cmake --build build --config Release -j

NOTE: Vulkan support is specifically turned off here. Vulkan performance was much worse, so if you want Vulkan support you can rebuild these models yourself.

The HIFI quantization also used a very large 4,697-chunk imatrix file for extra precision. You can re-use it here: Qwen3-14B-f16-imatrix-4697-generic.gguf

The imatrix was created from a generic mix of Wikipedia, mathematics, and coding examples.

Source code

You can use the HIFI GitHub repository to build it from source if you're interested: https://github.com/geoffmunn/llama.cpp.

Build notes: HIFI_BUILD_GUIDE.md

Improvements and feedback are welcome.

Usage

Load this model using:

OpenWebUI – self-hosted AI interface with RAG & tools
LM Studio – desktop app with GPU support and chat templates
GPT4All – private, local AI chatbot (offline-first)
Or directly via llama.cpp

Each quantized model includes its own README.md and shares a common MODELFILE for optimal configuration.

Importing directly into Ollama should work, but you might encounter this error: Error: invalid character '<' looking for beginning of value. In this case try these steps:

Download the model file: wget https://huggingface.co/geoffmunn/Qwen3-14B/resolve/main/Qwen3-14B-f16%3AQ3_K_S.gguf (replace the quantised version with the one you want)
Create the Modelfile: run nano Modelfile and enter these details (again, replacing Q3_K_S with the version you want):

FROM ./Qwen3-14B-f16:Q3_K_S.gguf

# Chat template using ChatML (used by Qwen)
SYSTEM You are a helpful assistant

TEMPLATE "{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>

# Default sampling
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER min_p 0.0
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 4096

The num_ctx value has been reduced to 4096 to increase speed significantly.

Then run this command: ollama create Qwen3-14B-f16:Q3_K_S -f Modelfile

You will now see "Qwen3-14B-f16:Q3_K_S" in your Ollama model list.

These import steps are also useful if you want to customise the default parameters or system prompt.
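
If you import several quantization levels, retyping the Modelfile each time gets tedious. The sketch below generates the same Modelfile text for any level. It is only an illustration: `render_modelfile` is a hypothetical helper, and it assumes the downloaded file follows the `Qwen3-14B-f16:<level>.gguf` naming used in the steps above.

```python
def render_modelfile(quant: str, num_ctx: int = 4096) -> str:
    """Build the Modelfile text shown above for a given quantization level."""
    lines = [
        f"FROM ./Qwen3-14B-f16:{quant}.gguf",
        "",
        "# Chat template using ChatML (used by Qwen)",
        "SYSTEM You are a helpful assistant",
        "",
        'TEMPLATE "{{ if .System }}<|im_start|>system',
        "{{ .System }}<|im_end|>{{ end }}<|im_start|>user",
        "{{ .Prompt }}<|im_end|>",
        "<|im_start|>assistant",
        '"',
        "PARAMETER stop <|im_start|>",
        "PARAMETER stop <|im_end|>",
        "",
        "# Default sampling",
        "PARAMETER temperature 0.6",
        "PARAMETER top_p 0.95",
        "PARAMETER top_k 20",
        "PARAMETER min_p 0.0",
        "PARAMETER repeat_penalty 1.1",
        f"PARAMETER num_ctx {num_ctx}",
    ]
    return "\n".join(lines) + "\n"


quant = "Q4_K_M"
modelfile_text = render_modelfile(quant)
print(modelfile_text)
# Save the printed text as "Modelfile", then run:
print(f"ollama create Qwen3-14B-f16:{quant} -f Modelfile")
```

Pass a larger `num_ctx` if you need a longer context window and can accept slower generation.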

Author

👤 Geoff Munn (@geoffmunn)
🔗 Hugging Face Profile

Disclaimer

This is a community conversion for local inference. Not affiliated with Alibaba Cloud or the Qwen team.