---
license: apache-2.0
tags:
- gguf
- qwen
- qwen3
- qwen3-14b
- qwen3-14b-gguf
- llama.cpp
- quantized
- text-generation
- reasoning
- agent
- multilingual
- imatrix
- q3_hifi
- q4_hifi
- q5_hifi
base_model: Qwen/Qwen3-14B
author: geoffmunn
pipeline_tag: text-generation
language:
- en
- zh
- es
- fr
- de
- ru
- ar
- ja
- ko
- hi
---
# Qwen3-14B-f16-GGUF
This is a **GGUF-quantized version** of the **[Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B)** language model, a **14-billion-parameter** LLM built for deep reasoning, research-grade accuracy, and autonomous workflows. Converted for use with `llama.cpp`, [LM Studio](https://lmstudio.ai), [OpenWebUI](https://openwebui.com), [GPT4All](https://gpt4all.io), and more.
## Why Use a 14B Model?
The **Qwen3-14B** model delivers **serious intelligence in a locally runnable package**, offering near-flagship performance while remaining feasible to run on a single high-end consumer GPU or a well-equipped CPU setup. It's the optimal choice when you need strong reasoning, robust code generation, and deep language understanding without relying on the cloud or massive infrastructure.
### Highlights:
- **State-of-the-art performance among open 14B-class models**, excelling in reasoning, math, coding, and multilingual tasks
- **Efficient inference with quantization**: runs on a 24 GB GPU (e.g., RTX 4090) or even CPU with quantized GGUF/AWQ variants (~12–14 GB RAM usage)
- **Strong contextual handling**: supports long inputs and complex multi-step workflows, ideal for agentic or RAG-based systems
- **Fully open and commercially usable**, giving you full control over deployment and customization
### It's ideal for:
- **Self-hosted AI assistants** that understand nuance, remember context, and generate high-quality responses
- **On-prem development environments** needing local code completion, documentation, or debugging
- **Private RAG or enterprise applications** requiring accuracy, reliability, and data sovereignty
- **Researchers and developers** seeking a powerful, open-weight alternative to closed 10B–20B models
Choose **Qwen3-14B** when you've outgrown 7B–8B models but still want to run efficiently offline, balancing capability, control, and cost without sacrificing quality.
# Qwen3 14B Quantization Guide: Cross-Bit Summary & Recommendations
## Executive Summary
At 14B scale, **quantization quality is exceptional across all bit widths**: models are inherently resilient to compression, with even Q3_K achieving near-lossless fidelity (+2.5% loss with imatrix). All variants deliver production-ready quality, making 14B the "sweet spot" where aggressive quantization meets robust model architecture. The choice depends entirely on your constraints:
| Quantization | Best Variant (+ imatrix) | Quality vs F16 | File Size | Speed | Memory |
|--------------|--------------------------|----------------|-----------|-------|--------|
| **Q5_K** | Q5_K_M + imatrix | **+0.59%** (best) | 9.55 GiB | 63.81 TPS | 10,021 MiB |
| **Q4_K** | Q4_K_M + imatrix | +1.2% | 8.38 GiB | 72.89 TPS | 8,581 MiB |
| **Q3_K** | Q3_K_HIFI + imatrix | +2.5% | 7.93 GiB | 63.93 TPS | 8,120 MiB |
💡 **Critical insight**: 14B models quantize superbly. Even **Q3_K_HIFI + imatrix achieves only +2.5% precision loss**, making 3-bit quantization viable for production use. imatrix provides modest but valuable gains, though **Q4_K_HIFI is uniquely harmed by imatrix** (+0.6% degradation).
---
## Bit-Width Recommendations by Use Case
### ✅ Quality-Critical Applications
**→ Q5_K_M + imatrix**
- Best perplexity at **9.0680 PPL (+0.59% vs F16)**: near-lossless fidelity
- 64.4% memory reduction (10,021 MiB vs 28,170 MiB)
- 148% faster than F16 (63.81 TPS vs 25.73 TPS)
- **Standard llama.cpp compatibility**: no custom builds needed
- ⚠️ **Avoid Q5_K_HIFI**: provides *no measurable advantage* over Q5_K_M (+0.02% worse with imatrix) while requiring a custom build and 2.3% more memory
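The percentages in these bullets follow directly from the raw figures. As a rough check, here is a minimal Python sketch of the arithmetic, using only the numbers quoted above; the function names are illustrative and not part of any tool.

```python
def memory_reduction_pct(quant_mib: float, f16_mib: float) -> float:
    """Percentage of F16 runtime memory saved by the quantized variant."""
    return (1.0 - quant_mib / f16_mib) * 100.0


def speed_gain_pct(quant_tps: float, f16_tps: float) -> float:
    """Throughput gain over F16, expressed as a percentage."""
    return (quant_tps / f16_tps - 1.0) * 100.0


# Q5_K_M + imatrix versus F16, figures taken from the bullets above
print(f"memory reduction: {memory_reduction_pct(10_021, 28_170):.1f}%")
print(f"speed gain:       {speed_gain_pct(63.81, 25.73):.0f}%")
```

The same two helpers can be applied to any other row of the summary table to compare variants on equal terms.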
### ⚖️ Best Overall Balance (Recommended Default)
**→ Q4_K_M + imatrix**
- Excellent +1.2% precision loss vs F16 (PPL 9.1247)
- Strong 72.89 TPS speed (+183% vs F16)
- Compact 8.38 GiB file size (69.5% smaller than F16)
- **Standard llama.cpp compatibility**: universal toolchain support
- Ideal for most development and production scenarios
### Maximum Speed / Minimum Size
**→ Q3_K_S + imatrix**
- Fastest variant at **91.32 TPS** (+255% vs F16)
- Smallest footprint at **6.19 GiB** (77.5% memory reduction)
- Acceptable +6.5% precision loss with imatrix (unusable at +7.7% without)
- ⚠️ **Never use Q3_K_S without imatrix**: quality degrades severely
### 📱 Extreme Memory Constraints (< 8 GiB)
**→ Q3_K_S + imatrix**
- Absolute smallest runtime at **6,339 MiB**
- Only viable option under 8 GiB budget
- +6.5% quality loss acceptable for non-critical tasks
### Near-Lossless 3-Bit Option
**→ Q3_K_HIFI + imatrix**
- **Surprisingly good quality at +2.5% loss**: production-ready for Q3
- 71.2% memory reduction (8,120 MiB)
- Unique value: When you need Q3 size/speed but can't accept Q3_K_S quality
- ⚠️ **23% slower than Q3_K_M**: significant speed trade-off
---
## Critical Warnings for 14B Scale
⚠️ **Q4_K_HIFI + imatrix is counterproductive**: imatrix *degrades* quality by +0.6% (9.0847 → 9.1393 PPL). This is unique to 14B scale.
- **Without imatrix**: Q4_K_HIFI is best Q4 quality (+0.8% vs F16)
- **With imatrix**: Q4_K_M is best Q4 quality (+1.2% vs F16)
- **Never use imatrix with Q4_K_HIFI at 14B**
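To make the +0.6% figure concrete, the relative perplexity change can be recomputed from the two PPL values quoted in the warning. This is a minimal sketch; `ppl_change_pct` is an illustrative helper name, and the Q4_K_M value (9.1247) is the one given earlier in this guide.

```python
def ppl_change_pct(ppl_before: float, ppl_after: float) -> float:
    """Relative perplexity change in percent (positive means worse)."""
    return (ppl_after / ppl_before - 1.0) * 100.0


Q4_K_HIFI_NO_IMATRIX = 9.0847
Q4_K_HIFI_IMATRIX = 9.1393
Q4_K_M_IMATRIX = 9.1247

# Effect of adding imatrix to Q4_K_HIFI
print(f"{ppl_change_pct(Q4_K_HIFI_NO_IMATRIX, Q4_K_HIFI_IMATRIX):+.2f}%")

# With imatrix, Q4_K_M has the lower (better) perplexity of the two
print(Q4_K_M_IMATRIX < Q4_K_HIFI_IMATRIX)
```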
⚠️ **Q5_K_HIFI provides zero advantage at 14B**:
- Quality is *worse* than Q5_K_M with imatrix (+0.61% vs +0.59%)
- Costs +467 MiB memory (+4.8% overhead) and requires custom build
- **Skip it entirely**: Q5_K_M is strictly superior for production use

⚠️ **All Q3_K variants are production-ready**: even Q3_K_S with imatrix (+6.5% loss) remains usable, a dramatic improvement over smaller scales where Q3 often fails.
- Q3_K_HIFI without imatrix: +2.6% loss (excellent)
- Q3_K_M with imatrix: +2.9% loss (excellent)
- This is the smallest scale where Q3 quantization is reliably viable
⚠️ **imatrix impact is minimal at 14B**: Unlike smaller models where imatrix recovers 60–78% of lost precision, at 14B the gains are modest (0.1–2.6%):
- Q5_K variants: +1.1–1.3% improvement
- Q4_K_M: +0.1% improvement (negligible)
- Q4_K_S: +0.5% improvement
- Q3_K_HIFI: -0.1% (no change; already near-perfect)
---
## Memory Budget Guide
| Available VRAM | Recommended Variant | Expected Quality | Why |
|----------------|---------------------|------------------|-----|
| **< 6.5 GiB** | Q3_K_S + imatrix | PPL 9.60, +6.5% loss | Only option that fits; quality acceptable for non-critical tasks |
| **6.5 – 8.2 GiB** | Q3_K_M + imatrix | PPL 9.28, +2.9% loss ✅ | Best Q3 balance; production-ready quality |
| **8.2 – 10.1 GiB** | Q4_K_M + imatrix | PPL 9.12, +1.2% loss ✅ | Best overall balance; standard compatibility |
| **10.1 – 12.0 GiB** | Q5_K_M + imatrix | PPL 9.07, +0.59% loss ✅ | Near-lossless quality; best precision available |
| **> 12.0 GiB** | Q5_K_M + imatrix or F16 | PPL 9.07 or 9.01 | F16 only if absolute precision required |
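
The table above can be encoded as a small lookup if you want to pick a variant programmatically. This is only a sketch: the thresholds and labels are copied from the table, `recommend_variant` is an illustrative name, and the handling of exact boundary values (a budget equal to a threshold falls into the next row up) is an assumption, since the table does not specify it.

```python
# (upper bound in GiB, recommendation), copied from the Memory Budget Guide
BUDGET_TABLE = [
    (6.5, "Q3_K_S + imatrix"),
    (8.2, "Q3_K_M + imatrix"),
    (10.1, "Q4_K_M + imatrix"),
    (12.0, "Q5_K_M + imatrix"),
]
ABOVE_TABLE = "Q5_K_M + imatrix or F16"


def recommend_variant(vram_gib: float) -> str:
    """Return the table's recommendation for a given VRAM budget in GiB."""
    for upper_bound, variant in BUDGET_TABLE:
        if vram_gib < upper_bound:  # assumption: boundaries belong to the next row
            return variant
    return ABOVE_TABLE


for budget in (6.0, 8.0, 9.5, 11.0, 24.0):
    print(f"{budget:>5.1f} GiB -> {recommend_variant(budget)}")
```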
---
## Cross-Bit Performance Comparison
| Priority | Q3_K Best | Q4_K Best | Q5_K Best | Winner |
|----------|-----------|-----------|-----------|--------|
| **Quality (with imat)** | Q3_K_HIFI (+2.5%) | Q4_K_M (+1.2%) | **Q5_K_M (+0.59%)** ✅ | **Q5_K_M** |
| **Quality (no imat)** | Q3_K_HIFI (+2.6%) | **Q4_K_HIFI (+0.8%)** ✅ | Q5_K_S (+1.84%) | **Q4_K_HIFI** |
| **Speed** | **Q3_K_S (91.32 TPS)** ✅ | Q4_K_S (76.34 TPS) | Q5_K_S (65.40 TPS) | **Q3_K_S** |
| **Smallest Size** | **Q3_K_S (6.19 GiB)** ✅ | Q4_K_S (7.98 GiB) | Q5_K_S (9.33 GiB) | **Q3_K_S** |
| **Best Balance** | Q3_K_M + imat | **Q4_K_M + imat** ✅ | Q5_K_M + imat | **Q4_K_M** |

✅ = Recommended for general use
⚠️ = Context-dependent (see warnings above)
---
## Scale-Specific Insights: Why 14B Quantizes So Well
1. **Model redundancy threshold**: 14B represents the inflection point where parameter count provides sufficient redundancy that quantization errors average out rather than accumulating. Below 8B, quality degrades more rapidly; above 14B, gains plateau.
2. **Q3_K viability threshold**: 14B is the smallest scale where **Q3_K_HIFI achieves truly production-ready quality** (+2.5% with imatrix). At 8B, Q3_K_HIFI is +3.5%; at 4B, +5.9%; at 1.7B, +3.4% but with much higher baseline PPL.
3. **imatrix diminishing returns**: At 14B, imatrix effectiveness plateaus, with Q3_K_HIFI improving by only 0.1%, Q4_K_M by 0.1%, and Q5_K variants by 1.1–1.3%. This contrasts sharply with 0.6B (40–48% recovery) and 1.7B (60–78% recovery).
4. **Q4_K_HIFI paradox**: Unlike at 8B (where imatrix helps Q4_K_HIFI by -1.1%) or 32B (where it helps by -0.7%), at 14B imatrix *harms* Q4_K_HIFI (+0.6%). This demonstrates non-linear scale effects in quantization behavior.
5. **Q5_K_HIFI irrelevance**: At 14B, residual quantization provides no measurable benefit because the model's inherent robustness makes the extra precision unnecessary. This changes at 32B where Q5_K_HIFI + imatrix achieves F16-equivalence.
---
## Decision Flowchart
```text
Need best quality?
├─ Yes → Q5_K_M + imatrix (+0.59% loss)
└─ No → Need smallest size/speed?
   ├─ Yes → Memory < 8 GiB?
   │  ├─ Yes → Q3_K_S + imatrix (6,339 MiB, +6.5% loss)
   │  └─ No → Q4_K_S + imatrix (8,172 MiB, +1.4% loss, 76.34 TPS)
   └─ No → Q4_K_M + imatrix (best balance, +1.2% loss, standard build)
```
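
The same decision tree can be written as a short function. This is a sketch under stated assumptions: the function and parameter names are made up for illustration, and the branch order simply mirrors the flowchart.

```python
def choose_variant(need_best_quality: bool,
                   need_smallest_or_fastest: bool,
                   memory_gib: float) -> str:
    """Walk the decision flowchart and return the suggested variant."""
    if need_best_quality:
        return "Q5_K_M + imatrix"      # +0.59% loss
    if need_smallest_or_fastest:
        if memory_gib < 8:
            return "Q3_K_S + imatrix"  # 6,339 MiB, +6.5% loss
        return "Q4_K_S + imatrix"      # 8,172 MiB, +1.4% loss
    return "Q4_K_M + imatrix"          # best balance, standard build


print(choose_variant(need_best_quality=False,
                     need_smallest_or_fastest=True,
                     memory_gib=6.5))
```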
---
## Practical Deployment Recommendations
### For Most Users
**→ Q4_K_M + imatrix**
Delivers excellent quality (+1.2% vs F16), strong speed (72.89 TPS), compact size (8.38 GiB), and universal llama.cpp compatibility. The safe, practical choice for 90% of deployments.
### For Quality-Critical Work
**→ Q5_K_M + imatrix**
Achieves near-lossless quantization (+0.59% vs F16) with 64% memory reduction and 2.5× speedup. Standard compatibility makes it preferable to Q5_K_HIFI, which offers no advantage.
### For Edge/Mobile Deployment
**→ Q3_K_M + imatrix**
Best Q3 quality (+2.9% vs F16) with smallest viable footprint (6,973 MiB). Production-ready even without imatrix (+5.7% loss), which is valuable for environments where imatrix generation isn't feasible.
### For High-Throughput Serving
**→ Q3_K_S + imatrix**
Fastest variant (91.32 TPS, +255% vs F16) with acceptable quality (+6.5% loss). Ideal when every TPS matters and marginal quality differences are acceptable.
### For Research on Quantization Limits
**→ Q3_K_HIFI + imatrix**
Demonstrates that 3-bit quantization can achieve near-lossless quality (+2.5% loss) on sufficiently large models. Valuable for characterizing the lower bounds of viable quantization.
---
## Bottom Line Recommendations
| Scenario | Recommended Variant | Rationale |
|----------|---------------------|-----------|
| **Default / General Purpose** | Q4_K_M + imatrix | Best balance of quality, speed, size, and compatibility |
| **Maximum Quality** | Q5_K_M + imatrix | Near-lossless (+0.59% vs F16) with standard toolchain |
| **Minimum Size** | Q3_K_S + imatrix | Smallest footprint (6.19 GiB) with acceptable quality |
| **Maximum Speed** | Q3_K_S + imatrix | Fastest (91.32 TPS) at about 3.5× F16 speed |
| **No imatrix available** | Q4_K_HIFI (no imat) | Best quality without imatrix (+0.8% vs F16) |
| **Extreme constraints** | Q3_K_S + imatrix | Only if memory < 8 GiB; +6.5% loss acceptable |
⚠️ **Golden rules for 14B**:
1. **Never use imatrix with Q4_K_HIFI**: it degrades quality
2. **Skip Q5_K_HIFI entirely**: no advantage over Q5_K_M
3. **All three bit widths are viable**: choose based on constraints, not quality cliffs
4. **Q3_K is production-ready**: the first scale where 3-bit quantization reliably works

✅ **14B is the quantization resilience milestone**: Large enough for robustness across all bit widths, small enough for dramatic efficiency gains. This scale demonstrates that intelligent quantization can deliver near-F16 quality at 1/3 the memory with 2.5–3.5× speed, a compelling value proposition for nearly all deployments.
## Non-technical model analysis and rankings
**NOTE:** This analysis does not include the HIFI models.
There are two good candidates: **Qwen3-14B-f16:Q3_K_S** and **Qwen3-14B-f16:Q5_K_S**. These cover the full range of temperatures and are good at all question types.
Another good option would be **Qwen3-14B-f16:Q3_K_M**, with good finishes across the temperature range.
**Qwen3-14B-f16:Q2_K** got very good results and would have been a 1st or 2nd place candidate but was the only model to fail the 'hello' question which it should have passed.
You can read the results here: [Qwen3-14b-analysis.md](Qwen3-14b-analysis.md)
If you find this useful, please give the project a ❤️ like.
## Non-HIFI recommendation table based on output
| Level | Speed | Size | Recommendation |
|-------|-------|------|----------------|
| Q2_K | ⚡ Fastest | 5.75 GB | An excellent option but it failed the 'hello' test. Use with caution. |
| 🥇 Q3_K_S | ⚡ Fast | 6.66 GB | 🥇 **Best overall model.** Two first places and two 3rd places. Excellent results across the full temperature range. |
| 🥉 Q3_K_M | ⚡ Fast | 7.32 GB | 🥉 A good option - it came 1st and 3rd, covering both ends of the temperature range. |
| Q4_K_S | Fast | 8.57 GB | Not recommended, two 2nd places in low temperature questions with no other appearances. |
| Q4_K_M | Fast | 9.00 GB | Not recommended. A single 3rd place with no other appearances. |
| 🥈 Q5_K_S | 🐢 Medium | 10.3 GB | 🥈 A very good second place option. A top 3 finisher across the full temperature range. |
| Q5_K_M | 🐢 Medium | 10.5 GB | Not recommended. A single 3rd place with no other appearances. |
| Q6_K | Slow | 12.1 GB | Not recommended. No top 3 finishes at all. |
| Q8_0 | Slow | 15.7 GB | Not recommended. A single 2nd place with no other appearances. |
## Build notes
All of these models were built using these commands:
```bash
mkdir build
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON -DGGML_AVX=ON -DGGML_AVX2=ON -DGGML_CUDA=ON -DGGML_VULKAN=OFF -DLLAMA_CURL=OFF
cmake --build build --config Release -j
```
**NOTE:** Vulkan support is specifically turned off here. Vulkan performance was much worse, so if you want Vulkan support you can rebuild these models yourself.
The HIFI quantization also used a very large 4697 chunk imatrix file for extra precision. You can re-use it here: [Qwen3-14B-f16-imatrix-4697-generic.gguf](https://huggingface.co/geoffmunn/Qwen3-14B-f16/blob/main/Qwen3-14B-f16-imatrix-4697-generic.gguf)
The imatrix was created as a generic mix of Wikipedia, mathematics, and coding examples.
### Source code
You can use the HIFI GitHub repository to build it from source if you're interested: [https://github.com/geoffmunn/llama.cpp](https://github.com/geoffmunn/llama.cpp).
Build notes: [HIFI_BUILD_GUIDE.md](https://github.com/geoffmunn/llama.cpp/blob/master/HIFI_BUILD_GUIDE.md)
Improvements and feedback are welcome.
## Usage
Load this model using:
- [OpenWebUI](https://openwebui.com): self-hosted AI interface with RAG & tools
- [LM Studio](https://lmstudio.ai): desktop app with GPU support and chat templates
- [GPT4All](https://gpt4all.io): private, local AI chatbot (offline-first)
- Or directly via `llama.cpp`
Each quantized model includes its own `README.md` and shares a common `MODELFILE` for optimal configuration.
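
For scripted use, the Python bindings for llama.cpp (`llama-cpp-python`) can load a quant straight from this repository. The following is a minimal sketch rather than a tested recipe. It assumes `llama-cpp-python` and `huggingface-hub` are installed, and that the repository contains exactly one file matching `*Q4_K_M.gguf` (the glob is a guess at the file naming). The system prompt and sampling values are the ones from the Ollama Modelfile further down this page; `build_messages` and `ask` are illustrative names.

```python
def build_messages(prompt: str, system: str = "You are a helpful assistant") -> list:
    """Assemble a chat-style message list for a single user prompt."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
    ]


def ask(prompt: str, quant: str = "Q4_K_M") -> str:
    """Download the chosen quant (if needed), run one chat turn, return the reply."""
    # Imported lazily so build_messages() works without the package installed.
    from llama_cpp import Llama

    llm = Llama.from_pretrained(
        repo_id="geoffmunn/Qwen3-14B-f16",
        filename=f"*{quant}.gguf",  # assumption: one file in the repo matches this glob
        n_ctx=4096,
        n_gpu_layers=-1,            # offload all layers when a GPU build is available
        verbose=False,
    )
    result = llm.create_chat_completion(
        messages=build_messages(prompt),
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        min_p=0.0,
    )
    return result["choices"][0]["message"]["content"]
```

Calling `print(ask("What is the capital of France?"))` would then download the file on first use and print the model's reply. If the glob matches no file or several, adjust `filename` to the exact name shown in the repository's file list.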
Importing directly into Ollama should work, but you might encounter this error: `Error: invalid character '<' looking for beginning of value`.
In this case try these steps:
1. `wget https://huggingface.co/geoffmunn/Qwen3-14B/resolve/main/Qwen3-14B-f16%3AQ3_K_S.gguf` (replace the quantised version with the one you want)
2. `nano Modelfile` and enter these details (again, replacing Q3_K_S with the version you want):
```text
FROM ./Qwen3-14B-f16:Q3_K_S.gguf
# Chat template using ChatML (used by Qwen)
SYSTEM You are a helpful assistant
TEMPLATE "{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>
# Default sampling
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER min_p 0.0
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 4096
```
The `num_ctx` value has been dropped to increase speed significantly.
3. Then run this command: `ollama create Qwen3-14B-f16:Q3_K_S -f Modelfile`
You will now see "Qwen3-14B-f16:Q3_K_S" in your Ollama model list.
These import steps are also useful if you want to customise the default parameters or system prompt.
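
If you import several quants, the Modelfile from step 2 can be generated rather than edited by hand. The sketch below copies the template from above and substitutes the quant name and context size; `render_modelfile` and the `@QUANT@` / `@NUM_CTX@` placeholders are illustrative choices, not part of Ollama.

```python
# Modelfile text copied from step 2, with two placeholders to substitute.
MODELFILE_BODY = '''FROM ./Qwen3-14B-f16:@QUANT@.gguf
# Chat template using ChatML (used by Qwen)
SYSTEM You are a helpful assistant
TEMPLATE "{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>
# Default sampling
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER min_p 0.0
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx @NUM_CTX@
'''


def render_modelfile(quant: str = "Q3_K_S", num_ctx: int = 4096) -> str:
    """Return Modelfile text for the given quant and context length."""
    # Plain replace() is used because the template itself contains {{ }} braces.
    return (MODELFILE_BODY
            .replace("@QUANT@", quant)
            .replace("@NUM_CTX@", str(num_ctx)))


if __name__ == "__main__":
    print(render_modelfile("Q4_K_M"))
```

Save the printed text as `Modelfile`, then run the `ollama create` command from step 3 with the matching quant name.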
## Author
👤 Geoff Munn (@geoffmunn)
[Hugging Face Profile](https://huggingface.co/geoffmunn)
## Disclaimer
This is a community conversion for local inference. Not affiliated with Alibaba Cloud or the Qwen team.