Instructions to use geoffmunn/Qwen3-14B-f16 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use geoffmunn/Qwen3-14B-f16 with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="geoffmunn/Qwen3-14B-f16",
	filename="Qwen3-14B-f16-imatrix-4697-coder.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
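
The call above returns an OpenAI-style completion dictionary rather than printing anything. Below is a minimal sketch of pulling the reply text out of it, assuming the usual `choices[0]["message"]["content"]` layout. The `extract_reply` helper and the sample dictionary are illustrative only, and the `<think>` stripping assumes Qwen3's thinking-mode tags appear inline in the content.

```python
import re

# Hypothetical helper: not part of llama-cpp-python.
def extract_reply(response: dict, strip_thinking: bool = True) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    text = response["choices"][0]["message"]["content"]
    if strip_thinking:
        # Qwen3 may wrap its reasoning in <think>...</think>; drop that part.
        text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    return text.strip()

# Stand-in for the value returned by llm.create_chat_completion(...).
sample_response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "<think>\nEasy one.\n</think>\n\nParis.",
            }
        }
    ]
}

print(extract_reply(sample_response))
```

In real use, pass the return value of `llm.create_chat_completion(...)` in place of `sample_response`.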


llama.cpp

How to use geoffmunn/Qwen3-14B-f16 with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf geoffmunn/Qwen3-14B-f16:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf geoffmunn/Qwen3-14B-f16:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf geoffmunn/Qwen3-14B-f16:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf geoffmunn/Qwen3-14B-f16:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf geoffmunn/Qwen3-14B-f16:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf geoffmunn/Qwen3-14B-f16:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf geoffmunn/Qwen3-14B-f16:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf geoffmunn/Qwen3-14B-f16:Q4_K_M

Use Docker

docker model run hf.co/geoffmunn/Qwen3-14B-f16:Q4_K_M


vLLM

How to use geoffmunn/Qwen3-14B-f16 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "geoffmunn/Qwen3-14B-f16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "geoffmunn/Qwen3-14B-f16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
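
The same request can be made from Python with only the standard library. This is a sketch that mirrors the curl call above; the `build_chat_request` and `ask` names are illustrative, and the base URL and model name are simply copied from the curl example.

```python
import json
import urllib.request

# Assumed defaults, taken from the curl example above.
BASE_URL = "http://localhost:8000/v1"
MODEL = "geoffmunn/Qwen3-14B-f16"

def build_chat_request(prompt: str, base_url: str = BASE_URL,
                       model: str = MODEL) -> urllib.request.Request:
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt: str) -> str:
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("What is the capital of France?"))` should print the model's answer.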


Ollama
How to use geoffmunn/Qwen3-14B-f16 with Ollama:
```
ollama run hf.co/geoffmunn/Qwen3-14B-f16:Q4_K_M
```

Unsloth Studio

How to use geoffmunn/Qwen3-14B-f16 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for geoffmunn/Qwen3-14B-f16 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for geoffmunn/Qwen3-14B-f16 to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for geoffmunn/Qwen3-14B-f16 to start chatting

Pi

How to use geoffmunn/Qwen3-14B-f16 with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf geoffmunn/Qwen3-14B-f16:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "geoffmunn/Qwen3-14B-f16:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use geoffmunn/Qwen3-14B-f16 with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf geoffmunn/Qwen3-14B-f16:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default geoffmunn/Qwen3-14B-f16:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use geoffmunn/Qwen3-14B-f16 with Docker Model Runner:
```
docker model run hf.co/geoffmunn/Qwen3-14B-f16:Q4_K_M
```

Lemonade

How to use geoffmunn/Qwen3-14B-f16 with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull geoffmunn/Qwen3-14B-f16:Q4_K_M

Run and chat with the model

lemonade run user.Qwen3-14B-f16-Q4_K_M

List all available models

lemonade list


---
license: apache-2.0
tags:
  - gguf
  - qwen
  - qwen3
  - qwen3-14b
  - qwen3-14b-gguf
  - llama.cpp
  - quantized
  - text-generation
  - reasoning   
  - agent   
  - multilingual
  - imatrix
  - q3_hifi
  - q4_hifi
  - q5_hifi
base_model: Qwen/Qwen3-14B
author: geoffmunn
pipeline_tag: text-generation
language:
  - en
  - zh
  - es
  - fr
  - de
  - ru
  - ar
  - ja
  - ko
  - hi
---

# Qwen3-14B-f16-GGUF

This is a **GGUF-quantized version** of the **[Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B)** language model, a **14-billion-parameter** LLM built for deep reasoning, research-grade accuracy, and autonomous workflows. It has been converted for use with `llama.cpp`, [LM Studio](https://lmstudio.ai), [OpenWebUI](https://openwebui.com), [GPT4All](https://gpt4all.io), and more.

## Why Use a 14B Model?

The **Qwen3-14B** model delivers **serious intelligence in a locally runnable package**, offering near-flagship performance while remaining feasible to run on a single high-end consumer GPU or a well-equipped CPU setup. It’s the optimal choice when you need strong reasoning, robust code generation, and deep language understanding—without relying on the cloud or massive infrastructure.

### Highlights:
- **State-of-the-art performance among open 14B-class models**, excelling in reasoning, math, coding, and multilingual tasks  
- **Efficient inference with quantization**: runs on a 24 GB GPU (e.g., RTX 4090) or even CPU with quantized GGUF/AWQ variants (~12–14 GB RAM usage)  
- **Strong contextual handling**: supports long inputs and complex multi-step workflows, ideal for agentic or RAG-based systems  
- **Fully open and commercially usable**, giving you full control over deployment and customization  

### It’s ideal for:
- **Self-hosted AI assistants** that understand nuance, remember context, and generate high-quality responses  
- **On-prem development environments** needing local code completion, documentation, or debugging  
- **Private RAG or enterprise applications** requiring accuracy, reliability, and data sovereignty  
- **Researchers and developers** seeking a powerful, open-weight alternative to closed 10B–20B models  

Choose **Qwen3-14B** when you’ve outgrown 7B–8B models but still want to run efficiently offline—balancing capability, control, and cost without sacrificing quality.

# Qwen3 14B Quantization Guide: Cross-Bit Summary & Recommendations

## Executive Summary

At 14B scale, **quantization quality is exceptional across all bit widths**—models are inherently resilient to compression, with even Q3_K achieving near-lossless fidelity (+2.5% loss with imatrix). All variants deliver production-ready quality, making 14B the "sweet spot" where aggressive quantization meets robust model architecture. The choice depends entirely on your constraints:

| Quantization | Best Variant (+ imatrix) | Quality vs F16 | File Size | Speed | Memory |
|--------------|--------------------------|----------------|-----------|-------|--------|
| **Q5_K** | Q5_K_M + imatrix | **+0.59%** (best) | 9.55 GiB | 63.81 TPS | 10,021 MiB |
| **Q4_K** | Q4_K_M + imatrix | +1.2% | 8.38 GiB | 72.89 TPS | 8,581 MiB |
| **Q3_K** | Q3_K_HIFI + imatrix | +2.5% | 7.93 GiB | 63.93 TPS | 8,120 MiB |

💡 **Critical insight**: 14B models quantize superbly—even **Q3_K_HIFI + imatrix achieves only +2.5% precision loss**, making 3-bit quantization viable for production use. imatrix provides modest but valuable gains, though **Q4_K_HIFI is uniquely harmed by imatrix** (+0.6% degradation).

---

## Bit-Width Recommendations by Use Case

### ✅ Quality-Critical Applications
**→ Q5_K_M + imatrix**  
- Best perplexity at **9.0680 PPL (+0.59% vs F16)** — near-lossless fidelity  
- 64.4% memory reduction (10,021 MiB vs 28,170 MiB)  
- 148% faster than F16 (63.81 TPS vs 25.73 TPS)  
- **Standard llama.cpp compatibility** — no custom builds needed  
- ⚠️ **Avoid Q5_K_HIFI** — provides *no measurable advantage* over Q5_K_M (+0.02% worse with imatrix) while requiring custom build and 2.3% more memory
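
The percentages in this list follow directly from the raw measurements. Here is a small sketch of the arithmetic, using the F16 and Q5_K_M figures quoted above; the function names are illustrative.

```python
def percent_reduction(baseline: float, value: float) -> float:
    """How much smaller `value` is than `baseline`, in percent."""
    return (1 - value / baseline) * 100

def percent_faster(baseline_tps: float, tps: float) -> float:
    """How much faster `tps` is than `baseline_tps`, in percent."""
    return (tps / baseline_tps - 1) * 100

# Figures quoted in the list above.
F16_MEMORY_MIB, F16_TPS = 28_170, 25.73
Q5_K_M_MEMORY_MIB, Q5_K_M_TPS = 10_021, 63.81

print(f"{percent_reduction(F16_MEMORY_MIB, Q5_K_M_MEMORY_MIB):.1f}% less memory")  # 64.4% less memory
print(f"{percent_faster(F16_TPS, Q5_K_M_TPS):.0f}% faster")                        # 148% faster
```

The same two functions reproduce the memory and speed percentages quoted for the other variants.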

### ⚖️ Best Overall Balance (Recommended Default)
**→ Q4_K_M + imatrix**  
- Excellent +1.2% precision loss vs F16 (PPL 9.1247)  
- Strong 72.89 TPS speed (+183% vs F16)  
- Compact 8.38 GiB file size (69.5% smaller than F16)  
- **Standard llama.cpp compatibility** — universal toolchain support  
- Ideal for most development and production scenarios

### 🚀 Maximum Speed / Minimum Size
**→ Q3_K_S + imatrix**  
- Fastest variant at **91.32 TPS** (+255% vs F16)  
- Smallest footprint at **6.19 GiB** (77.5% memory reduction)  
- Acceptable +6.5% precision loss with imatrix (unusable at +7.7% without)  
- ⚠️ **Never use Q3_K_S without imatrix** — quality degrades severely

### 📱 Extreme Memory Constraints (< 8 GiB)
**→ Q3_K_S + imatrix**  
- Absolute smallest runtime at **6,339 MiB**  
- Smallest option for budgets under 8 GiB, and the only one that fits below 6.5 GiB  
- +6.5% quality loss acceptable for non-critical tasks

### 💎 Near-Lossless 3-Bit Option
**→ Q3_K_HIFI + imatrix**  
- **Surprisingly good quality at +2.5% loss** — production-ready for Q3  
- 71.2% memory reduction (8,120 MiB)  
- Unique value: When you need Q3 size/speed but can't accept Q3_K_S quality  
- ⚠️ **23% slower than Q3_K_M** — significant speed trade-off

---

## Critical Warnings for 14B Scale

⚠️ **Q4_K_HIFI + imatrix is counterproductive** — imatrix *degrades* quality by +0.6% (9.0847 → 9.1393 PPL). This is unique to 14B scale.  
- **Without imatrix**: Q4_K_HIFI is best Q4 quality (+0.8% vs F16)  
- **With imatrix**: Q4_K_M is best Q4 quality (+1.2% vs F16)  
- **Never use imatrix with Q4_K_HIFI at 14B**
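
For reference, the +0.6% figure is the relative change between the two perplexity values quoted above. A minimal sketch follows; the function name is illustrative, and a positive result means higher perplexity, i.e. worse quality.

```python
def ppl_change_percent(ppl_before: float, ppl_after: float) -> float:
    """Relative perplexity change in percent (positive = worse)."""
    return (ppl_after / ppl_before - 1) * 100

# Q4_K_HIFI without imatrix vs. with imatrix, as quoted above.
change = ppl_change_percent(9.0847, 9.1393)
print(f"{change:+.1f}%")  # +0.6%
```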

⚠️ **Q5_K_HIFI provides zero advantage at 14B**:  
- Quality is *worse* than Q5_K_M with imatrix (+0.61% vs +0.59%)  
- Costs +467 MiB memory (+4.8% overhead) and requires custom build  
- **Skip it entirely** — Q5_K_M is strictly superior for production use

⚠️ **All Q3_K variants are production-ready** — even Q3_K_S with imatrix (+6.5% loss) remains usable, a dramatic improvement over smaller scales where Q3 often fails.  
- Q3_K_HIFI without imatrix: +2.6% loss (excellent)  
- Q3_K_M with imatrix: +2.9% loss (excellent)  
- This is the smallest scale where Q3 quantization is reliably viable

⚠️ **imatrix impact is minimal at 14B** — Unlike smaller models where imatrix recovers 60–78% of lost precision, at 14B the gains are modest (0.1–2.6%):  
- Q5_K variants: +1.1–1.3% improvement  
- Q4_K_M: +0.1% improvement (negligible)  
- Q4_K_S: +0.5% improvement  
- Q3_K_HIFI: +0.1% improvement (effectively no change; already near-lossless)

---

## Memory Budget Guide

| Available VRAM | Recommended Variant | Expected Quality | Why |
|----------------|---------------------|------------------|-----|
| **< 6.5 GiB** | Q3_K_S + imatrix | PPL 9.60, +6.5% loss | Only option that fits; quality acceptable for non-critical tasks |
| **6.5 – 8.2 GiB** | Q3_K_M + imatrix | PPL 9.28, +2.9% loss ✅ | Best Q3 balance; production-ready quality |
| **8.2 – 10.1 GiB** | Q4_K_M + imatrix | PPL 9.12, +1.2% loss ✅ | Best overall balance; standard compatibility |
| **10.1 – 12.0 GiB** | Q5_K_M + imatrix | PPL 9.07, +0.59% loss ✅ | Near-lossless quality; best precision available |
| **> 12.0 GiB** | Q5_K_M + imatrix or F16 | PPL 9.07 or 9.01 | F16 only if absolute precision required |
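
The table can be expressed as a simple lookup. The sketch below copies the thresholds from the table and makes one assumption of its own: a budget that sits exactly on a boundary (for example 8.2 GiB) is assigned to the higher tier. It does not check whether the budget is large enough for any variant to fit at all.

```python
# Upper bounds (GiB) and recommendations copied from the table above.
BUDGET_TABLE = [
    (6.5, "Q3_K_S + imatrix"),
    (8.2, "Q3_K_M + imatrix"),
    (10.1, "Q4_K_M + imatrix"),
    (12.0, "Q5_K_M + imatrix"),
]
ABOVE_TABLE = "Q5_K_M + imatrix or F16"

def recommend_variant(available_vram_gib: float) -> str:
    """Return the recommended variant for a given memory budget in GiB."""
    for upper_bound, variant in BUDGET_TABLE:
        if available_vram_gib < upper_bound:
            return variant
    return ABOVE_TABLE

for budget in (6.0, 8.0, 10.0, 11.0, 24.0):
    print(f"{budget:>5.1f} GiB -> {recommend_variant(budget)}")
```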

---

## Cross-Bit Performance Comparison

| Priority | Q3_K Best | Q4_K Best | Q5_K Best | Winner |
|----------|-----------|-----------|-----------|--------|
| **Quality (with imat)** | Q3_K_HIFI (+2.5%) | Q4_K_M (+1.2%) | **Q5_K_M (+0.59%)** ✅ | **Q5_K_M** |
| **Quality (no imat)** | Q3_K_HIFI (+2.6%) | **Q4_K_HIFI (+0.8%)** ✅ | Q5_K_S (+1.84%) | **Q4_K_HIFI** |
| **Speed** | **Q3_K_S (91.32 TPS)** ✅ | Q4_K_S (76.34 TPS) | Q5_K_S (65.40 TPS) | **Q3_K_S** |
| **Smallest Size** | **Q3_K_S (6.19 GiB)** ✅ | Q4_K_S (7.98 GiB) | Q5_K_S (9.33 GiB) | **Q3_K_S** |
| **Best Balance** | Q3_K_M + imat | **Q4_K_M + imat** ✅ | Q5_K_M + imat | **Q4_K_M** |

✅ = Recommended for general use  
⚠️ = Context-dependent (see warnings above)

---

## Scale-Specific Insights: Why 14B Quantizes So Well

1. **Model redundancy threshold**: 14B represents the inflection point where parameter count provides sufficient redundancy that quantization errors average out rather than accumulating. Below 8B, quality degrades more rapidly; above 14B, gains plateau.

2. **Q3_K viability threshold**: 14B is the smallest scale where **Q3_K_HIFI achieves truly production-ready quality** (+2.5% with imatrix). At 8B, Q3_K_HIFI is +3.5%; at 4B, +5.9%; at 1.7B, +3.4% but with much higher baseline PPL.

3. **imatrix diminishing returns**: At 14B, imatrix effectiveness plateaus — Q3_K_HIFI improves by only 0.1%, Q4_K_M by 0.1%, Q5_K variants by 1.1–1.3%. This contrasts sharply with 0.6B (40–48% recovery) and 1.7B (60–78% recovery).

4. **Q4_K_HIFI paradox**: Unlike at 8B (where imatrix helps Q4_K_HIFI by -1.1%) or 32B (where it helps by -0.7%), at 14B imatrix *harms* Q4_K_HIFI (+0.6%). This demonstrates non-linear scale effects in quantization behavior.

5. **Q5_K_HIFI irrelevance**: At 14B, residual quantization provides no measurable benefit — the model's inherent robustness makes the extra precision unnecessary. This changes at 32B where Q5_K_HIFI + imatrix achieves F16-equivalence.

---

## Decision Flowchart

```text
Need best quality?
├─ Yes → Q5_K_M + imatrix (+0.59% loss)
└─ No → Need smallest size or highest speed?
     ├─ Yes → Memory < 8 GiB?
     │        ├─ Yes → Q3_K_S + imatrix (6,339 MiB, +6.5% loss)
     │        └─ No  → Q4_K_S + imatrix (8,172 MiB, +1.4% loss, 76.34 TPS)
     └─ No  → Q4_K_M + imatrix (best balance, +1.2% loss, standard build)
```
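
The same decision tree can be written as a small function. This is only a sketch of the flowchart above; the function and parameter names are illustrative.

```python
def choose_variant(need_best_quality: bool,
                   need_small_or_fast: bool,
                   memory_gib: float) -> str:
    """Walk the decision flowchart and return the suggested variant."""
    if need_best_quality:
        return "Q5_K_M + imatrix"
    if need_small_or_fast:
        # Under 8 GiB only the smallest Q3 file is suggested.
        if memory_gib < 8:
            return "Q3_K_S + imatrix"
        return "Q4_K_S + imatrix"
    return "Q4_K_M + imatrix"

print(choose_variant(need_best_quality=False, need_small_or_fast=True, memory_gib=6))
```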

---

## Practical Deployment Recommendations

### For Most Users
**→ Q4_K_M + imatrix**  
Delivers excellent quality (+1.2% vs F16), strong speed (72.89 TPS), compact size (8.38 GiB), and universal llama.cpp compatibility. The safe, practical choice for 90% of deployments.

### For Quality-Critical Work
**→ Q5_K_M + imatrix**  
Achieves near-lossless quantization (+0.59% vs F16) with 64% memory reduction and 2.5× speedup. Standard compatibility makes it preferable to Q5_K_HIFI, which offers no advantage.

### For Edge/Mobile Deployment
**→ Q3_K_M + imatrix**  
Best Q3 quality on a standard llama.cpp build (+2.9% vs F16) with a small 6,973 MiB footprint. Still usable without imatrix (+5.7% loss), which is valuable for environments where imatrix generation isn't feasible.

### For High-Throughput Serving
**→ Q3_K_S + imatrix**  
Fastest variant (91.32 TPS, +255% vs F16) with acceptable quality (+6.5% loss). Ideal when every TPS matters and marginal quality differences are acceptable.

### For Research on Quantization Limits
**→ Q3_K_HIFI + imatrix**  
Demonstrates that 3-bit quantization can achieve near-lossless quality (+2.5% loss) on sufficiently large models. Valuable for characterizing the lower bounds of viable quantization.

---

## Bottom Line Recommendations

| Scenario | Recommended Variant | Rationale |
|----------|---------------------|-----------|
| **Default / General Purpose** | Q4_K_M + imatrix | Best balance of quality, speed, size, and compatibility |
| **Maximum Quality** | Q5_K_M + imatrix | Near-lossless (+0.59% vs F16) with standard toolchain |
| **Minimum Size** | Q3_K_S + imatrix | Smallest footprint (6.19 GiB) with acceptable quality |
| **Maximum Speed** | Q3_K_S + imatrix | Fastest (91.32 TPS) at 3.5× F16 speed |
| **No imatrix available** | Q4_K_HIFI (no imat) | Best quality without imatrix (+0.8% vs F16) |
| **Extreme constraints** | Q3_K_S + imatrix | Only if memory < 8 GiB; +6.5% loss acceptable |

⚠️ **Golden rules for 14B**:  
1. **Never use imatrix with Q4_K_HIFI** — it degrades quality  
2. **Skip Q5_K_HIFI entirely** — no advantage over Q5_K_M  
3. **All three bit widths are viable** — choose based on constraints, not quality cliffs  
4. **Q3_K is production-ready** — the first scale where 3-bit quantization reliably works

✅ **14B is the quantization resilience milestone**: Large enough for robustness across all bit widths, small enough for dramatic efficiency gains. This scale demonstrates that intelligent quantization can deliver near-F16 quality at 1/3 the memory with 2.5–3.5× speed — a compelling value proposition for nearly all deployments.

## Non-technical model analysis and rankings

**NOTE:** This analysis does not include the HIFI models.

There are two good candidates: **Qwen3-14B-f16:Q3_K_S** and **Qwen3-14B-f16:Q5_K_S**. Both perform well across the full temperature range and on all question types.

Another good option would be **Qwen3-14B-f16:Q3_K_M**, with good finishes across the temperature range.

**Qwen3-14B-f16:Q2_K** got very good results and would have been a 1st or 2nd place candidate, but it was the only model to fail the 'hello' question, which it should have passed.

You can read the results here: [Qwen3-14b-analysis.md](Qwen3-14b-analysis.md)

If you find this useful, please give the project a ❤️ like.

## Non-HIFI recommendation table based on output

| Level     | Speed     | Size        | Recommendation                                                                                                       |
|-----------|-----------|-------------|----------------------------------------------------------------------------------------------------------------------|
| Q2_K      | ⚡ Fastest | 5.75 GB     | An excellent option but it failed the 'hello' test. Use with caution.                                                |
| 🥇 Q3_K_S | ⚡ Fast    | 6.66 GB     | 🥇 **Best overall model.** Two first places and two 3rd places. Excellent results across the full temperature range. |
| 🥉 Q3_K_M | ⚡ Fast    | 7.32 GB     | 🥉 A good option - it came 1st and 3rd, covering both ends of the temperature range.                                 |
| Q4_K_S    | 🚀 Fast   | 8.57 GB     | Not recommended, two 2nd places in low temperature questions with no other appearances.                              |
| Q4_K_M    | 🚀 Fast   | 9.00 GB     | Not recommended. A single 3rd place with no other appearances.                                                       |
| 🥈 Q5_K_S | 🐢 Medium | 10.3 GB     | 🥈 A very good second place option. A top 3 finisher across the full temperature range.                               |
| Q5_K_M    | 🐢 Medium | 10.5 GB     | Not recommended. A single 3rd place with no other appearances.                                                       |
| Q6_K      | 🐌 Slow   | 12.1 GB     | Not recommended. No top 3 finishes at all.                                                                           |
| Q8_0      | 🐌 Slow   | 15.7 GB     | Not recommended. A single 2nd place with no other appearances.                                                       |

## Build notes

All of these models were built using these commands:

```bash
mkdir build
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON -DGGML_AVX=ON -DGGML_AVX2=ON -DGGML_CUDA=ON -DGGML_VULKAN=OFF -DLLAMA_CURL=OFF
cmake --build build --config Release -j 
```

**NOTE:** Vulkan support is deliberately turned off here because Vulkan performance was much worse. If you want Vulkan support, you can rebuild llama.cpp yourself with `-DGGML_VULKAN=ON`.

The HIFI quantization also used a very large 4697-chunk imatrix file for extra precision. You can re-use it here: [Qwen3-14B-f16-imatrix-4697-generic.gguf](https://huggingface.co/geoffmunn/Qwen3-14B-f16/blob/main/Qwen3-14B-f16-imatrix-4697-generic.gguf)

The imatrix was created from a generic mix of Wikipedia, mathematics, and coding examples.

### Source code

You can use the HIFI GitHub repository to build it from source if you're interested: [https://github.com/geoffmunn/llama.cpp](https://github.com/geoffmunn/llama.cpp).

Build notes: [HIFI_BUILD_GUIDE.md](https://github.com/geoffmunn/llama.cpp/blob/master/HIFI_BUILD_GUIDE.md)

Improvements and feedback are welcome.

## Usage

Load this model using:
- [OpenWebUI](https://openwebui.com) – self-hosted AI interface with RAG & tools
- [LM Studio](https://lmstudio.ai) – desktop app with GPU support and chat templates
- [GPT4All](https://gpt4all.io) – private, local AI chatbot (offline-first)
- Or directly via `llama.cpp`

Each quantized model includes its own `README.md` and shares a common `MODELFILE` for optimal configuration.

Importing directly into Ollama should work, but you might encounter this error: `Error: invalid character '<' looking for beginning of value`.
In this case try these steps:

1. `wget https://huggingface.co/geoffmunn/Qwen3-14B-f16/resolve/main/Qwen3-14B-f16%3AQ3_K_S.gguf` (replace the quantised version with the one you want)
2. `nano Modelfile` and enter these details (again, replacing Q3_K_S with the version you want):
```text
FROM ./Qwen3-14B-f16:Q3_K_S.gguf

# Chat template using ChatML (used by Qwen)
SYSTEM You are a helpful assistant

TEMPLATE "{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>

# Default sampling
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER min_p 0.0
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 4096
```

The `num_ctx` value has been lowered to 4096 to increase speed significantly.

3. Then run this command: `ollama create Qwen3-14B-f16:Q3_K_S -f Modelfile`

You will now see "Qwen3-14B-f16:Q3_K_S" in your Ollama model list.

These import steps are also useful if you want to customise the default parameters or system prompt.
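
To see what the `TEMPLATE` in the Modelfile produces, here is a rough Python approximation of the ChatML prompt it renders. Ollama does the actual rendering with its own template engine; the `render_chatml` function below is only an illustration and is not part of Ollama.

```python
def render_chatml(prompt: str, system: str = "") -> str:
    """Approximate the ChatML prompt built by the Modelfile TEMPLATE."""
    parts = []
    if system:
        # Mirrors the {{ if .System }} ... {{ end }} branch of the template.
        parts.append(f"<|im_start|>system\n{system}<|im_end|>")
    parts.append(
        f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
    )
    return "".join(parts)

print(render_chatml("What is the capital of France?",
                    system="You are a helpful assistant"))
```

The model continues generating after the final `<|im_start|>assistant` line, and the two `PARAMETER stop` entries end the reply when it emits either ChatML marker.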

## Author

👤 Geoff Munn (@geoffmunn)  
🔗 [Hugging Face Profile](https://huggingface.co/geoffmunn)

## Disclaimer

This is a community conversion for local inference. Not affiliated with Alibaba Cloud or the Qwen team.