---
license: apache-2.0
tags:
  - gguf
  - qwen
  - qwen3-14b
  - qwen3-14b-q3
  - qwen3-14b-q3_k_m
  - qwen3-14b-q3_k_m-gguf
  - llama.cpp
  - quantized
  - text-generation
  - chat
  - reasoning
  - agent
  - multilingual
base_model: Qwen/Qwen3-14B
author: geoffmunn
---

# Qwen3-14B-f16:Q3_K_M

Quantized version of [Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B) at **Q3_K_M** level, derived from **f16** base weights.

## Model Info

- **Format**: GGUF (for llama.cpp and compatible runtimes)
- **Size**: 7.32 GB
- **Precision**: Q3_K_M
- **Base Model**: [Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B)
- **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)

## Quality & Performance

| Metric             | Value                                                                                |
|--------------------|--------------------------------------------------------------------------------------|
| **Speed**          | ⚡ Fast                                                                               |
| **RAM Required**   | ~10.7 GB                                                                             |
| **Recommendation** | 🥉 A good option - it came 1st and 3rd, covering both ends of the temperature range.  |

## Prompt Template (ChatML)

This model uses the **ChatML** format used by Qwen:

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```

Set this in your app (LM Studio, OpenWebUI, etc.) for best results.
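
Most runtimes apply this template for you, but if you ever need to assemble the prompt by hand, a minimal sketch could look like the following. The function name `build_chatml_prompt` is hypothetical and only illustrates the single-turn layout shown above.

```python
def build_chatml_prompt(user_prompt: str,
                        system: str = "You are a helpful assistant.") -> str:
    """Assemble a single-turn ChatML prompt (hypothetical helper, not part of any library)."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
        # The prompt ends with an open assistant turn so the model continues from here.
        "<|im_start|>assistant\n"
    )


print(build_chatml_prompt("What is GGUF?"))
```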

## Generation Parameters

### Thinking Mode (Recommended for Logic)
Use when solving math, coding, or logical problems.

| Parameter      | Value |
|----------------|-------|
| Temperature    | 0.6   |
| Top-P          | 0.95  |
| Top-K          | 20    |
| Min-P          | 0.0   |
| Repeat Penalty | 1.1   |

> ❗ DO NOT use greedy decoding β€” it causes infinite loops.

Enable via:
- `enable_thinking=True` when applying the tokenizer's chat template
- Or add `/think` to the user input during a conversation

### Non-Thinking Mode (Fast Dialogue)
For casual chat and quick replies.

| Parameter      | Value |
|----------------|-------|
| Temperature    | 0.7   |
| Top-P          | 0.8   |
| Top-K          | 20    |
| Min-P          | 0.0   |
| Repeat Penalty | 1.1   |

Enable via:
- `enable_thinking=False`
- Or add `/no_think` in prompt

Stop sequences: `<|im_end|>`, `<|im_start|>`
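
One way to keep the two parameter sets straight in client code is a small preset table. This is only a sketch: the names `SAMPLING_PRESETS` and `sampling_options` are hypothetical, and the key names assume a runtime that accepts Ollama-style option names (`temperature`, `top_p`, `top_k`, `min_p`, `repeat_penalty`, `stop`). Rename them for other runtimes.

```python
# Values copied from the two tables above; the structure itself is an illustration.
SAMPLING_PRESETS = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20,
                 "min_p": 0.0, "repeat_penalty": 1.1},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20,
                     "min_p": 0.0, "repeat_penalty": 1.1},
}
STOP_SEQUENCES = ["<|im_end|>", "<|im_start|>"]


def sampling_options(mode: str) -> dict:
    """Return a fresh options dict for 'thinking' or 'non_thinking' mode."""
    if mode not in SAMPLING_PRESETS:
        raise ValueError(f"unknown mode: {mode!r}")
    options = dict(SAMPLING_PRESETS[mode])  # copy so callers can modify it safely
    options["stop"] = list(STOP_SEQUENCES)
    return options
```

For example, `sampling_options("thinking")` yields the thinking-mode values plus both stop sequences.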

## 💡 Usage Tips

> This model supports two operational modes:
>
> ### 🔍 Thinking Mode (Recommended for Logic)
> Activate with `enable_thinking=True` or append `/think` to the prompt.
>
> - Ideal for: math, coding, planning, analysis
> - Use sampling: `temp=0.6`, `top_p=0.95`, `top_k=20`
> - Avoid greedy decoding
>
> ### ⚑ Non-Thinking Mode (Fast Chat)
> Use `enable_thinking=False` or `/no_think`.
>
> - Best for: casual conversation, quick answers
> - Sampling: `temp=0.7`, `top_p=0.8`
>
> ---
>
> 🔄 **Switch Dynamically**  
> In multi-turn chats, the last `/think` or `/no_think` directive takes precedence.
>
> 🔁 **Avoid Repetition**  
> Set `presence_penalty=1.5` if stuck in loops.
>
> 📏 **Use Full Context**  
> Allow up to 32,768 output tokens for complex tasks.
>
> 🧰 **Agent Ready**  
> Works with Qwen-Agent, MCP servers, and custom tools.
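
The "last directive wins" rule for `/think` and `/no_think` can be mirrored on the client side, for example to decide which sampling preset to send. The sketch below is an illustration only: `resolve_thinking_mode` is a hypothetical helper, and the model interprets the directives itself regardless of what the client does.

```python
import re

# Match "/think" or "/no_think" as a standalone token (not inside a URL or a longer word).
_DIRECTIVE = re.compile(r"(?<!\S)/(no_think|think)\b")


def resolve_thinking_mode(user_messages, default=True):
    """Return True if thinking mode is active after the given user turns."""
    enabled = default
    for message in user_messages:
        for match in _DIRECTIVE.finditer(message):
            # Later directives overwrite earlier ones, so the last one wins.
            enabled = match.group(1) == "think"
    return enabled
```

With the turns `["Prove this lemma /think", "Thanks, just chat now /no_think"]` the helper reports non-thinking mode, because the final directive is `/no_think`.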

## Customisation & Troubleshooting

Importing directly into Ollama should work, but you might encounter this error: `Error: invalid character '<' looking for beginning of value`.
In this case try these steps:

1. `wget https://huggingface.co/geoffmunn/Qwen3-14B-f16/resolve/main/Qwen3-14B-f16%3AQ3_K_M.gguf`
2. `nano Modelfile` and enter these details:
```text
FROM ./Qwen3-14B-f16:Q3_K_M.gguf
 
# Chat template using ChatML (used by Qwen)
SYSTEM You are a helpful assistant

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>

# Default sampling
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER min_p 0.0
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 4096
```

The `num_ctx` value is reduced to 4096 here, which speeds up inference significantly at the cost of a shorter context window.

3. Then run this command: `ollama create Qwen3-14B-f16:Q3_K_M -f Modelfile`

You will now see "Qwen3-14B-f16:Q3_K_M" in your Ollama model list.

These import steps are also useful if you want to customise the default parameters or system prompt.
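
If you customise these settings often, generating the Modelfile from a script may be convenient. The following is a rough sketch, not an official tool: `render_modelfile` and `CHATML_TEMPLATE` are hypothetical names, and the template string follows the ChatML layout shown earlier in this card, with each turn on its own line.

```python
# ChatML template in Ollama's Go-template syntax (an assumption based on the layout above).
CHATML_TEMPLATE = (
    "{{ if .System }}<|im_start|>system\n"
    "{{ .System }}<|im_end|>\n"
    "{{ end }}<|im_start|>user\n"
    "{{ .Prompt }}<|im_end|>\n"
    "<|im_start|>assistant\n"
)


def render_modelfile(gguf_path, system="You are a helpful assistant", **params):
    """Return Modelfile text for a local GGUF file with the given PARAMETER values."""
    lines = [
        f"FROM {gguf_path}",
        f"SYSTEM {system}",
        f'TEMPLATE """{CHATML_TEMPLATE}"""',
        "PARAMETER stop <|im_start|>",
        "PARAMETER stop <|im_end|>",
    ]
    # Each keyword argument becomes one PARAMETER line, e.g. temperature=0.6.
    lines.extend(f"PARAMETER {name} {value}" for name, value in params.items())
    return "\n".join(lines) + "\n"


print(render_modelfile("./Qwen3-14B-f16:Q3_K_M.gguf",
                       temperature=0.6, top_p=0.95, top_k=20, num_ctx=4096))
```

Write the returned text to a file named `Modelfile` and pass it to `ollama create` as in step 3.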

## 🖥️ CLI Example Using Ollama or TGI Server

Here's how you can query this model via API using `curl` and `jq`. Replace the endpoint with your local server.

```bash
curl http://localhost:11434/api/generate -s -N -d '{
  "model": "hf.co/geoffmunn/Qwen3-14B-f16:Q3_K_M",
  "prompt": "Respond exactly as follows: Summarize what a neural network is in one sentence.",
  "stream": false,
  "options": {
    "temperature": 0.3,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "repeat_penalty": 1.1
  }
}' | jq -r '.response'
```

🎯 **Why this works well**:
- The prompt is meaningful and achievable for this model size.
- Temperature tuned appropriately: lower for factual (`0.5`), higher for creative (`0.7`).
- Uses `jq` to extract clean output.
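
The same request can be sent from Python using only the standard library. This is a sketch under assumptions: it targets Ollama's `/api/generate` endpoint on the default local port, passes the sampling values inside the `options` object, and uses the hypothetical helper names `build_payload` and `generate`.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumption: default local Ollama endpoint
MODEL = "hf.co/geoffmunn/Qwen3-14B-f16:Q3_K_M"


def build_payload(prompt, temperature=0.3):
    """Build the JSON body for a non-streaming generate request."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": temperature,
            "top_p": 0.95,
            "top_k": 20,
            "min_p": 0.0,
            "repeat_penalty": 1.1,
        },
    }


def generate(prompt, temperature=0.3, url=OLLAMA_URL):
    """POST the payload and return the generated text (requires a running server)."""
    data = json.dumps(build_payload(prompt, temperature)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `generate("Summarize what a neural network is in one sentence.")` plays the role of the `curl ... | jq -r '.response'` pipeline.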

## Verification

Check integrity:

```bash
sha256sum -c ../SHA256SUMS.txt
```
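
On systems without `sha256sum`, a comparable check can be sketched in Python. The helper names below are hypothetical, and the parser assumes the usual `sha256sum` line format of a hex digest followed by a file name. Like `sha256sum -c`, file names are resolved relative to a base directory (the current directory by default), not relative to the checksum file.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-gigabyte GGUF files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_sums(text):
    """Parse '<hex digest>  <file name>' lines into a {name: digest} mapping."""
    sums = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, name = line.split(maxsplit=1)
        # A leading '*' marks binary mode in sha256sum output; it is not part of the name.
        sums[name.strip().lstrip("*")] = digest.lower()
    return sums


def verify(sums_file, base_dir="."):
    """Return {file name: True/False} for every entry in the checksum file."""
    expected = parse_sums(Path(sums_file).read_text())
    return {name: sha256_of(Path(base_dir) / name) == digest
            for name, digest in expected.items()}
```

For example, `verify("../SHA256SUMS.txt")` run from the model directory reports which listed files match their recorded digests.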

## Usage

Compatible with:
- [LM Studio](https://lmstudio.ai) – local AI model runner
- [OpenWebUI](https://openwebui.com) – self-hosted AI interface
- [GPT4All](https://gpt4all.io) – private, offline AI chatbot
- Directly via `llama.cpp`

## License

Apache 2.0 – see base model for full terms.