	---
	license: apache-2.0
	tags:
	- gguf
	- qwen
	- qwen3-14b
	- qwen3-14b-q3
	- qwen3-14b-q3_k_s
	- qwen3-14b-q3_k_s-gguf
	- llama.cpp
	- quantized
	- text-generation
	- chat
	- reasoning
	- agent
	- multilingual
	base_model: Qwen/Qwen3-14B
	author: geoffmunn
	---

	# Qwen3-14B-f16:Q3_K_S

	Quantized version of [Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B) at Q3_K_S level, derived from f16 base weights.

	## Model Info

	- Format: GGUF (for llama.cpp and compatible runtimes)
	- Size: 6.66 GB
	- Precision: Q3_K_S
	- Base Model: [Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B)
	- Conversion Tool: [llama.cpp](https://github.com/ggerganov/llama.cpp)

	## Quality & Performance

	| Metric | Value |
	|--------|-------|
	| Speed | ⚡ Fast |
	| RAM Required | ~10.2 GB |
	| Recommendation | 🥇 Best overall model. Two first places and two 3rd places. Excellent results across the full temperature range. |

	## Prompt Template (ChatML)

	This model uses Qwen's ChatML prompt format:

	```text
	<|im_start|>system
	You are a helpful assistant.<|im_end|>
	<|im_start|>user
	{prompt}<|im_end|>
	<|im_start|>assistant
	```

	Set this in your app (LM Studio, OpenWebUI, etc.) for best results.
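
For runtimes that take a raw prompt string instead of a chat message list, the template above can be rendered by hand. The following is a minimal Python sketch; the function name `build_chatml_prompt` is hypothetical, and it only covers a single user turn.

```python
def build_chatml_prompt(user_prompt: str,
                        system: str = "You are a helpful assistant.") -> str:
    """Render a single-turn ChatML prompt that ends with an open assistant turn."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


# Print the rendered prompt for a sample question.
print(build_chatml_prompt("What is the capital of France?"))
```

Most chat front ends apply this template automatically, so a helper like this is only needed when you send raw completions.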

	## Generation Parameters

	### Thinking Mode (Recommended for Logic)
	Use when solving math, coding, or logical problems.

	| Parameter | Value |
	|----------------|-------|
	| Temperature | 0.6 |
	| Top-P | 0.95 |
	| Top-K | 20 |
	| Min-P | 0.0 |
	| Repeat Penalty | 1.1 |

	> ❗ DO NOT use greedy decoding; it can cause endless repetition.

	Enable via:
	- `enable_thinking=True` when applying the chat template (for example in `tokenizer.apply_chat_template`)
	- Or add `/think` to the user input during a conversation

	### Non-Thinking Mode (Fast Dialogue)
	For casual chat and quick replies.

	| Parameter | Value |
	|----------------|-------|
	| Temperature | 0.7 |
	| Top-P | 0.8 |
	| Top-K | 20 |
	| Min-P | 0.0 |
	| Repeat Penalty | 1.1 |

	Enable via:
	- `enable_thinking=False` when applying the chat template
	- Or add `/no_think` to the prompt

	Stop sequences: `<|im_end|>`, `<|im_start|>`
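
The two presets and the stop sequences can be kept side by side in code and selected per request. This is a minimal Python sketch: the values are copied from the tables above, while the key names (`temperature`, `top_p`, `top_k`, `min_p`, `repeat_penalty`, `stop`) are assumed to match the option names of your runtime, and `sampling_options` is a hypothetical helper.

```python
# Sampling presets copied from the two tables above.
THINKING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20,
            "min_p": 0.0, "repeat_penalty": 1.1}
NON_THINKING = {"temperature": 0.7, "top_p": 0.8, "top_k": 20,
                "min_p": 0.0, "repeat_penalty": 1.1}

# Stop sequences listed for both modes.
STOP = ["<|im_end|>", "<|im_start|>"]


def sampling_options(thinking: bool) -> dict:
    """Return a fresh options dict for the requested mode."""
    preset = THINKING if thinking else NON_THINKING
    return {**preset, "stop": list(STOP)}


print(sampling_options(thinking=True))
```

Returning a copy keeps the presets themselves unchanged if a caller tweaks one value for a single request.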

	## 💡 Usage Tips

	> This model supports two operational modes:
	>
	> ### 🔍 Thinking Mode (Recommended for Logic)
	> Activate with `enable_thinking=True` or append `/think` to the prompt.
	>
	> - Ideal for: math, coding, planning, analysis
	> - Use sampling: `temp=0.6`, `top_p=0.95`, `top_k=20`
	> - Avoid greedy decoding
	>
	> ### ⚡ Non-Thinking Mode (Fast Chat)
	> Use `enable_thinking=False` or `/no_think`.
	>
	> - Best for: casual conversation, quick answers
	> - Sampling: `temp=0.7`, `top_p=0.8`
	>
	> ---
	>
	> 🔄 Switch Dynamically
	> In multi-turn chats, the last `/think` or `/no_think` directive takes precedence.
	>
	> 🔁 Avoid Repetition
	> Set `presence_penalty=1.5` if stuck in loops.
	>
	> 📏 Use Full Context
	> Allow up to 32,768 output tokens for complex tasks.
	>
	> 🧰 Agent Ready
	> Works with Qwen-Agent, MCP servers, and custom tools.
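
The "last directive takes precedence" rule can be mirrored on the client side, for example to pick the matching sampling preset before sending a request. The sketch below is only an illustration of that rule in Python; the model's chat template does the real switching, and `resolve_thinking` and the message format (a list of `role`/`content` dicts) are assumptions.

```python
import re

# Matches "/think" or "/no_think" as a standalone token in a message.
_DIRECTIVE = re.compile(r"(?<!\S)/(no_think|think)\b")


def resolve_thinking(messages, default=True):
    """Return True if thinking mode is active after the last directive seen."""
    mode = default
    for message in messages:
        if message.get("role") != "user":
            continue  # only user turns carry directives in this sketch
        for match in _DIRECTIVE.finditer(message.get("content", "")):
            mode = match.group(1) == "think"  # later directives override earlier ones
    return mode


history = [
    {"role": "user", "content": "Prove this lemma. /think"},
    {"role": "assistant", "content": "..."},
    {"role": "user", "content": "Thanks, now just chat. /no_think"},
]
print(resolve_thinking(history))
```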

	## Customisation & Troubleshooting

	Importing directly into Ollama should work, but you might encounter this error: `Error: invalid character '<' looking for beginning of value`.
	In this case try these steps:

	1. `wget https://huggingface.co/geoffmunn/Qwen3-14B-f16/resolve/main/Qwen3-14B-f16%3AQ3_K_S.gguf`
	2. `nano Modelfile` and enter these details:
	```text
	FROM ./Qwen3-14B-f16:Q3_K_S.gguf

	# Chat template using ChatML (used by Qwen)
	SYSTEM You are a helpful assistant

	TEMPLATE "{{ if .System }}<|im_start|>system
	{{ .System }}<|im_end|>{{ end }}<|im_start|>user
	{{ .Prompt }}<|im_end|>
	<|im_start|>assistant
	"
	PARAMETER stop <|im_start|>
	PARAMETER stop <|im_end|>

	# Default sampling
	PARAMETER temperature 0.6
	PARAMETER top_p 0.95
	PARAMETER top_k 20
	PARAMETER min_p 0.0
	PARAMETER repeat_penalty 1.1
	PARAMETER num_ctx 4096
	```

	The `num_ctx` value has been reduced to 4096 to increase speed significantly.

	3. Then run this command: `ollama create Qwen3-14B-f16:Q3_K_S -f Modelfile`

	You will now see "Qwen3-14B-f16:Q3_K_S" in your Ollama model list.

	These import steps are also useful if you want to customise the default parameters or system prompt.
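
If you customise the parameters often, the `FROM`, `SYSTEM` and `PARAMETER` lines can be generated rather than edited by hand. This is a rough Python sketch with hypothetical helper names; it deliberately leaves out the `TEMPLATE` block, which you would copy unchanged from the Modelfile above.

```python
def render_parameters(params: dict) -> str:
    """Render one PARAMETER line per value; list values repeat the name."""
    lines = []
    for name, value in params.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            lines.append(f"PARAMETER {name} {item}")
    return "\n".join(lines)


def render_modelfile(gguf_path: str, system: str, params: dict) -> str:
    """Render a Modelfile without the TEMPLATE block (add it separately)."""
    return f"FROM {gguf_path}\n\nSYSTEM {system}\n\n{render_parameters(params)}\n"


print(render_modelfile(
    "./Qwen3-14B-f16:Q3_K_S.gguf",
    "You are a helpful assistant",
    {"temperature": 0.7, "top_p": 0.8, "num_ctx": 4096,
     "stop": ["<|im_start|>", "<|im_end|>"]},
))
```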

	## 🖥️ CLI Example Using Ollama or TGI Server

	Here’s how you can query this model via API using `curl` and `jq`. Replace the endpoint with your local server.

	```bash
	# Ollama reads sampling settings from the "options" object, not the top level.
	curl http://localhost:11434/api/generate -s -N -d '{
	  "model": "hf.co/geoffmunn/Qwen3-14B-f16:Q3_K_S",
	  "prompt": "Respond exactly as follows: Summarize what a neural network is in one sentence.",
	  "stream": false,
	  "options": {
	    "temperature": 0.3,
	    "top_p": 0.95,
	    "top_k": 20,
	    "min_p": 0.0,
	    "repeat_penalty": 1.1
	  }
	}' | jq -r '.response'
	```

	🎯 Why this works well:
	- The prompt is meaningful and achievable for this model size.
	- Temperature is set low (`0.3`) for a factual answer; raise it (e.g. to `0.7`) for creative tasks.
	- Uses `jq` to extract clean output.
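
The same request can be made from Python with only the standard library. This is a minimal sketch that assumes an Ollama server on the default port; `build_payload` and `generate` are hypothetical helper names. Ollama's generate endpoint reads sampling settings from an `options` object, so the keyword arguments are placed there.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # replace with your endpoint
MODEL = "hf.co/geoffmunn/Qwen3-14B-f16:Q3_K_S"


def build_payload(prompt: str, model: str = MODEL, **options) -> dict:
    """Build a non-streaming request body with sampling settings under "options"."""
    return {"model": model, "prompt": prompt, "stream": False, "options": options}


def generate(prompt: str, url: str = OLLAMA_URL, **options) -> str:
    """POST the payload and return the generated text from the "response" field."""
    body = json.dumps(build_payload(prompt, **options)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With the server running, a call such as `generate("Summarize what a neural network is in one sentence.", temperature=0.3, top_p=0.95, top_k=20)` should return the model's reply as a string.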

	## Verification

	Check integrity:

	```bash
	sha256sum -c ../SHA256SUMS.txt
	```
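
On systems without `sha256sum`, the same check can be approximated in Python. The sketch below assumes the checksum file uses the usual `sha256sum` layout (hex digest, whitespace, file name) and that file names are relative to `base_dir`; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20) -> str:
    """Hash a file in chunks so large GGUF files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_sums(text: str) -> dict:
    """Map file name to expected hex digest from sha256sum-style lines."""
    sums = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        digest, name = line.split(maxsplit=1)
        sums[name.lstrip("*")] = digest.lower()  # "*" marks binary mode
    return sums


def verify(sums_file, base_dir=".") -> dict:
    """Return {file name: True/False} for every entry in the checksum file."""
    expected = parse_sums(Path(sums_file).read_text())
    return {name: sha256_of(Path(base_dir) / name) == digest
            for name, digest in expected.items()}
```

For example, `verify("../SHA256SUMS.txt")` run from the model folder checks each listed file against the current directory; a missing file raises `FileNotFoundError` in this sketch.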

	## Usage

	Compatible with:
	- [LM Studio](https://lmstudio.ai) – local AI model runner
	- [OpenWebUI](https://openwebui.com) – self-hosted AI interface
	- [GPT4All](https://gpt4all.io) – private, offline AI chatbot
	- Directly via `llama.cpp`

	## License

	Apache 2.0 – see base model for full terms.