| license: apache-2.0 | |
| tags: | |
| - gguf | |
| - qwen | |
| - qwen3-14b | |
| - qwen3-14b-q3 | |
| - qwen3-14b-q3_k_s | |
| - qwen3-14b-q3_k_s-gguf | |
| - llama.cpp | |
| - quantized | |
| - text-generation | |
| - chat | |
| - reasoning | |
| - agent | |
| - multilingual | |
| base_model: Qwen/Qwen3-14B | |
| author: geoffmunn | |
| # Qwen3-14B-f16:Q3_K_S | |
| Quantized version of [Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B) at **Q3_K_S** level, derived from **f16** base weights. | |
| ## Model Info | |
| - **Format**: GGUF (for llama.cpp and compatible runtimes) | |
| - **Size**: 6.66 GB | |
| - **Precision**: Q3_K_S | |
| - **Base Model**: [Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B) | |
| - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp) | |
| ## Quality & Performance | |
| | Metric | Value | | |
| |--------------------|----------------------------------------------------------------------------------------------------------------------| | |
| | **Speed** | ⚡ Fast | |
| | **RAM Required** | ~10.2 GB | |
| | **Recommendation** | 🥇 **Best overall model.** Two first places and two 3rd places. Excellent results across the full temperature range. | |
| ## Prompt Template (ChatML) | |
| This model uses the **ChatML** format used by Qwen: | |
| ```text | |
| <|im_start|>system | |
| You are a helpful assistant.<|im_end|> | |
| <|im_start|>user | |
| {prompt}<|im_end|> | |
| <|im_start|>assistant | |
| ``` | |
| Set this in your app (LM Studio, OpenWebUI, etc.) for best results. | |
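If your runtime does not apply the template for you, the prompt can be assembled by hand. The following is a minimal sketch for a single-turn prompt; the function name `build_chatml_prompt` is hypothetical and only mirrors the template shown above.

```python
def build_chatml_prompt(user_prompt: str,
                        system_prompt: str = "You are a helpful assistant.") -> str:
    """Render a single-turn ChatML prompt that ends with an open assistant turn."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


# Illustrative usage: the model continues from the open assistant turn.
prompt = build_chatml_prompt("What is the capital of France?")
print(prompt)
```

The trailing `<|im_start|>assistant` line is left open on purpose so that generation starts inside the assistant turn.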
| ## Generation Parameters | |
| ### Thinking Mode (Recommended for Logic) | |
| Use when solving math, coding, or logical problems. | |
| | Parameter | Value | | |
| |----------------|-------| | |
| | Temperature | 0.6 | | |
| | Top-P | 0.95 | | |
| | Top-K | 20 | | |
| | Min-P | 0.0 | | |
| | Repeat Penalty | 1.1 | | |
| > ❗ DO NOT use greedy decoding: it can cause endless repetition loops. | |
| Enable via: | |
| - `enable_thinking=True` passed to the tokenizer's `apply_chat_template` call | |
| - Or add `/think` in user input during conversation | |
| ### Non-Thinking Mode (Fast Dialogue) | |
| For casual chat and quick replies. | |
| | Parameter | Value | | |
| |----------------|-------| | |
| | Temperature | 0.7 | | |
| | Top-P | 0.8 | | |
| | Top-K | 20 | | |
| | Min-P | 0.0 | | |
| | Repeat Penalty | 1.1 | | |
| Enable via: | |
| - `enable_thinking=False` passed to the tokenizer's `apply_chat_template` call | |
| - Or add `/no_think` in prompt | |
| Stop sequences: `<|im_end|>`, `<|im_start|>` | |
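The two parameter tables can be kept together as presets so that client code picks one by mode. This is only a sketch: the names `SAMPLING_PRESETS`, `STOP_SEQUENCES` and `sampling_options` are hypothetical, and the key names follow the llama.cpp/Ollama style used elsewhere in this card, so rename them if your runtime expects different ones.

```python
# Values copied from the Thinking and Non-Thinking tables above.
SAMPLING_PRESETS = {
    "thinking": {
        "temperature": 0.6, "top_p": 0.95, "top_k": 20,
        "min_p": 0.0, "repeat_penalty": 1.1,
    },
    "non_thinking": {
        "temperature": 0.7, "top_p": 0.8, "top_k": 20,
        "min_p": 0.0, "repeat_penalty": 1.1,
    },
}
STOP_SEQUENCES = ["<|im_end|>", "<|im_start|>"]


def sampling_options(thinking: bool) -> dict:
    """Return a fresh options dict for the requested mode, including stop sequences."""
    preset = SAMPLING_PRESETS["thinking" if thinking else "non_thinking"]
    return {**preset, "stop": list(STOP_SEQUENCES)}


print(sampling_options(thinking=True))
```

Returning a copy each time means a caller can tweak one request's options without changing the shared presets.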
| ## 💡 Usage Tips | |
| > This model supports two operational modes: | |
| > | |
| > ### Thinking Mode (Recommended for Logic) | |
| > Activate with `enable_thinking=True` or append `/think` in prompt. | |
| > | |
| > - Ideal for: math, coding, planning, analysis | |
| > - Use sampling: `temp=0.6`, `top_p=0.95`, `top_k=20` | |
| > - Avoid greedy decoding | |
| > | |
| > ### ⚡ Non-Thinking Mode (Fast Chat) | |
| > Use `enable_thinking=False` or `/no_think`. | |
| > | |
| > - Best for: casual conversation, quick answers | |
| > - Sampling: `temp=0.7`, `top_p=0.8` | |
| > | |
| > --- | |
| > | |
| > **Switch Dynamically** | |
| > In multi-turn chats, the last `/think` or `/no_think` directive takes precedence. | |
| > | |
| > **Avoid Repetition** | |
| > Set `presence_penalty=1.5` if stuck in loops. | |
| > | |
| > **Use Full Context** | |
| > Allow up to 32,768 output tokens for complex tasks. | |
| > | |
| > 🧰 **Agent Ready** | |
| > Works with Qwen-Agent, MCP servers, and custom tools. | |
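The "last directive takes precedence" rule can be mirrored on the client side, for example to choose which sampling settings to send with a request. The sketch below is an illustration only: `effective_thinking_mode` is a hypothetical helper, it assumes OpenAI-style message dicts, and it assumes thinking is on when no directive has been given.

```python
import re

# Matches "/think" or "/no_think" as a standalone token in a user message.
_DIRECTIVE = re.compile(r"(?<!\S)/(no_think|think)(?!\S)")


def effective_thinking_mode(messages, default=True):
    """Return True if thinking mode is active after the last directive seen."""
    mode = default
    for message in messages:
        if message.get("role") != "user":
            continue  # only user turns carry directives in this sketch
        for match in _DIRECTIVE.finditer(message.get("content", "")):
            mode = match.group(1) == "think"
    return mode


history = [
    {"role": "user", "content": "Prove that 17 is prime. /think"},
    {"role": "assistant", "content": "17 has no divisors other than 1 and 17."},
    {"role": "user", "content": "Thanks, now just say hi. /no_think"},
]
print(effective_thinking_mode(history))
```

The model applies the directive itself; this helper only lets the client keep its sampling parameters in step with the mode the conversation is in.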
| ## Customisation & Troubleshooting | |
| Importing directly into Ollama should work, but you might encounter this error: `Error: invalid character '<' looking for beginning of value`. | |
| In this case try these steps: | |
| 1. `wget https://huggingface.co/geoffmunn/Qwen3-14B-f16/resolve/main/Qwen3-14B-f16%3AQ3_K_S.gguf` | |
| 2. `nano Modelfile` and enter these details: | |
| ```text | |
| FROM ./Qwen3-14B-f16:Q3_K_S.gguf | |
| # Chat template using ChatML (used by Qwen) | |
| SYSTEM You are a helpful assistant | |
| TEMPLATE "{{ if .System }}<|im_start|>system | |
| {{ .System }}<|im_end|>{{ end }}<|im_start|>user | |
| {{ .Prompt }}<|im_end|> | |
| <|im_start|>assistant | |
| " | |
| PARAMETER stop <|im_start|> | |
| PARAMETER stop <|im_end|> | |
| # Default sampling | |
| PARAMETER temperature 0.6 | |
| PARAMETER top_p 0.95 | |
| PARAMETER top_k 20 | |
| PARAMETER min_p 0.0 | |
| PARAMETER repeat_penalty 1.1 | |
| PARAMETER num_ctx 4096 | |
| ``` | |
| The `num_ctx` value has been lowered to 4096 to reduce memory use and speed up inference. Increase it if you need a longer context window. | |
| 3. Then run this command: `ollama create Qwen3-14B-f16:Q3_K_S -f Modelfile` | |
| You will now see "Qwen3-14B-f16:Q3_K_S" in your Ollama model list. | |
| These import steps are also useful if you want to customise the default parameters or system prompt. | |
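If you expect to regenerate the Modelfile with different defaults, a small script can render it. This is a sketch under stated assumptions: `render_modelfile` and `CHATML_TEMPLATE` are hypothetical names, the system prompt is assumed to fit on one line, and the template is wrapped in triple quotes, which is the multi-line string form shown in Ollama's Modelfile documentation.

```python
# Same ChatML template as in the Modelfile above.
CHATML_TEMPLATE = (
    "{{ if .System }}<|im_start|>system\n"
    "{{ .System }}<|im_end|>{{ end }}<|im_start|>user\n"
    "{{ .Prompt }}<|im_end|>\n"
    "<|im_start|>assistant\n"
)


def render_modelfile(gguf_path, system_prompt="You are a helpful assistant", **overrides):
    """Return Modelfile text; keyword arguments override the default PARAMETER values."""
    parameters = {
        "temperature": 0.6, "top_p": 0.95, "top_k": 20,
        "min_p": 0.0, "repeat_penalty": 1.1, "num_ctx": 4096,
    }
    parameters.update(overrides)
    lines = [
        f"FROM {gguf_path}",
        f"SYSTEM {system_prompt}",
        f'TEMPLATE """{CHATML_TEMPLATE}"""',
        "PARAMETER stop <|im_start|>",
        "PARAMETER stop <|im_end|>",
    ]
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    return "\n".join(lines) + "\n"


# Illustrative usage: raise the context window, keep everything else.
modelfile_text = render_modelfile("./Qwen3-14B-f16:Q3_K_S.gguf", num_ctx=8192)
print(modelfile_text)
```

Write the returned text to `Modelfile` and run the `ollama create` command from step 3 as before.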
| ## 🖥️ CLI Example Using Ollama or TGI Server | |
| Here's how you can query this model via API using `curl` and `jq`. Replace the endpoint with your local server. | |
| ```bash | |
| # Ollama reads sampling parameters from the "options" object. | |
| curl http://localhost:11434/api/generate -s -N -d '{ | |
|   "model": "hf.co/geoffmunn/Qwen3-14B-f16:Q3_K_S", | |
|   "prompt": "Respond exactly as follows: Summarize what a neural network is in one sentence.", | |
|   "stream": false, | |
|   "options": { | |
|     "temperature": 0.3, | |
|     "top_p": 0.95, | |
|     "top_k": 20, | |
|     "min_p": 0.0, | |
|     "repeat_penalty": 1.1 | |
|   } | |
| }' | jq -r '.response' | |
| ``` | |
| 🎯 **Why this works well**: | |
| - The prompt is meaningful and achievable for this model size. | |
| - The temperature is low (`0.3`) because the task is factual; raise it (for example to `0.7`) for creative tasks. | |
| - Uses `jq` to extract clean output. | |
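The same request can be sent from Python using only the standard library. This sketch assumes an Ollama server on the default port; `build_generate_request` and `generate` are hypothetical helper names. Ollama's `/api/generate` endpoint takes sampling parameters inside an `options` object, so the helper nests them there.

```python
import json
import urllib.request

MODEL = "hf.co/geoffmunn/Qwen3-14B-f16:Q3_K_S"


def build_generate_request(prompt, model=MODEL, **options):
    """Build the JSON payload; sampling parameters go under "options"."""
    return {"model": model, "prompt": prompt, "stream": False, "options": options}


def generate(prompt, endpoint="http://localhost:11434/api/generate", **options):
    """POST the payload to a running server and return the generated text."""
    body = json.dumps(build_generate_request(prompt, **options)).encode("utf-8")
    request = urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# Illustrative usage (requires a running server, so it is left commented out):
# print(generate("Summarize what a neural network is in one sentence.",
#                temperature=0.3, top_p=0.95, top_k=20))
payload = build_generate_request("Hello", temperature=0.3, top_k=20)
print(json.dumps(payload, indent=2))
```

Separating payload construction from the network call makes it easy to inspect or log exactly what is sent.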
| ## Verification | |
| Check integrity: | |
| ```bash | |
| sha256sum -c ../SHA256SUMS.txt | |
| ``` | |
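On systems without `sha256sum`, the same check can be sketched in Python. The helper names are hypothetical, and the code assumes the checksum file uses the usual `sha256sum` layout of one `<hash>  <filename>` pair per line, with file names resolved relative to `base_dir`.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large GGUF files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(sums_file, base_dir="."):
    """Return {filename: True/False} for every entry in a sha256sum-style file."""
    results = {}
    for line in Path(sums_file).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # "*" marks binary mode in sha256sum output
        results[name] = sha256_of(Path(base_dir) / name) == expected.lower()
    return results
```

For example, `verify_checksums("../SHA256SUMS.txt")` run from the directory holding the downloaded files checks the same entries as the command above; a missing file raises `FileNotFoundError` rather than being reported as a mismatch.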
| ## Usage | |
| Compatible with: | |
| - [LM Studio](https://lmstudio.ai): local AI model runner | |
| - [OpenWebUI](https://openwebui.com): self-hosted AI interface | |
| - [GPT4All](https://gpt4all.io): private, offline AI chatbot | |
| - Directly via `llama.cpp` | |
| ## License | |
| Apache 2.0; see the base model for full terms. | |