---
license: apache-2.0
tags:
  - gguf
  - qwen
  - qwen3
  - qwen3-14b
  - qwen3-14b-gguf
  - llama.cpp
  - quantized
  - text-generation
  - reasoning
  - agent
  - multilingual
  - imatrix
  - q3_hifi
base_model: Qwen/Qwen3-14B
author: geoffmunn
pipeline_tag: text-generation
language:
  - en
  - zh
  - es
  - fr
  - de
  - ru
  - ar
  - ja
  - ko
  - hi
---

# Qwen3-14B-f16-GGUF

This is a **GGUF-quantized version** of the **[Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B)** language model, a **14-billion-parameter** LLM built for deep reasoning, research-grade accuracy, and autonomous workflows.

Converted for use with `llama.cpp`, [LM Studio](https://lmstudio.ai), [OpenWebUI](https://openwebui.com), [GPT4All](https://gpt4all.io), and more.

**NEW:** I have added a custom quantization called **_Q3_HIFI_**, which is better than the standard _Q3_K_M_ model. It is higher quality, smaller in size, and nearly the same speed as _Q3_K_M_. It is listed under the 'f16' options because it's not an officially recognised type (at the moment).

## Q3_HIFI

**Pros:**

- 🏆 **Best quality** with the lowest perplexity of 9.38 (1.6% better than Q3_K_M, 3.4% better than Q3_K_S)
- 📦 **Smaller than Q3_K_M** (6.59 vs 6.81 GiB) while being significantly better quality
- 🎯 Uses intelligent layer-sensitive quantization (Q3_HIFI on sensitive layers, mixed q3_K/q4_K elsewhere)
- ⚡ **Slightly faster than Q3_K_M** (85.58 vs 85.40 TPS)

**Cons:**

- 🐢 **Slower than Q3_K_S**: 85.58 TPS, which is 6.5% slower
- 🔧 Custom quantization may have less community support

**Best for:** Production deployments where output quality matters, tasks requiring accuracy (reasoning, coding, complex instructions), or when you want the best quality-to-size ratio.

You can read more about how it compares to _Q3_K_M_ and _Q3_K_S_ here: [Q3_Quantization_Comparison.md](Q3_Quantization_Comparison.md)

You can also view a cross-model comparison of the Q3_HIFI type [here](https://github.com/geoffmunn/llama.cpp/blob/master/docs/quantization/Q3_HIFI.md).

## Available Quantizations (from f16)

| Level     | Speed      | Size    | Recommendation |
|-----------|------------|---------|----------------|
| Q2_K      | ⚡ Fastest | 5.75 GB | An excellent option, but it failed the 'hello' test. Use with caution. |
| 🥇 Q3_K_S | ⚡ Fast    | 6.66 GB | 🥇 **Best overall model.** Two 1st places and two 3rd places. Excellent results across the full temperature range. |
| 🥉 Q3_K_M | ⚡ Fast    | 7.32 GB | 🥉 A good option: it came 1st and 3rd, covering both ends of the temperature range. |
| Q4_K_S    | 🚀 Fast    | 8.57 GB | Not recommended. Two 2nd places in low-temperature questions with no other appearances. |
| Q4_K_M    | 🚀 Fast    | 9.00 GB | Not recommended. A single 3rd place with no other appearances. |
| 🥈 Q5_K_S | 🐢 Medium  | 10.3 GB | 🥈 A very good second-place option. A top 3 finisher across the full temperature range. |
| Q5_K_M    | 🐢 Medium  | 10.5 GB | Not recommended. A single 3rd place with no other appearances. |
| Q6_K      | 🐌 Slow    | 12.1 GB | Not recommended. No top 3 finishes at all. |
| Q8_0      | 🐌 Slow    | 15.7 GB | Not recommended. A single 2nd place with no other appearances. |

## Why Use a 14B Model?

The **Qwen3-14B** model delivers **serious intelligence in a locally runnable package**, offering near-flagship performance while remaining feasible to run on a single high-end consumer GPU or a well-equipped CPU setup. It is a strong choice when you need solid reasoning, robust code generation, and deep language understanding without relying on the cloud or massive infrastructure.

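
To make "locally runnable" more concrete, the file sizes in the quantization table above can be compared against the memory you have available. The sketch below is only a rough aid: the function name `quants_that_fit` and the 2 GB default headroom (for the KV cache and runtime overhead) are assumptions for illustration, not measured figures, and real memory use depends on context length and backend.

```python
# File sizes in GB, copied from the quantization table in this README.
QUANT_SIZES_GB = {
    "Q2_K": 5.75,
    "Q3_K_S": 6.66,
    "Q3_K_M": 7.32,
    "Q4_K_S": 8.57,
    "Q4_K_M": 9.00,
    "Q5_K_S": 10.3,
    "Q5_K_M": 10.5,
    "Q6_K": 12.1,
    "Q8_0": 15.7,
}


def quants_that_fit(memory_gb, headroom_gb=2.0):
    """Return the quantization levels whose file fits in memory, smallest first.

    headroom_gb is an assumed allowance for the KV cache and runtime
    overhead; adjust it for your own context length and backend.
    """
    budget = memory_gb - headroom_gb
    by_size = sorted(QUANT_SIZES_GB.items(), key=lambda item: item[1])
    return [name for name, size in by_size if size <= budget]


# Example: list the candidates for a 12 GB GPU.
print(quants_that_fit(12))
```

Fitting in memory is only a first filter; the recommendations in the table still decide which of the fitting files is worth downloading.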
### Highlights:

- **State-of-the-art performance among open 14B-class models**, excelling in reasoning, math, coding, and multilingual tasks
- **Efficient inference with quantization**: runs on a 24 GB GPU (e.g., RTX 4090) or even CPU with quantized GGUF/AWQ variants (~12–14 GB RAM usage)
- **Strong contextual handling**: supports long inputs and complex multi-step workflows, ideal for agentic or RAG-based systems
- **Fully open and commercially usable**, giving you full control over deployment and customization

### It’s ideal for:

- **Self-hosted AI assistants** that understand nuance, remember context, and generate high-quality responses
- **On-prem development environments** needing local code completion, documentation, or debugging
- **Private RAG or enterprise applications** requiring accuracy, reliability, and data sovereignty
- **Researchers and developers** seeking a powerful, open-weight alternative to closed 10B–20B models

Choose **Qwen3-14B** when you’ve outgrown 7B–8B models but still want to run efficiently offline, balancing capability, control, and cost without sacrificing quality.

## Build notes

All of these models (including _Q3_HIFI_) were built with a llama.cpp compiled using these commands:

```bash
mkdir build
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON -DGGML_AVX=ON -DGGML_AVX2=ON -DGGML_CUDA=ON -DGGML_VULKAN=OFF -DLLAMA_CURL=OFF
cmake --build build --config Release -j
```

**NOTE:** Vulkan support is specifically turned off here. Vulkan performance was much worse, so if you want Vulkan support you will need to rebuild yourself.

The quantisation for Q3_HIFI also used a 5000-chunk imatrix file for extra precision. You can re-use it here: [Qwen3-14B-f16-imatrix-5000.gguf](Qwen3-14B-f16-imatrix-5000.gguf)

If you're interested, you can build Q3_HIFI from source using the GitHub repository (use the `Q3_HIFI` branch): [https://github.com/geoffmunn/llama.cpp](https://github.com/geoffmunn/llama.cpp).

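
For anyone re-using the imatrix file, a quantization run might be assembled as in the sketch below. It is not the author's exact command. It assumes the `llama-quantize` binary from the build above sits in `build/bin`, that the `Q3_HIFI` branch accepts `Q3_HIFI` as a type name, and that the input and output file names (hypothetical placeholders) match your own files. The helper `quantize_command` is likewise illustrative; it only builds the argument list and does not run anything.

```python
import shlex


def quantize_command(f16_model, output, quant_type, imatrix=None,
                     binary="./build/bin/llama-quantize"):
    """Assemble the argument list for one llama-quantize run."""
    argv = [binary]
    if imatrix is not None:
        # --imatrix supplies the importance matrix that guides quantization
        argv += ["--imatrix", imatrix]
    # Positional arguments: input model, output model, quantization type
    argv += [f16_model, output, quant_type]
    return argv


# Model file names here are hypothetical placeholders.
cmd = quantize_command(
    "Qwen3-14B-f16.gguf",
    "Qwen3-14B-f16-Q3_HIFI.gguf",
    "Q3_HIFI",  # assumed type name in the Q3_HIFI branch
    imatrix="Qwen3-14B-f16-imatrix-5000.gguf",
)

# Print a shell-ready command line for inspection before running it.
print(shlex.join(cmd))
```

To execute the command rather than print it, the list can be passed to `subprocess.run(cmd, check=True)`.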
## Model analysis and rankings

There are two good candidates: **Qwen3-14B-f16:Q3_K_S** and **Qwen3-14B-f16:Q5_K_M**. These cover the full range of temperatures and are good at all question types.

Another good option would be **Qwen3-14B-f16:Q3_K_M**, with good finishes across the temperature range.

**Qwen3-14B-f16:Q2_K** got very good results and would have been a 1st or 2nd place candidate, but it was the only model to fail the 'hello' question, which it should have passed.

You can read the results here: [Qwen3-14b-analysis.md](Qwen3-14b-analysis.md)

If you find this useful, please give the project a ❤️ like.

## Usage

Load this model using:

- [OpenWebUI](https://openwebui.com) – self-hosted AI interface with RAG & tools
- [LM Studio](https://lmstudio.ai) – desktop app with GPU support and chat templates
- [GPT4All](https://gpt4all.io) – private, local AI chatbot (offline-first)
- Or directly via `llama.cpp`

Each quantized model includes its own `README.md` and shares a common `MODELFILE` for optimal configuration.

Importing directly into Ollama should work, but you might encounter this error: `Error: invalid character '<' looking for beginning of value`.

In this case, try these steps:

1. `wget https://huggingface.co/geoffmunn/Qwen3-14B/resolve/main/Qwen3-14B-f16%3AQ3_K_S.gguf` (replace the quantised version with the one you want)
2. `nano Modelfile` and enter these details (again, replacing Q3_K_S with the version you want):

   ```text
   FROM ./Qwen3-14B-f16:Q3_K_S.gguf

   # Chat template using ChatML (used by Qwen)
   SYSTEM You are a helpful assistant

   TEMPLATE "{{ if .System }}<|im_start|>system
   {{ .System }}<|im_end|>{{ end }}<|im_start|>user
   {{ .Prompt }}<|im_end|>
   <|im_start|>assistant
   "
   PARAMETER stop <|im_start|>
   PARAMETER stop <|im_end|>

   # Default sampling
   PARAMETER temperature 0.6
   PARAMETER top_p 0.95
   PARAMETER top_k 20
   PARAMETER min_p 0.0
   PARAMETER repeat_penalty 1.1
   PARAMETER num_ctx 4096
   ```

   The `num_ctx` value has been reduced to 4096 to increase speed significantly.

3. Then run this command: `ollama create Qwen3-14B-f16:Q3_K_S -f Modelfile`

You will now see "Qwen3-14B-f16:Q3_K_S" in your Ollama model list.

These import steps are also useful if you want to customise the default parameters or system prompt.

## Author

👤 Geoff Munn (@geoffmunn)

🔗 [Hugging Face Profile](https://huggingface.co/geoffmunn)

## Disclaimer

This is a community conversion for local inference. Not affiliated with Alibaba Cloud or the Qwen team.