froggeric's picture
Add files using upload-large-folder tool
49dc750 verified
|
Raw
History Blame
5.07 kB

Qwen 3.6 Chat Template

A universally fixed Jinja chat template for Qwen 3.6 that serves as a drop-in upgrade for all inference engines (vLLM, llama.cpp, text-generation-webui, LM Studio, oMLX, etc). The official template continues to crash on C++ tool calls, struggles with the new preserve_thinking feature by spamming empty tags, is vulnerable to model hallucinations, and lacks a way to cleanly toggle thinking inline. This universal template handles all of that.

What's broken in the official template

  1. Tool calls crash on C++ engines. The official template uses Python's |items dictionary filter and |safe, neither of which exist in C++ Jinja runtimes (like those used by LM Studio or MLX). Any tool call triggers an out-of-bounds error. It also crashes if the arguments payload is returned as a raw string instead of an object.
  2. No "developer" role. Modern APIs sometimes send message.role == "developer". The official template raises an exception and dies.
  3. Empty preserve_thinking block spam. Qwen 3.6 introduces a preserve_thinking kwarg. If toggled on, the official template wraps every past turn in a <think></think> block, which means a non-reasoning turn wastes context tokens with <think>\n\n</think>.
  4. The </thinking> hallucination. The Qwen 3.6 LLM sometimes mistakenly generates </thinking> at the end of its reasoning block. The official parser expects strictly </think>, resulting in parsing failure and leaking <thinking> tokens into the chat.

What this template does

Universal tool arguments compatibility

Replaced |items iteration with direct dictionary key lookups. Swapped is sequence for is iterable (which strict C++ runtimes require). Removed |safe wrappers and safely map raw JSON fallback schemas so that primitive parameters (like booleans) serialize precisely to JSON standard true instead of crashing environments by generating Python-flavored titlecase "True".

"developer" role support

Intercepts "developer" messages and implicitly maps them to "system". No crash, no data loss.

Smarter preserve_thinking historical context

Now ON by default without any required kwargs! Instead of mindlessly generating empty XML tags for past turns, this template checks if the historical context actually contains reasoning (reasoning_content|trim|length > 0). Only then does it emit an active block into the chat cache, keeping context windows hyper-efficient. Furthermore, history is tied to the <|think_off|> override: disabling thinking in the prompt automatically sweeps older thinking blocks from the cache to drastically accelerate processing.

</thinking> Hallucination handling

During the assistant phase, the logic actively looks for boundary hallucinations. If Qwen generates </thinking>, this template dynamically splits on that literal instead of </think>, cleanly isolating tags seamlessly. If generation is interrupted mid-thought (max tokens/aborts) preventing a closing </think> tag from surfacing, the parser actively rescues the incomplete thought-stream instead of injecting invalid raw <think> pairs into the timeline.

Thinking toggle from any message

Drop <|think_on|> or <|think_off|> anywhere in a prompt. The template detects the tag, strips it iteratively without sequential state-bleeding so the model never sees it, and cascades the thinking state down to the generator prompt dynamically.

System: You are a coding assistant. <|think_off|>
User: Check the weather in Paris.

The tag disappears. The model answers fast, generating <think>\n\n</think>\n\n natively.

System: You are a coding assistant. <|think_on|>
User: Implement a red-black tree in Rust.

The model gets its <think>\n prompt and reasons deeply before answering.

Comparison

Feature Official This Fixed Template
Tool arguments work Crashes Fixed
|safe removed Crashes Fixed
"developer" role Missing Added
Thinking toggle None <|think_off|> anywhere
preserve_thinking Spams empty blocks Dynamic length checks
Tag extraction Fails on </thinking> Supports </thinking>

Installation

This template can be used anywhere standard HuggingFace Jinja templates are supported.

General (vLLM, llama.cpp, TextGen)

Simply replace your model's existing chat_template string in your tokenizer_config.json with the minified contents of this file, or load it as a custom template in your UI.

LM Studio

  1. Open LM Studio
  2. Go to the My Models tab (or the right-side panel in Chat)
  3. Select your Qwen 3.6 model
  4. Scroll to Prompt Template
  5. Delete the default template, paste this one in
  6. Save

oMLX

  1. Unload any chat_template_kwargs arguments you may have forced. It is handled by the template actively.
  2. Make sure you load the --jinja flag so the engine utilizes the custom parsing rules.
  3. Overwrite the chat_template.jinja source file locally.