File size: 5,074 Bytes
49dc750
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
# Qwen 3.6 Chat Template

A universally fixed Jinja chat template for Qwen 3.6 that serves as a drop-in upgrade for **all inference engines** (vLLM, llama.cpp, text-generation-webui, LM Studio, oMLX, etc). The official template continues to crash on C++ tool calls, struggles with the new `preserve_thinking` feature by spamming empty tags, is vulnerable to model hallucinations, and lacks a way to cleanly toggle thinking inline. This universal template handles all of that.

## What's broken in the official template

1. **Tool calls crash on C++ engines.** The official template uses Python's `|items` dictionary filter and `|safe`, neither of which exist in C++ Jinja runtimes (like those used by LM Studio or MLX). Any tool call triggers an out-of-bounds error. It also crashes if the arguments payload is returned as a raw string instead of an object.
2. **No `"developer"` role.** Modern APIs sometimes send `message.role == "developer"`. The official template raises an exception and dies.
3. **Empty `preserve_thinking` block spam.** Qwen 3.6 introduces a `preserve_thinking` kwarg. If toggled on, the official template wraps *every past turn* in a `<think></think>` block, which means a non-reasoning turn wastes context tokens with `<think>\n\n</think>`.
4. **The `</thinking>` hallucination.** The Qwen 3.6 LLM sometimes mistakenly generates `</thinking>` at the end of its reasoning block. The official parser expects strictly `</think>`, resulting in parsing failure and leaking `<thinking>` tokens into the chat.

## What this template does

### Universal tool arguments compatibility

Replaced `|items` iteration with direct dictionary key lookups. Swapped `is sequence` for `is iterable` (which strict C++ runtimes require). Removed `|safe` wrappers and safely map raw JSON fallback schemas so that primitive parameters (like booleans) serialize precisely to JSON standard `true` instead of crashing environments by generating Python-flavored titlecase `"True"`.

### `"developer"` role support

Intercepts `"developer"` messages and implicitly maps them to `"system"`. No crash, no data loss.

### Smarter `preserve_thinking` historical context

**Now ON by default without any required kwargs!** Instead of mindlessly generating empty XML tags for past turns, this template checks if the historical context actually contains reasoning `(reasoning_content|trim|length > 0)`. Only then does it emit an active block into the chat cache, keeping context windows hyper-efficient. Furthermore, history is tied to the `<|think_off|>` override: disabling thinking in the prompt automatically sweeps older thinking blocks from the cache to drastically accelerate processing.

### `</thinking>` Hallucination handling

During the assistant phase, the logic actively looks for boundary hallucinations. If Qwen generates `</thinking>`, this template dynamically splits on that literal instead of `</think>`, cleanly isolating tags seamlessly. If generation is interrupted mid-thought (max tokens/aborts) preventing a closing `</think>` tag from surfacing, the parser actively rescues the incomplete thought-stream instead of injecting invalid raw `<think>` pairs into the timeline.

### Thinking toggle from any message

Drop `<|think_on|>` or `<|think_off|>` anywhere in a prompt. The template detects the tag, strips it iteratively without sequential state-bleeding so the model never sees it, and cascades the thinking state down to the generator prompt dynamically.

```text
System: You are a coding assistant. <|think_off|>
User: Check the weather in Paris.
```

The tag disappears. The model answers fast, generating `<think>\n\n</think>\n\n` natively.

```text
System: You are a coding assistant. <|think_on|>
User: Implement a red-black tree in Rust.
```

The model gets its `<think>\n` prompt and reasons deeply before answering.

## Comparison

| Feature | Official | **This Fixed Template** |
|---|---|---|
| Tool arguments work | Crashes | **Fixed** |
| `\|safe` removed | Crashes | **Fixed** |
| `"developer"` role | Missing | **Added** |
| Thinking toggle | None | **`<\|think_off\|>` anywhere** |
| `preserve_thinking` | Spams empty blocks | **Dynamic length checks** |
| Tag extraction | Fails on `</thinking>` | **Supports `</thinking>`** |

## Installation

This template can be used anywhere standard HuggingFace Jinja templates are supported.

### General (vLLM, llama.cpp, TextGen)
Simply replace your model's existing `chat_template` string in your `tokenizer_config.json` with the minified contents of this file, or load it as a custom template in your UI.

### LM Studio
1. Open LM Studio
2. Go to the **My Models** tab (or the right-side panel in Chat)
3. Select your Qwen 3.6 model
4. Scroll to **Prompt Template**
5. Delete the default template, paste this one in
6. Save

### oMLX
1. Unload any `chat_template_kwargs` arguments you may have forced. It is handled by the template actively.
2. Make sure you load the `--jinja` flag so the engine utilizes the custom parsing rules.
3. Overwrite the `chat_template.jinja` source file locally.