Chat-template fix: occasional </think> leak + body duplication on long-context tasks (patched template attached)

by janreges3 - opened Apr 28

•

Quick heads-up on a chat-template issue we hit while running your distill in production on long-context structured-output tasks (Czech and English, 60–90k prompt tokens, 16–80k max_tokens).

What happens

On a non-trivial fraction of runs (we observed it in roughly 1 in 3 to 1 in 5 samples) the response gets corrupted in two related ways:

The </think> token leaks into the content field instead of being captured in reasoning_content — i.e. internal monologue ends up visible to the consumer, terminated by a literal </think> tag.
The full final answer is then emitted twice: a clean copy follows the leaked </think>, then a second top-level heading appears and the whole document is re-emitted (often with minor wording variations between the two copies).

The combined effect is that downstream tooling — anything expecting one clean markdown document with reasoning stripped — gets junk: leaked monologue, then two stitched copies of the answer. On a 175k-token product-sheet synthesis task this produced a 44k-char content field where roughly the first half was internal monologue plus a duplicate of the second half. Pretty messy.

When the bug doesn't fire, output is excellent — this is a stochastic / intermittent issue, not deterministic.

Patched chat template

I went through the Jinja and found three issues that are worth fixing together. Patched template + detailed change notes are here:

https://gist.github.com/janreges/609e909215de5489255ee82b7481a321

Summary of the three fixes:

FIX 1 — restored the enable_thinking branch using the standard Qwen3
empty-block pre-fill trick (<think>\n\n</think>\n\n). The original
template unconditionally appended \n to the assistant
generation prompt, so chat_template_kwargs={"enable_thinking":false}
was silently a no-op — users actually got MORE reasoning chars in
"non-thinking" runs than in default runs in our benchmarks, because
there was no opt-out path at all.

FIX 2 — stop emitting an empty <think>\n\n</think>\n\n block when a past
assistant turn has no captured reasoning_content. The original
rendered the empty block whenever loop.last was true, regardless of
whether reasoning was present. This trains/conditions the model on
the pattern "empty thinking block is OK", which we suspect
contributes to the spurious </think> emission in long generation.

FIX 3 — hardened the </think> split that recovers reasoning_content from a
stringly-typed content. The original used split('</think>')[-1],
which destructively collapses any output containing multiple
</think> tokens (i.e. the exact failure mode of the bug).

Verification

After applying the patched template I ran the same 175k-token Czech
product-sheet synthesis prompt three times in parallel:

5/5 runs: no </think> in content
5/5 runs: single top-level heading (no body duplication)

Independent of the template

The body-duplication symptom (two copies of the final answer with slightly different wording) likely also has a model-weight component — looks like the distill data may contain examples where the assistant summarises the answer inside <think>...</think> and then re-emits it cleanly afterwards. If you have access to the SFT corpus, grepping for assistant turns where </think> appears more than once, or where the same top-level heading appears twice consecutively, would confirm this. The template patch reduces the trigger frequency but cannot fix it entirely without a follow-up SFT pass.

The model card also doesn't currently mention that enable_thinking=false is not supported by the published template — adding either FIX 1 or a one-line disclaimer would close that surprise.

rico03

Owner Apr 28

Thanks for the comment! It's very useful. Today I'll fix the problems

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment