Text Generation
Safetensors
English
multilingual
qwen3_5
qwen3.6
reasoning
distillation
claude-opus
lora
unsloth
fine-tuned
conversational

Chat-template fix: occasional </think> leak + body duplication on long-context tasks (patched template attached)

#3
by janreges3 - opened

Hi @rico03 ,

Quick heads-up on a chat-template issue we hit while running your distill in production on long-context structured-output tasks (Czech and English, 60–90k prompt tokens, 16–80k max_tokens).

What happens

On a non-trivial fraction of runs (we observed it in roughly 1 in 3 to 1 in 5 samples) the response gets corrupted in two related ways:

  1. The </think> token leaks into the content field instead of being captured in reasoning_content β€” i.e. internal monologue ends up visible to the consumer, terminated by a literal </think> tag.

  2. The full final answer is then emitted twice: a clean copy follows the leaked </think>, then a second top-level heading appears and the whole document is re-emitted (often with minor wording variations between the two copies).

The combined effect is that downstream tooling β€” anything expecting one clean markdown document with reasoning stripped β€” gets junk: leaked monologue, then two stitched copies of the answer. On a 175k-token product-sheet synthesis task this produced a 44k-char content field where roughly the first half was internal monologue plus a duplicate of the second half. Pretty messy.

When the bug doesn't fire, output is excellent β€” this is a stochastic / intermittent issue, not deterministic.

Patched chat template

I went through the Jinja and found three issues that are worth fixing together. Patched template + detailed change notes are here:

https://gist.github.com/janreges/609e909215de5489255ee82b7481a321

Summary of the three fixes:

FIX 1 β€” restored the enable_thinking branch using the standard Qwen3
empty-block pre-fill trick (<think>\n\n</think>\n\n). The original
template unconditionally appended \n to the assistant
generation prompt, so chat_template_kwargs={"enable_thinking":false}
was silently a no-op β€” users actually got MORE reasoning chars in
"non-thinking" runs than in default runs in our benchmarks, because
there was no opt-out path at all.

FIX 2 β€” stop emitting an empty <think>\n\n</think>\n\n block when a past
assistant turn has no captured reasoning_content. The original
rendered the empty block whenever loop.last was true, regardless of
whether reasoning was present. This trains/conditions the model on
the pattern "empty thinking block is OK", which we suspect
contributes to the spurious </think> emission in long generation.

FIX 3 β€” hardened the </think> split that recovers reasoning_content from a
stringly-typed content. The original used split('</think>')[-1],
which destructively collapses any output containing multiple
</think> tokens (i.e. the exact failure mode of the bug).

Verification

After applying the patched template I ran the same 175k-token Czech
product-sheet synthesis prompt three times in parallel:

  • 5/5 runs: no </think> in content
  • 5/5 runs: single top-level heading (no body duplication)

Independent of the template

The body-duplication symptom (two copies of the final answer with slightly different wording) likely also has a model-weight component β€” looks like the distill data may contain examples where the assistant summarises the answer inside <think>...</think> and then re-emits it cleanly afterwards. If you have access to the SFT corpus, grepping for assistant turns where </think> appears more than once, or where the same top-level heading appears twice consecutively, would confirm this. The template patch reduces the trigger frequency but cannot fix it entirely without a follow-up SFT pass.

The model card also doesn't currently mention that enable_thinking=false is not supported by the published template β€” adding either FIX 1 or a one-line disclaimer would close that surprise.

Owner

Thanks for the comment! It's very useful. Today I'll fix the problems

Sign up or log in to comment