fix: emit empty thought-channel primer on historical assistant turns for APC

When enable_thinking=false, the generation prompt inserts an empty
<|channel>thought\n<channel|> block after <|turn>model\n to suppress
thinking. But historical assistant turns in multi-turn replay did not
emit this same block. This caused the KV cache from turn N to diverge
from the prompt for turn N+1, breaking vLLM's automatic prefix caching
(APC) — every multi-turn continuation was a cache miss.

Fix: insert the empty thought-channel primer on historical model turns
when enable_thinking is false and the message has no reasoning content.
Guarded by the existing continue_same_model_turn check to avoid
duplicates in assistant->tool->assistant continuations.

Files changed (1) hide show

chat_template.jinja +3 -0

chat_template.jinja CHANGED Viewed

@@ -226,6 +226,9 @@
     {%- set continue_same_model_turn = (role == 'model' and ns.prev_non_tool_role == 'assistant') -%}
     {%- if not continue_same_model_turn -%}
         {{- '<|turn>' + role + '\n' }}
     {%- endif -%}
     {#- Render reasoning/reasoning_content as thinking channel (tool-call turns only) -#}

     {%- set continue_same_model_turn = (role == 'model' and ns.prev_non_tool_role == 'assistant') -%}
     {%- if not continue_same_model_turn -%}
         {{- '<|turn>' + role + '\n' }}
+        {%- if role == 'model' and not enable_thinking and not (message.get('reasoning') or message.get('reasoning_content')) -%}
+            {{- '<|channel>thought\n<channel|>' -}}
+        {%- endif -%}
     {%- endif -%}
     {#- Render reasoning/reasoning_content as thinking channel (tool-call turns only) -#}