lucianommartins commited on
Commit
706856c
·
verified ·
1 Parent(s): 5330c5f

fix: emit empty thought-channel primer on historical assistant turns for APC

Browse files

When enable_thinking=false, the generation prompt inserts an empty
<|channel>thought\n<channel|> block after <|turn>model\n to suppress
thinking. But historical assistant turns in multi-turn replay did not
emit this same block. This caused the KV cache from turn N to diverge
from the prompt for turn N+1, breaking vLLM's automatic prefix caching
(APC) — every multi-turn continuation was a cache miss.

Fix: insert the empty thought-channel primer on historical model turns
when enable_thinking is false and the message has no reasoning content.
Guarded by the existing continue_same_model_turn check to avoid
duplicates in assistant->tool->assistant continuations.

Files changed (1) hide show
  1. chat_template.jinja +3 -0
chat_template.jinja CHANGED
@@ -226,6 +226,9 @@
226
  {%- set continue_same_model_turn = (role == 'model' and ns.prev_non_tool_role == 'assistant') -%}
227
  {%- if not continue_same_model_turn -%}
228
  {{- '<|turn>' + role + '\n' }}
 
 
 
229
  {%- endif -%}
230
 
231
  {#- Render reasoning/reasoning_content as thinking channel (tool-call turns only) -#}
 
226
  {%- set continue_same_model_turn = (role == 'model' and ns.prev_non_tool_role == 'assistant') -%}
227
  {%- if not continue_same_model_turn -%}
228
  {{- '<|turn>' + role + '\n' }}
229
+ {%- if role == 'model' and not enable_thinking and not (message.get('reasoning') or message.get('reasoning_content')) -%}
230
+ {{- '<|channel>thought\n<channel|>' -}}
231
+ {%- endif -%}
232
  {%- endif -%}
233
 
234
  {#- Render reasoning/reasoning_content as thinking channel (tool-call turns only) -#}