llmartin commited on
Commit
ddd6554
·
1 Parent(s): 96389c0

add: reasoning parser plugin compatible with vllm>= 0.21.0

Browse files
Files changed (2) hide show
  1. README.md +18 -1
  2. reasoning_parser_plugin.py +297 -0
README.md CHANGED
@@ -95,7 +95,9 @@ vllm serve domyn/Domyn-Small-v1.0 \
95
 
96
  ### vLLM — With Reasoning Parsing
97
 
98
- To have vLLM automatically extract the model's `<think>` blocks and expose them as a structured `reasoning_content` field, add `--reasoning-parser olmo3`. Domyn Small emits the identical `<think>…</think>` format as OLMo 3, so the OLMo 3 parser plugin works directly — no Domyn-specific parser is required.
 
 
99
 
100
  ```bash
101
  vllm serve domyn/Domyn-Small-v1.0 \
@@ -107,6 +109,21 @@ vllm serve domyn/Domyn-Small-v1.0 \
107
  --reasoning-parser olmo3
108
  ```
109
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
110
  ### vLLM — Extended Context with YaRN
111
 
112
  > YaRN scaling may impact model quality on inputs shorter than 32k. Enable it only when you actually need contexts beyond the native 32,768-token window.
 
95
 
96
  ### vLLM — With Reasoning Parsing
97
 
98
+ To have vLLM automatically extract the model's `<think>` blocks and expose them as a structured `reasoning_content` field, add a reasoning-parser flag. Which flag to use depends on your vLLM version.
99
+
100
+ **vLLM < 0.21.0** — Domyn Small emits the same `<think>…</think>` format as OLMo 3, and earlier vLLM releases work with the OLMo 3 parser directly:
101
 
102
  ```bash
103
  vllm serve domyn/Domyn-Small-v1.0 \
 
109
  --reasoning-parser olmo3
110
  ```
111
 
112
+ **vLLM ≥ 0.21.0 (recommended)** — use the Domyn-specific parser plugin shipped with this checkpoint (`reasoning_parser_plugin.py`). It reads the per-request `enable_thinking` flag (or the `thinking on` / `thinking off` system-prompt directive) and routes streamed output to the correct lane (`reasoning` vs `content`) for both modes.
113
+
114
+ ```bash
115
+ vllm serve domyn/Domyn-Small-v1.0 \
116
+ --tensor-parallel-size 1 \
117
+ --dtype bfloat16 \
118
+ --max-model-len 32768 \
119
+ --max-num-seqs 256 \
120
+ --gpu-memory-utilization 0.9 \
121
+ --reasoning-parser think_block \
122
+ --reasoning-parser-plugin /path/to/reasoning_parser_plugin.py
123
+ ```
124
+
125
+ Replace `/path/to/` with the actual path to the plugin file bundled with the checkpoint. The parser name `think_block` is the registration string declared inside the plugin and must match exactly.
126
+
127
  ### vLLM — Extended Context with YaRN
128
 
129
  > YaRN scaling may impact model quality on inputs shorter than 32k. Enable it only when you actually need contexts beyond the native 32,768-token window.
reasoning_parser_plugin.py ADDED
@@ -0,0 +1,297 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Reasoning parser plugin for Domyn-Small ``<think>...</think>`` outputs.
2
+
3
+ Loaded into vLLM with ``--reasoning-parser-plugin <path>`` and selected via
4
+ ``--reasoning-parser think_block``. The parser splits each model output on
5
+ the literal ``</think>`` marker: everything before it is reasoning,
6
+ everything after is final content.
7
+
8
+ See :class:`ThinkBlockReasoningParser` for the streaming state machine and
9
+ how per-request thinking-on/off is discovered.
10
+ """
11
+
12
+ from __future__ import annotations
13
+
14
+ from collections.abc import Iterable, Sequence
15
+ from typing import TYPE_CHECKING
16
+
17
+ from vllm.reasoning import ReasoningParser, ReasoningParserManager
18
+
19
+ if TYPE_CHECKING:
20
+ from vllm.entrypoints.openai.chat_completion.protocol import ChatCompletionRequest
21
+ from vllm.entrypoints.openai.engine.protocol import DeltaMessage
22
+ from vllm.entrypoints.openai.responses.protocol import ResponsesRequest
23
+
24
+ # Literal markers emitted by the Domyn-Small chat template. `<think>` is
25
+ # pre-emitted by the prompt, so model output never starts with it; only `</think>`
26
+ # actually has to be detected at runtime.
27
+ START = "<think>"
28
+ END = "</think>"
29
+
30
+
31
+ def _max_suffix_prefix(s: str, marker: str) -> str:
32
+ """Longest non-empty suffix of ``s`` that is also a prefix of ``marker``.
33
+
34
+ Used to decide how many trailing bytes of the streaming buffer must be
35
+ held back — if those bytes could still grow into ``marker`` on the next
36
+ delta, releasing them now would fragment the marker across deltas (e.g.
37
+ emitting ``</thi`` and then ``nk>``).
38
+ """
39
+ for i in range(min(len(marker) - 1, len(s)), 0, -1):
40
+ if s.endswith(marker[:i]):
41
+ return s[-i:]
42
+ return ""
43
+
44
+
45
+ @ReasoningParserManager.register_module("think_block")
46
+ class ThinkBlockReasoningParser(ReasoningParser):
47
+ """Splits model output on the literal ``</think>`` marker.
48
+
49
+ **Streaming.** Olmo3-style buffered state machine: incoming text is
50
+ accumulated in :attr:`_buffer` and only released when the marker is
51
+ either confirmed (split point reached) or ruled out (the buffer tail
52
+ can no longer be a prefix of ``</think>``). This guarantees the marker
53
+ is never fragmented across deltas.
54
+
55
+ **Per-request lane.** The initial lane (``"reasoning"`` vs
56
+ ``"content"``) is set from the request itself: ``True`` if
57
+ ``chat_template_kwargs.enable_thinking`` (or ``.thinking``) is truthy,
58
+ or if any system message contains the literal ``"thinking on"``
59
+ directive — mirroring the chat template's own detection.
60
+
61
+ **Request discovery.** vLLM instantiates the parser per request from
62
+ inside ``create_chat_completion(self, request, ...)``, but does not
63
+ pass the request to the constructor. We recover it by walking the call
64
+ stack at ``__init__`` time, inspecting only each frame's *function
65
+ arguments* (so we don't accidentally match request-shaped objects in
66
+ module globals or unrelated locals). If no request is found we fall
67
+ back to ``thinking=off``, which keeps tool-call streaming working out
68
+ of the box.
69
+ """
70
+
71
+ def __init__(self, tokenizer, *args, **kwargs) -> None:
72
+ # Base ReasoningParser only accepts `tokenizer`; swallow any extras so
73
+ # the registration signature stays compatible across vLLM versions.
74
+ super().__init__(tokenizer)
75
+ self._buffer: str = ""
76
+ # Current lane for streaming output: "reasoning" while inside
77
+ # <think>...</think>, "content" otherwise. Locked to "content" once
78
+ # `</think>` is observed.
79
+ self._state: str = "content"
80
+ # Tracks whether we have applied per-request configuration yet —
81
+ # stack-walking covers the streaming path; `extract_reasoning` also
82
+ # configures on the first non-streaming call as a safety net.
83
+ self._configured: bool = False
84
+
85
+ request = self._find_request_in_stack()
86
+ if request is not None:
87
+ self._configure_for_request(request)
88
+
89
+ @staticmethod
90
+ def _looks_like_request(obj) -> bool:
91
+ """Duck-typed check for ChatCompletionRequest / ResponsesRequest.
92
+
93
+ Avoids importing vLLM's protocol module, which differs across forks
94
+ and isn't guaranteed to be importable at plugin load time.
95
+ """
96
+ return hasattr(obj, "messages") and (
97
+ hasattr(obj, "chat_template_kwargs") or hasattr(obj, "stream")
98
+ )
99
+
100
+ @classmethod
101
+ def _find_request_in_stack(cls, max_depth: int = 12):
102
+ """Locate the in-flight request by scanning caller-frame arguments.
103
+
104
+ Walks a bounded number of caller frames via ``sys._getframe`` /
105
+ ``frame.f_back`` and inspects only each frame's *function
106
+ arguments* — never its full locals. This matches vLLM's
107
+ ``create_chat_completion(self, request, ...)`` signature and avoids
108
+ matching request-shaped objects that happen to live in module
109
+ globals or unrelated locals (e.g. test fixtures).
110
+
111
+ We deliberately avoid :func:`inspect.stack`, which reads source
112
+ files via ``linecache`` and builds ``FrameInfo`` objects for the
113
+ whole stack on every call — measurable overhead per request under
114
+ high concurrency, since parser construction is per-request and
115
+ runs under the GIL on the serving event loop.
116
+ """
117
+ import sys
118
+ try:
119
+ frame = sys._getframe(1)
120
+ except Exception:
121
+ return None
122
+ depth = 0
123
+ while frame is not None and depth < max_depth:
124
+ code = frame.f_code
125
+ n_args = code.co_argcount + code.co_kwonlyargcount
126
+ for name in code.co_varnames[:n_args]:
127
+ value = frame.f_locals.get(name)
128
+ if cls._looks_like_request(value):
129
+ return value
130
+ frame = frame.f_back
131
+ depth += 1
132
+ return None
133
+
134
+ def _configure_for_request(self, request) -> None:
135
+ """Set initial streaming lane from the request's thinking flag."""
136
+ self._state = "reasoning" if self._thinking_was_enabled(request) else "content"
137
+ self._configured = True
138
+
139
+ def _decode(self, ids: Sequence[int]) -> str:
140
+ # `skip_special_tokens=False` is required: `</think>` may be tokenized
141
+ # as (or contain) special tokens that the default decode would strip,
142
+ # which would silently break marker detection.
143
+ try:
144
+ return self.model_tokenizer.decode(list(ids), skip_special_tokens=False)
145
+ except Exception:
146
+ return ""
147
+
148
+ @property
149
+ def reasoning_start_str(self) -> str | None:
150
+ return START
151
+
152
+ @property
153
+ def reasoning_end_str(self) -> str | None:
154
+ return END
155
+
156
+ def is_reasoning_end(self, input_ids: Sequence[int]) -> bool:
157
+ return END in self._decode(input_ids)
158
+
159
+ def is_reasoning_end_streaming(
160
+ self, input_ids: Sequence[int], delta_ids: Iterable[int]
161
+ ) -> bool:
162
+ # Decode a 64-token tail window so the marker is detected even when
163
+ # it straddles the previous-vs-delta token boundary (BPE may split
164
+ # `</think>` across multiple tokens, especially around punctuation).
165
+ tail = list(input_ids)[-64:]
166
+ return END in self._decode(tail)
167
+
168
+ def extract_content_ids(self, input_ids: list[int]) -> list[int]:
169
+ text = self._decode(input_ids)
170
+ idx = text.rfind(END)
171
+ if idx < 0:
172
+ return []
173
+ try:
174
+ return self.model_tokenizer.encode(
175
+ text[idx + len(END):], add_special_tokens=False
176
+ )
177
+ except Exception:
178
+ return []
179
+
180
+ def count_reasoning_tokens(self, token_ids: Sequence[int]) -> int:
181
+ text = self._decode(token_ids)
182
+ idx = text.find(END)
183
+ prefix = text if idx < 0 else text[:idx]
184
+ try:
185
+ return len(self.model_tokenizer.encode(prefix, add_special_tokens=False))
186
+ except Exception:
187
+ return 0
188
+
189
+ def extract_reasoning(
190
+ self,
191
+ model_output: str,
192
+ request: "ChatCompletionRequest | ResponsesRequest",
193
+ ) -> tuple[str | None, str | None]:
194
+ """Split a full (non-streaming) output into ``(reasoning, content)``.
195
+
196
+ Returns ``(None, content)`` when the request has thinking disabled
197
+ and the output contains no marker — the chat template pre-emits
198
+ ``<think></think>`` in the prompt in that case, so a marker-less
199
+ output is purely the answer.
200
+ """
201
+ # Configure streaming state as a side effect: a fork's serving layer
202
+ # may call this before streaming starts, and we don't want the
203
+ # streaming path to fall back to the `thinking=off` default if the
204
+ # request actually had thinking enabled.
205
+ if not self._configured:
206
+ self._configure_for_request(request)
207
+
208
+ s = model_output
209
+ if s.startswith(START):
210
+ s = s[len(START):]
211
+ if END in s:
212
+ reasoning, _, content = s.partition(END)
213
+ return (reasoning.strip("\n") or None, content.lstrip("\n") or None)
214
+ # No `</think>` in output: only treat the text as truncated reasoning
215
+ # if we have positive evidence that thinking was enabled — otherwise
216
+ # it is the final answer.
217
+ if self._thinking_was_enabled(request):
218
+ return (s.strip("\n") or None, None)
219
+ return (None, s.lstrip("\n") or None)
220
+
221
+ @staticmethod
222
+ def _thinking_was_enabled(request) -> bool:
223
+ """Whether ``request`` asked for reasoning to be emitted.
224
+
225
+ Mirrors the chat template's own detection so the parser stays in
226
+ lockstep with prompt construction: enabled iff
227
+ ``chat_template_kwargs.enable_thinking`` (or ``.thinking``) is
228
+ truthy, or any system message contains the literal ``"thinking on"``
229
+ directive (case-insensitive).
230
+ """
231
+ kwargs = getattr(request, "chat_template_kwargs", None) or {}
232
+ if kwargs.get("enable_thinking") or kwargs.get("thinking"):
233
+ return True
234
+ messages = getattr(request, "messages", None) or []
235
+ for m in messages:
236
+ role = m.get("role") if isinstance(m, dict) else getattr(m, "role", None)
237
+ if role != "system":
238
+ continue
239
+ content = m.get("content") if isinstance(m, dict) else getattr(m, "content", None)
240
+ if isinstance(content, str) and "thinking on" in content.lower():
241
+ return True
242
+ return False
243
+
244
+ def extract_reasoning_streaming(
245
+ self,
246
+ previous_text: str,
247
+ current_text: str,
248
+ delta_text: str,
249
+ previous_token_ids: Sequence[int],
250
+ current_token_ids: Sequence[int],
251
+ delta_token_ids: Sequence[int],
252
+ ) -> "DeltaMessage | None":
253
+ """Emit one ``DeltaMessage`` per delta, routed to reasoning or content.
254
+
255
+ The marker ``</think>`` is never emitted to the client. Trailing
256
+ bytes of the buffer that *could* still grow into the marker on the
257
+ next delta are held back, so the marker is never fragmented across
258
+ deltas (e.g. ``</thi`` ... ``nk>``). When the marker is observed,
259
+ pre-marker bytes go to the current lane and post-marker bytes go
260
+ to ``content``; the lane is then locked to ``content``.
261
+ """
262
+ from vllm.entrypoints.openai.engine.protocol import DeltaMessage
263
+
264
+ self._buffer += delta_text
265
+
266
+ # Case 1 — marker fully present in the buffer: split and switch lane.
267
+ # The pre-marker chunk stays on the *current* lane (reasoning if we
268
+ # were inside <think>, content otherwise); the post-marker chunk
269
+ # always goes to content; the lane is locked to content afterwards.
270
+ idx = self._buffer.find(END)
271
+ if idx >= 0:
272
+ pre = self._buffer[:idx]
273
+ post = self._buffer[idx + len(END):]
274
+ self._buffer = ""
275
+ pre_lane = self._state
276
+ self._state = "content"
277
+ if not pre and not post:
278
+ return None
279
+ fields: dict = {}
280
+ if pre:
281
+ fields[pre_lane] = pre
282
+ if post:
283
+ # `.get` covers the edge case where pre_lane is already
284
+ # "content" and both pre and post are non-empty — they get
285
+ # concatenated into a single content delta.
286
+ fields["content"] = fields.get("content", "") + post
287
+ return DeltaMessage(**fields)
288
+
289
+ # Case 2 — no marker yet: release everything except a possible
290
+ # partial-marker tail, which we retain for the next delta.
291
+ held = _max_suffix_prefix(self._buffer, END)
292
+ safe_end = len(self._buffer) - len(held)
293
+ if safe_end == 0:
294
+ return None
295
+ chunk = self._buffer[:safe_end]
296
+ self._buffer = self._buffer[safe_end:]
297
+ return DeltaMessage(**{self._state: chunk})