Text Generation
Transformers
Safetensors
nemotron
reasoning
dual-mode
thinking
tool-calling
agentic
multilingual
conversational
Instructions to use domyn/Domyn-Small-v1.0 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use domyn/Domyn-Small-v1.0 with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="domyn/Domyn-Small-v1.0") messages = [ {"role": "user", "content": "Who are you?"}, ] pipe(messages)# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("domyn/Domyn-Small-v1.0") model = AutoModelForCausalLM.from_pretrained("domyn/Domyn-Small-v1.0") messages = [ {"role": "user", "content": "Who are you?"}, ] inputs = tokenizer.apply_chat_template( messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt", ).to(model.device) outputs = model.generate(**inputs, max_new_tokens=40) print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:])) - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use domyn/Domyn-Small-v1.0 with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "domyn/Domyn-Small-v1.0" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "domyn/Domyn-Small-v1.0", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/domyn/Domyn-Small-v1.0
- SGLang
How to use domyn/Domyn-Small-v1.0 with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "domyn/Domyn-Small-v1.0" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "domyn/Domyn-Small-v1.0", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "domyn/Domyn-Small-v1.0" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "domyn/Domyn-Small-v1.0", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }' - Docker Model Runner
How to use domyn/Domyn-Small-v1.0 with Docker Model Runner:
docker model run hf.co/domyn/Domyn-Small-v1.0
add: reasoning parser plugin compatible with vllm>= 0.21.0
Browse files- README.md +18 -1
- reasoning_parser_plugin.py +297 -0
README.md
CHANGED
|
@@ -95,7 +95,9 @@ vllm serve domyn/Domyn-Small-v1.0 \
|
|
| 95 |
|
| 96 |
### vLLM — With Reasoning Parsing
|
| 97 |
|
| 98 |
-
To have vLLM automatically extract the model's `<think>` blocks and expose them as a structured `reasoning_content` field, add
|
|
|
|
|
|
|
| 99 |
|
| 100 |
```bash
|
| 101 |
vllm serve domyn/Domyn-Small-v1.0 \
|
|
@@ -107,6 +109,21 @@ vllm serve domyn/Domyn-Small-v1.0 \
|
|
| 107 |
--reasoning-parser olmo3
|
| 108 |
```
|
| 109 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 110 |
### vLLM — Extended Context with YaRN
|
| 111 |
|
| 112 |
> YaRN scaling may impact model quality on inputs shorter than 32k. Enable it only when you actually need contexts beyond the native 32,768-token window.
|
|
|
|
| 95 |
|
| 96 |
### vLLM — With Reasoning Parsing
|
| 97 |
|
| 98 |
+
To have vLLM automatically extract the model's `<think>` blocks and expose them as a structured `reasoning_content` field, add a reasoning-parser flag. Which flag to use depends on your vLLM version.
|
| 99 |
+
|
| 100 |
+
**vLLM < 0.21.0** — Domyn Small emits the same `<think>…</think>` format as OLMo 3, and earlier vLLM releases work with the OLMo 3 parser directly:
|
| 101 |
|
| 102 |
```bash
|
| 103 |
vllm serve domyn/Domyn-Small-v1.0 \
|
|
|
|
| 109 |
--reasoning-parser olmo3
|
| 110 |
```
|
| 111 |
|
| 112 |
+
**vLLM ≥ 0.21.0 (recommended)** — use the Domyn-specific parser plugin shipped with this checkpoint (`reasoning_parser_plugin.py`). It reads the per-request `enable_thinking` flag (or the `thinking on` / `thinking off` system-prompt directive) and routes streamed output to the correct lane (`reasoning` vs `content`) for both modes.
|
| 113 |
+
|
| 114 |
+
```bash
|
| 115 |
+
vllm serve domyn/Domyn-Small-v1.0 \
|
| 116 |
+
--tensor-parallel-size 1 \
|
| 117 |
+
--dtype bfloat16 \
|
| 118 |
+
--max-model-len 32768 \
|
| 119 |
+
--max-num-seqs 256 \
|
| 120 |
+
--gpu-memory-utilization 0.9 \
|
| 121 |
+
--reasoning-parser think_block \
|
| 122 |
+
--reasoning-parser-plugin /path/to/reasoning_parser_plugin.py
|
| 123 |
+
```
|
| 124 |
+
|
| 125 |
+
Replace `/path/to/` with the actual path to the plugin file bundled with the checkpoint. The parser name `think_block` is the registration string declared inside the plugin and must match exactly.
|
| 126 |
+
|
| 127 |
### vLLM — Extended Context with YaRN
|
| 128 |
|
| 129 |
> YaRN scaling may impact model quality on inputs shorter than 32k. Enable it only when you actually need contexts beyond the native 32,768-token window.
|
reasoning_parser_plugin.py
ADDED
|
@@ -0,0 +1,297 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Reasoning parser plugin for Domyn-Small ``<think>...</think>`` outputs.
|
| 2 |
+
|
| 3 |
+
Loaded into vLLM with ``--reasoning-parser-plugin <path>`` and selected via
|
| 4 |
+
``--reasoning-parser think_block``. The parser splits each model output on
|
| 5 |
+
the literal ``</think>`` marker: everything before it is reasoning,
|
| 6 |
+
everything after is final content.
|
| 7 |
+
|
| 8 |
+
See :class:`ThinkBlockReasoningParser` for the streaming state machine and
|
| 9 |
+
how per-request thinking-on/off is discovered.
|
| 10 |
+
"""
|
| 11 |
+
|
| 12 |
+
from __future__ import annotations
|
| 13 |
+
|
| 14 |
+
from collections.abc import Iterable, Sequence
|
| 15 |
+
from typing import TYPE_CHECKING
|
| 16 |
+
|
| 17 |
+
from vllm.reasoning import ReasoningParser, ReasoningParserManager
|
| 18 |
+
|
| 19 |
+
if TYPE_CHECKING:
|
| 20 |
+
from vllm.entrypoints.openai.chat_completion.protocol import ChatCompletionRequest
|
| 21 |
+
from vllm.entrypoints.openai.engine.protocol import DeltaMessage
|
| 22 |
+
from vllm.entrypoints.openai.responses.protocol import ResponsesRequest
|
| 23 |
+
|
| 24 |
+
# Literal markers emitted by the Domyn-Small chat template. `<think>` is
|
| 25 |
+
# pre-emitted by the prompt, so model output never starts with it; only `</think>`
|
| 26 |
+
# actually has to be detected at runtime.
|
| 27 |
+
START = "<think>"
|
| 28 |
+
END = "</think>"
|
| 29 |
+
|
| 30 |
+
|
| 31 |
+
def _max_suffix_prefix(s: str, marker: str) -> str:
|
| 32 |
+
"""Longest non-empty suffix of ``s`` that is also a prefix of ``marker``.
|
| 33 |
+
|
| 34 |
+
Used to decide how many trailing bytes of the streaming buffer must be
|
| 35 |
+
held back — if those bytes could still grow into ``marker`` on the next
|
| 36 |
+
delta, releasing them now would fragment the marker across deltas (e.g.
|
| 37 |
+
emitting ``</thi`` and then ``nk>``).
|
| 38 |
+
"""
|
| 39 |
+
for i in range(min(len(marker) - 1, len(s)), 0, -1):
|
| 40 |
+
if s.endswith(marker[:i]):
|
| 41 |
+
return s[-i:]
|
| 42 |
+
return ""
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
@ReasoningParserManager.register_module("think_block")
|
| 46 |
+
class ThinkBlockReasoningParser(ReasoningParser):
|
| 47 |
+
"""Splits model output on the literal ``</think>`` marker.
|
| 48 |
+
|
| 49 |
+
**Streaming.** Olmo3-style buffered state machine: incoming text is
|
| 50 |
+
accumulated in :attr:`_buffer` and only released when the marker is
|
| 51 |
+
either confirmed (split point reached) or ruled out (the buffer tail
|
| 52 |
+
can no longer be a prefix of ``</think>``). This guarantees the marker
|
| 53 |
+
is never fragmented across deltas.
|
| 54 |
+
|
| 55 |
+
**Per-request lane.** The initial lane (``"reasoning"`` vs
|
| 56 |
+
``"content"``) is set from the request itself: ``True`` if
|
| 57 |
+
``chat_template_kwargs.enable_thinking`` (or ``.thinking``) is truthy,
|
| 58 |
+
or if any system message contains the literal ``"thinking on"``
|
| 59 |
+
directive — mirroring the chat template's own detection.
|
| 60 |
+
|
| 61 |
+
**Request discovery.** vLLM instantiates the parser per request from
|
| 62 |
+
inside ``create_chat_completion(self, request, ...)``, but does not
|
| 63 |
+
pass the request to the constructor. We recover it by walking the call
|
| 64 |
+
stack at ``__init__`` time, inspecting only each frame's *function
|
| 65 |
+
arguments* (so we don't accidentally match request-shaped objects in
|
| 66 |
+
module globals or unrelated locals). If no request is found we fall
|
| 67 |
+
back to ``thinking=off``, which keeps tool-call streaming working out
|
| 68 |
+
of the box.
|
| 69 |
+
"""
|
| 70 |
+
|
| 71 |
+
def __init__(self, tokenizer, *args, **kwargs) -> None:
|
| 72 |
+
# Base ReasoningParser only accepts `tokenizer`; swallow any extras so
|
| 73 |
+
# the registration signature stays compatible across vLLM versions.
|
| 74 |
+
super().__init__(tokenizer)
|
| 75 |
+
self._buffer: str = ""
|
| 76 |
+
# Current lane for streaming output: "reasoning" while inside
|
| 77 |
+
# <think>...</think>, "content" otherwise. Locked to "content" once
|
| 78 |
+
# `</think>` is observed.
|
| 79 |
+
self._state: str = "content"
|
| 80 |
+
# Tracks whether we have applied per-request configuration yet —
|
| 81 |
+
# stack-walking covers the streaming path; `extract_reasoning` also
|
| 82 |
+
# configures on the first non-streaming call as a safety net.
|
| 83 |
+
self._configured: bool = False
|
| 84 |
+
|
| 85 |
+
request = self._find_request_in_stack()
|
| 86 |
+
if request is not None:
|
| 87 |
+
self._configure_for_request(request)
|
| 88 |
+
|
| 89 |
+
@staticmethod
|
| 90 |
+
def _looks_like_request(obj) -> bool:
|
| 91 |
+
"""Duck-typed check for ChatCompletionRequest / ResponsesRequest.
|
| 92 |
+
|
| 93 |
+
Avoids importing vLLM's protocol module, which differs across forks
|
| 94 |
+
and isn't guaranteed to be importable at plugin load time.
|
| 95 |
+
"""
|
| 96 |
+
return hasattr(obj, "messages") and (
|
| 97 |
+
hasattr(obj, "chat_template_kwargs") or hasattr(obj, "stream")
|
| 98 |
+
)
|
| 99 |
+
|
| 100 |
+
@classmethod
|
| 101 |
+
def _find_request_in_stack(cls, max_depth: int = 12):
|
| 102 |
+
"""Locate the in-flight request by scanning caller-frame arguments.
|
| 103 |
+
|
| 104 |
+
Walks a bounded number of caller frames via ``sys._getframe`` /
|
| 105 |
+
``frame.f_back`` and inspects only each frame's *function
|
| 106 |
+
arguments* — never its full locals. This matches vLLM's
|
| 107 |
+
``create_chat_completion(self, request, ...)`` signature and avoids
|
| 108 |
+
matching request-shaped objects that happen to live in module
|
| 109 |
+
globals or unrelated locals (e.g. test fixtures).
|
| 110 |
+
|
| 111 |
+
We deliberately avoid :func:`inspect.stack`, which reads source
|
| 112 |
+
files via ``linecache`` and builds ``FrameInfo`` objects for the
|
| 113 |
+
whole stack on every call — measurable overhead per request under
|
| 114 |
+
high concurrency, since parser construction is per-request and
|
| 115 |
+
runs under the GIL on the serving event loop.
|
| 116 |
+
"""
|
| 117 |
+
import sys
|
| 118 |
+
try:
|
| 119 |
+
frame = sys._getframe(1)
|
| 120 |
+
except Exception:
|
| 121 |
+
return None
|
| 122 |
+
depth = 0
|
| 123 |
+
while frame is not None and depth < max_depth:
|
| 124 |
+
code = frame.f_code
|
| 125 |
+
n_args = code.co_argcount + code.co_kwonlyargcount
|
| 126 |
+
for name in code.co_varnames[:n_args]:
|
| 127 |
+
value = frame.f_locals.get(name)
|
| 128 |
+
if cls._looks_like_request(value):
|
| 129 |
+
return value
|
| 130 |
+
frame = frame.f_back
|
| 131 |
+
depth += 1
|
| 132 |
+
return None
|
| 133 |
+
|
| 134 |
+
def _configure_for_request(self, request) -> None:
|
| 135 |
+
"""Set initial streaming lane from the request's thinking flag."""
|
| 136 |
+
self._state = "reasoning" if self._thinking_was_enabled(request) else "content"
|
| 137 |
+
self._configured = True
|
| 138 |
+
|
| 139 |
+
def _decode(self, ids: Sequence[int]) -> str:
|
| 140 |
+
# `skip_special_tokens=False` is required: `</think>` may be tokenized
|
| 141 |
+
# as (or contain) special tokens that the default decode would strip,
|
| 142 |
+
# which would silently break marker detection.
|
| 143 |
+
try:
|
| 144 |
+
return self.model_tokenizer.decode(list(ids), skip_special_tokens=False)
|
| 145 |
+
except Exception:
|
| 146 |
+
return ""
|
| 147 |
+
|
| 148 |
+
@property
|
| 149 |
+
def reasoning_start_str(self) -> str | None:
|
| 150 |
+
return START
|
| 151 |
+
|
| 152 |
+
@property
|
| 153 |
+
def reasoning_end_str(self) -> str | None:
|
| 154 |
+
return END
|
| 155 |
+
|
| 156 |
+
def is_reasoning_end(self, input_ids: Sequence[int]) -> bool:
|
| 157 |
+
return END in self._decode(input_ids)
|
| 158 |
+
|
| 159 |
+
def is_reasoning_end_streaming(
|
| 160 |
+
self, input_ids: Sequence[int], delta_ids: Iterable[int]
|
| 161 |
+
) -> bool:
|
| 162 |
+
# Decode a 64-token tail window so the marker is detected even when
|
| 163 |
+
# it straddles the previous-vs-delta token boundary (BPE may split
|
| 164 |
+
# `</think>` across multiple tokens, especially around punctuation).
|
| 165 |
+
tail = list(input_ids)[-64:]
|
| 166 |
+
return END in self._decode(tail)
|
| 167 |
+
|
| 168 |
+
def extract_content_ids(self, input_ids: list[int]) -> list[int]:
|
| 169 |
+
text = self._decode(input_ids)
|
| 170 |
+
idx = text.rfind(END)
|
| 171 |
+
if idx < 0:
|
| 172 |
+
return []
|
| 173 |
+
try:
|
| 174 |
+
return self.model_tokenizer.encode(
|
| 175 |
+
text[idx + len(END):], add_special_tokens=False
|
| 176 |
+
)
|
| 177 |
+
except Exception:
|
| 178 |
+
return []
|
| 179 |
+
|
| 180 |
+
def count_reasoning_tokens(self, token_ids: Sequence[int]) -> int:
|
| 181 |
+
text = self._decode(token_ids)
|
| 182 |
+
idx = text.find(END)
|
| 183 |
+
prefix = text if idx < 0 else text[:idx]
|
| 184 |
+
try:
|
| 185 |
+
return len(self.model_tokenizer.encode(prefix, add_special_tokens=False))
|
| 186 |
+
except Exception:
|
| 187 |
+
return 0
|
| 188 |
+
|
| 189 |
+
def extract_reasoning(
|
| 190 |
+
self,
|
| 191 |
+
model_output: str,
|
| 192 |
+
request: "ChatCompletionRequest | ResponsesRequest",
|
| 193 |
+
) -> tuple[str | None, str | None]:
|
| 194 |
+
"""Split a full (non-streaming) output into ``(reasoning, content)``.
|
| 195 |
+
|
| 196 |
+
Returns ``(None, content)`` when the request has thinking disabled
|
| 197 |
+
and the output contains no marker — the chat template pre-emits
|
| 198 |
+
``<think></think>`` in the prompt in that case, so a marker-less
|
| 199 |
+
output is purely the answer.
|
| 200 |
+
"""
|
| 201 |
+
# Configure streaming state as a side effect: a fork's serving layer
|
| 202 |
+
# may call this before streaming starts, and we don't want the
|
| 203 |
+
# streaming path to fall back to the `thinking=off` default if the
|
| 204 |
+
# request actually had thinking enabled.
|
| 205 |
+
if not self._configured:
|
| 206 |
+
self._configure_for_request(request)
|
| 207 |
+
|
| 208 |
+
s = model_output
|
| 209 |
+
if s.startswith(START):
|
| 210 |
+
s = s[len(START):]
|
| 211 |
+
if END in s:
|
| 212 |
+
reasoning, _, content = s.partition(END)
|
| 213 |
+
return (reasoning.strip("\n") or None, content.lstrip("\n") or None)
|
| 214 |
+
# No `</think>` in output: only treat the text as truncated reasoning
|
| 215 |
+
# if we have positive evidence that thinking was enabled — otherwise
|
| 216 |
+
# it is the final answer.
|
| 217 |
+
if self._thinking_was_enabled(request):
|
| 218 |
+
return (s.strip("\n") or None, None)
|
| 219 |
+
return (None, s.lstrip("\n") or None)
|
| 220 |
+
|
| 221 |
+
@staticmethod
|
| 222 |
+
def _thinking_was_enabled(request) -> bool:
|
| 223 |
+
"""Whether ``request`` asked for reasoning to be emitted.
|
| 224 |
+
|
| 225 |
+
Mirrors the chat template's own detection so the parser stays in
|
| 226 |
+
lockstep with prompt construction: enabled iff
|
| 227 |
+
``chat_template_kwargs.enable_thinking`` (or ``.thinking``) is
|
| 228 |
+
truthy, or any system message contains the literal ``"thinking on"``
|
| 229 |
+
directive (case-insensitive).
|
| 230 |
+
"""
|
| 231 |
+
kwargs = getattr(request, "chat_template_kwargs", None) or {}
|
| 232 |
+
if kwargs.get("enable_thinking") or kwargs.get("thinking"):
|
| 233 |
+
return True
|
| 234 |
+
messages = getattr(request, "messages", None) or []
|
| 235 |
+
for m in messages:
|
| 236 |
+
role = m.get("role") if isinstance(m, dict) else getattr(m, "role", None)
|
| 237 |
+
if role != "system":
|
| 238 |
+
continue
|
| 239 |
+
content = m.get("content") if isinstance(m, dict) else getattr(m, "content", None)
|
| 240 |
+
if isinstance(content, str) and "thinking on" in content.lower():
|
| 241 |
+
return True
|
| 242 |
+
return False
|
| 243 |
+
|
| 244 |
+
def extract_reasoning_streaming(
|
| 245 |
+
self,
|
| 246 |
+
previous_text: str,
|
| 247 |
+
current_text: str,
|
| 248 |
+
delta_text: str,
|
| 249 |
+
previous_token_ids: Sequence[int],
|
| 250 |
+
current_token_ids: Sequence[int],
|
| 251 |
+
delta_token_ids: Sequence[int],
|
| 252 |
+
) -> "DeltaMessage | None":
|
| 253 |
+
"""Emit one ``DeltaMessage`` per delta, routed to reasoning or content.
|
| 254 |
+
|
| 255 |
+
The marker ``</think>`` is never emitted to the client. Trailing
|
| 256 |
+
bytes of the buffer that *could* still grow into the marker on the
|
| 257 |
+
next delta are held back, so the marker is never fragmented across
|
| 258 |
+
deltas (e.g. ``</thi`` ... ``nk>``). When the marker is observed,
|
| 259 |
+
pre-marker bytes go to the current lane and post-marker bytes go
|
| 260 |
+
to ``content``; the lane is then locked to ``content``.
|
| 261 |
+
"""
|
| 262 |
+
from vllm.entrypoints.openai.engine.protocol import DeltaMessage
|
| 263 |
+
|
| 264 |
+
self._buffer += delta_text
|
| 265 |
+
|
| 266 |
+
# Case 1 — marker fully present in the buffer: split and switch lane.
|
| 267 |
+
# The pre-marker chunk stays on the *current* lane (reasoning if we
|
| 268 |
+
# were inside <think>, content otherwise); the post-marker chunk
|
| 269 |
+
# always goes to content; the lane is locked to content afterwards.
|
| 270 |
+
idx = self._buffer.find(END)
|
| 271 |
+
if idx >= 0:
|
| 272 |
+
pre = self._buffer[:idx]
|
| 273 |
+
post = self._buffer[idx + len(END):]
|
| 274 |
+
self._buffer = ""
|
| 275 |
+
pre_lane = self._state
|
| 276 |
+
self._state = "content"
|
| 277 |
+
if not pre and not post:
|
| 278 |
+
return None
|
| 279 |
+
fields: dict = {}
|
| 280 |
+
if pre:
|
| 281 |
+
fields[pre_lane] = pre
|
| 282 |
+
if post:
|
| 283 |
+
# `.get` covers the edge case where pre_lane is already
|
| 284 |
+
# "content" and both pre and post are non-empty — they get
|
| 285 |
+
# concatenated into a single content delta.
|
| 286 |
+
fields["content"] = fields.get("content", "") + post
|
| 287 |
+
return DeltaMessage(**fields)
|
| 288 |
+
|
| 289 |
+
# Case 2 — no marker yet: release everything except a possible
|
| 290 |
+
# partial-marker tail, which we retain for the next delta.
|
| 291 |
+
held = _max_suffix_prefix(self._buffer, END)
|
| 292 |
+
safe_end = len(self._buffer) - len(held)
|
| 293 |
+
if safe_end == 0:
|
| 294 |
+
return None
|
| 295 |
+
chunk = self._buffer[:safe_end]
|
| 296 |
+
self._buffer = self._buffer[safe_end:]
|
| 297 |
+
return DeltaMessage(**{self._state: chunk})
|