Instructions to use rico03/Qwen3.6-27B-Claude-Opus-Reasoning-Distilled with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Local Apps Settings
- Unsloth Studio
How to use rico03/Qwen3.6-27B-Claude-Opus-Reasoning-Distilled with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for rico03/Qwen3.6-27B-Claude-Opus-Reasoning-Distilled to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for rico03/Qwen3.6-27B-Claude-Opus-Reasoning-Distilled to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for rico03/Qwen3.6-27B-Claude-Opus-Reasoning-Distilled to start chatting
Load model with FastModel
pip install unsloth from unsloth import FastModel model, tokenizer = FastModel.from_pretrained( model_name="rico03/Qwen3.6-27B-Claude-Opus-Reasoning-Distilled", max_seq_length=2048, )
Chat-template fix: occasional </think> leak + body duplication on long-context tasks (patched template attached)
Hi @rico03 ,
Quick heads-up on a chat-template issue we hit while running your distill in production on long-context structured-output tasks (Czech and English, 60β90k prompt tokens, 16β80k max_tokens).
What happens
On a non-trivial fraction of runs (we observed it in roughly 1 in 3 to 1 in 5 samples) the response gets corrupted in two related ways:
The
</think>token leaks into thecontentfield instead of being captured inreasoning_contentβ i.e. internal monologue ends up visible to the consumer, terminated by a literal</think>tag.The full final answer is then emitted twice: a clean copy follows the leaked
</think>, then a second top-level heading appears and the whole document is re-emitted (often with minor wording variations between the two copies).
The combined effect is that downstream tooling β anything expecting one clean markdown document with reasoning stripped β gets junk: leaked monologue, then two stitched copies of the answer. On a 175k-token product-sheet synthesis task this produced a 44k-char content field where roughly the first half was internal monologue plus a duplicate of the second half. Pretty messy.
When the bug doesn't fire, output is excellent β this is a stochastic / intermittent issue, not deterministic.
Patched chat template
I went through the Jinja and found three issues that are worth fixing together. Patched template + detailed change notes are here:
https://gist.github.com/janreges/609e909215de5489255ee82b7481a321
Summary of the three fixes:
FIX 1 β restored the enable_thinking branch using the standard Qwen3
empty-block pre-fill trick (<think>\n\n</think>\n\n). The original
template unconditionally appended \n to the assistant
generation prompt, so chat_template_kwargs={"enable_thinking":false}
was silently a no-op β users actually got MORE reasoning chars in
"non-thinking" runs than in default runs in our benchmarks, because
there was no opt-out path at all.
FIX 2 β stop emitting an empty <think>\n\n</think>\n\n block when a past
assistant turn has no captured reasoning_content. The original
rendered the empty block whenever loop.last was true, regardless of
whether reasoning was present. This trains/conditions the model on
the pattern "empty thinking block is OK", which we suspect
contributes to the spurious </think> emission in long generation.
FIX 3 β hardened the </think> split that recovers reasoning_content from a
stringly-typed content. The original used split('</think>')[-1],
which destructively collapses any output containing multiple
</think> tokens (i.e. the exact failure mode of the bug).
Verification
After applying the patched template I ran the same 175k-token Czech
product-sheet synthesis prompt three times in parallel:
- 5/5 runs: no
</think>in content - 5/5 runs: single top-level heading (no body duplication)
Independent of the template
The body-duplication symptom (two copies of the final answer with slightly different wording) likely also has a model-weight component β looks like the distill data may contain examples where the assistant summarises the answer inside <think>...</think> and then re-emits it cleanly afterwards. If you have access to the SFT corpus, grepping for assistant turns where </think> appears more than once, or where the same top-level heading appears twice consecutively, would confirm this. The template patch reduces the trigger frequency but cannot fix it entirely without a follow-up SFT pass.
The model card also doesn't currently mention that enable_thinking=false is not supported by the published template β adding either FIX 1 or a one-line disclaimer would close that surprise.
Thanks for the comment! It's very useful. Today I'll fix the problems