lordx64 commited on
Commit
44aabe8
·
verified ·
1 Parent(s): 8695eea

Card: tighten tool-use claims throughout — system-prompt-conditional, tool-name vocab not bound, recipe for eliciting XML format added in three places

Browse files
Files changed (1) hide show
  1. README.md +32 -5
README.md CHANGED
@@ -36,7 +36,7 @@ datasets:
36
  Qwable-v1 is a **chained distill**: vanilla Qwen3.6-35B-A3B → SFT on Claude Opus 4.7 reasoning traces → SFT on Claude Fable-5 agentic tool-use traces. The result is an open-weights model that:
37
 
38
  - **Thinks** in explicit `<think>…</think>` chains-of-thought (inherited from the Opus 4.7 prior)
39
- - **Acts** like a Claude-Code-style agent, emitting `<tool_use>` XML blocks for file edits, shell commands, and reads (added by the Fable-5 SFT)
40
  - Runs on a single H200 / 2× A100-80GB at bf16, or any 24+ GB consumer GPU at IQ4_XS quantization
41
 
42
  ## Versioning — this is v1, more iterations planned
@@ -59,10 +59,12 @@ Qwen3.6-35B-A3B (vanilla, Apache 2.0)
59
 
60
  The Fable-5 SFT data is narrowly distributed (one developer's week of Claude Code sessions, ~5k turns, 81% tool-use endings). The reasoning prior comes from the Opus 4.7 step, not from Fable-5. Eval and use this model accordingly:
61
 
62
- - **For pure reasoning** (math, science Q&A, general knowledge): the underlying Opus 4.7 distill is what's doing the work. Qwable-v1 won't beat it on those benchmarks.
63
- - **For agentic coding** (edit-this-file, run-this-test, scroll-this-codebase): the Fable-5 SFT adds the tool-call patterns. This is where Qwable should outperform the base.
64
  - **For chat / general assistant**: works, but persona may drift toward Claude voice (double Anthropic SFT stacking).
65
 
 
 
66
  ## What's in the box
67
 
68
  - 26 `model-0000{1..26}-of-00026.safetensors` shards — merged bf16 weights (~70 GB total)
@@ -215,7 +217,7 @@ model = PeftModel.from_pretrained(base, "lordx64/Qwable-v1-adapter")
215
 
216
  ## Tool-use format
217
 
218
- The Fable-5 SFT data uses a **custom XML envelope** for tool calls, not Qwen's native `<tool_call>` token format. Outputs look like:
219
 
220
  ```
221
  <think>
@@ -238,7 +240,32 @@ Tool results come back as:
238
  </tool_result>
239
  ```
240
 
241
- This format is **chat-template-agnostic** and parses with a small regex. Downstream consumers wanting native Qwen tool calling will need either (a) a wrapper that converts the XML to `<tool_call>` JSON, or (b) a v2 of this model trained with the Qwen native format from scratch.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
242
 
243
  ## Limitations
244
 
 
36
  Qwable-v1 is a **chained distill**: vanilla Qwen3.6-35B-A3B → SFT on Claude Opus 4.7 reasoning traces → SFT on Claude Fable-5 agentic tool-use traces. The result is an open-weights model that:
37
 
38
  - **Thinks** in explicit `<think>…</think>` chains-of-thought (inherited from the Opus 4.7 prior)
39
+ - **Acts** like a Claude-Code-style agent when prompted as one — emits `<tool_use>` XML blocks for file edits, shell commands, and reads (added by the Fable-5 SFT). The XML format is **system-prompt-conditional**: it appears when you give the model an agent-style system prompt or supply a preceding `<tool_result>` turn. With a bare prompt and no agent framing, the model falls back to the Opus 4.7 reasoning-and-explain prior. See [Usage](#usage) for the recipe.
40
  - Runs on a single H200 / 2× A100-80GB at bf16, or any 24+ GB consumer GPU at IQ4_XS quantization
41
 
42
  ## Versioning — this is v1, more iterations planned
 
59
 
60
  The Fable-5 SFT data is narrowly distributed (one developer's week of Claude Code sessions, ~5k turns, 81% tool-use endings). The reasoning prior comes from the Opus 4.7 step, not from Fable-5. Eval and use this model accordingly:
61
 
62
+ - **For pure reasoning** (math, science Q&A, general knowledge): omit the agent system prompt or use a generic one. The underlying Opus 4.7 distill is what's doing the work. Qwable-v1 won't beat it on those benchmarks; it'll match.
63
+ - **For agentic coding** (edit-this-file, run-this-test, scroll-this-codebase): supply an agent system prompt that names the `<tool_use>` XML format. The Fable-5 SFT then adds the tool-call patterns on top of Opus 4.7's reasoning. This is where Qwable outperforms a vanilla Qwen3.6.
64
  - **For chat / general assistant**: works, but persona may drift toward Claude voice (double Anthropic SFT stacking).
65
 
66
+ Verified post-training (2026-06-15) with three prompt variants on the merged model: bare prompts produce markdown code blocks; agent-style system prompts produce correctly-formatted `<tool_use>` XML; multi-turn conversations with a prior `<tool_result>` continue in XML. See [Limitations](#limitations) for the format details.
67
+
68
  ## What's in the box
69
 
70
  - 26 `model-0000{1..26}-of-00026.safetensors` shards — merged bf16 weights (~70 GB total)
 
217
 
218
  ## Tool-use format
219
 
220
+ The Fable-5 SFT data uses a **custom XML envelope** for tool calls, not Qwen's native `<tool_call>` token format. Properly-elicited outputs look like:
221
 
222
  ```
223
  <think>
 
240
  </tool_result>
241
  ```
242
 
243
+ ### Eliciting the format reliably
244
+
245
+ Two paths produce the XML format consistently:
246
+
247
+ **1. Agent system prompt** — the simplest, works in one-shot:
248
+
249
+ ```
250
+ system: You are a coding agent. When you need to read, write, edit, or run code,
251
+ emit XML tool calls in this exact format:
252
+ <tool_use name="X" id="toolu_01abc">
253
+ {"...": "..."}
254
+ </tool_use>
255
+ Do NOT respond with markdown code blocks. Always use <tool_use> XML.
256
+ ```
257
+
258
+ **2. Multi-turn conversation** — supply a prior `<tool_result>` and the model continues in XML for the rest of the conversation, no system prompt needed.
259
+
260
+ Without either, Qwable-v1 falls back to the Opus 4.7 prior and explains the fix in markdown code blocks instead. The format **is** learned (verified at smoke + full-run spot-check); it just only appears when the conversation distribution looks agentic.
261
+
262
+ ### Tool names are not bound to the Claude Code inventory
263
+
264
+ The training data uses Claude Code's tool names (`Read`, `Edit`, `Bash`, `WebFetch`, `mcp__*`, etc.). The merged model emits sensible-but-invented names like `read_file`, `Replace`, `write_file` instead. The XML *envelope* transferred; the *vocabulary* didn't bind. Downstream consumers define their own tool registry anyway, so this is rarely an issue — but anything that routes calls by exact tool name needs a normalizer (e.g. `read_file` → `Read`).
265
+
266
+ ### Native Qwen tool calling
267
+
268
+ This format is **chat-template-agnostic** and parses with a small regex. Downstream consumers wanting native Qwen `<tool_call>` JSON calling will need either (a) a wrapper that converts the XML to `<tool_call>` JSON, or (b) a v2 of this model trained with the Qwen native format from scratch.
269
 
270
  ## Limitations
271