cowWhySo/qwen3_5_2B_voice_assistant_tools-lora

@@ -1,22 +1,182 @@
 ---
 base_model: unsloth/Qwen3.5-2B
 tags:
-- text-generation-inference
-- transformers
-- unsloth
-- qwen3_5
-- trl
-license: apache-2.0
 language:
-- en
 ---
-# Uploaded  model
-- **Developed by:** cowWhySo
-- **License:** apache-2.0
-- **Finetuned from model :** unsloth/Qwen3.5-2B
-This qwen3_5 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth)
-[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

 ---
+license: apache-2.0
 base_model: unsloth/Qwen3.5-2B
 tags:
+  - voice-assistant
+  - tool-calling
+  - qwen3.5
+  - native-tool-format
+  - lora
+  - sft
 language:
+  - en
 ---
+# Qwen3.5-2B Voice Assistant (Tool Calling)
+LoRA fine-tuned **unsloth/Qwen3.5-2B** for hands-free voice assistance with
+native Qwen3.5 XML tool calling. Trained on 11,044 conversations
+(572 tool-call, 10,472 voice-only).
+## Tool Call Format
+This model uses the **native Qwen3.5 XML parameter format** — the same format
+produced by the model's built-in `chat_template.jinja`. No custom prompt
+engineering is needed at inference.
+```xml
+<tool_call>
+<function=get_weather>
+<parameter=location>
+Austin
+</parameter>
+</function>
+</tool_call>
+```
+This is parsed automatically by llama.cpp (`--jinja`), vLLM (with a tool-call
+parser enabled), LM Studio, and Ollama when using the bundled chat template.
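When working with raw model output (for example, text decoded directly from Transformers), the XML layout above may need to be parsed by hand. The following is a minimal sketch, assuming the output follows exactly the structure shown in the example; `parse_tool_calls` and the regex names are hypothetical helpers, not part of this repo or of any inference framework. All parameter values come back as strings, so numeric or boolean arguments would still need converting against the tool schema.

```python
import re

# Patterns for the XML layout shown above; surrounding whitespace is trimmed.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
FUNCTION_RE = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
PARAMETER_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract tool calls from raw model output (hypothetical helper)."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        for name, body in FUNCTION_RE.findall(block):
            # Every value is kept as a stripped string.
            arguments = {
                key: value.strip()
                for key, value in PARAMETER_RE.findall(body)
            }
            calls.append({"name": name, "arguments": arguments})
    return calls


raw_output = """<tool_call>
<function=get_weather>
<parameter=location>
Austin
</parameter>
</function>
</tool_call>"""

print(parse_tool_calls(raw_output))
# [{'name': 'get_weather', 'arguments': {'location': 'Austin'}}]
```

Text without a `<tool_call>` block yields an empty list, which can be treated as a voice-only reply.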
+## Quick Start
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model = AutoModelForCausalLM.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools")
+tokenizer = AutoTokenizer.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools")
+```
+## Inference
+### llama-server
+```bash
+./llama.cpp/build/bin/llama-server \
+    -m unsloth/Qwen3.5-2B-q4_k_m.gguf \
+    --jinja \
+    --ctx-size 2048 \
+    --temp 0.7 \
+    --top-p 0.9 \
+    --repeat-penalty 1.0 \
+    --host 0.0.0.0 \
+    --port 8080
+```
+> **Important:** Use `--jinja` — this reads the native `chat_template.jinja`
+> bundled with the model, which handles tool schema injection and output parsing
+> automatically. `--repeat-penalty 1.0` is critical — higher values corrupt
+> XML structure in tool calls.
+### OpenAI SDK (via llama-server or vLLM)
+```python
+from openai import OpenAI
+import json
+client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
+tools = json.load(open("tools.json"))
+response = client.chat.completions.create(
+    model="your-model",
+    messages=[
+        {"role": "system", "content": "You are a casual, hands-free voice assistant..."},
+        {"role": "user", "content": "What's the weather in Austin?"},
+    ],
+    tools=tools,
+    temperature=0.7,
+    top_p=0.9,
+)
+message = response.choices[0].message
+if message.tool_calls:
+    tool_call = message.tool_calls[0]
+    args = tool_call.function.arguments
+    if isinstance(args, str):
+        args = json.loads(args)
+    # execute_tool is your own dispatcher: it maps a tool name to a handler
+    tool_result = execute_tool(tool_call.function.name, args)
+    response2 = client.chat.completions.create(
+        model="your-model",
+        messages=[
+            {"role": "system", "content": "You are a casual, hands-free voice assistant..."},
+            {"role": "user", "content": "What's the weather in Austin?"},
+            message,
+            {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(tool_result)},
+        ],
+        tools=tools,
+        temperature=0.7,
+    )
+    spoken = response2.choices[0].message.content
+else:
+    spoken = message.content
+```
+> **Known issue ([llama.cpp #20198](https://github.com/ggml-org/llama.cpp/issues/20198)):**
+> `arguments` may be returned as a dict instead of a JSON string. The `isinstance`
+> check above handles both.
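The snippet above calls `execute_tool` without defining it. One possible shape for it is sketched below, assuming a plain dictionary that maps tool names to Python handlers. The handler body, the `TOOL_HANDLERS` registry, and the returned dictionary layout are all hypothetical; only the tool name `get_weather` and its `location` parameter come from this card.

```python
from typing import Any, Callable


def get_weather(location: str) -> dict[str, Any]:
    # Stub handler: a real one would query a weather service.
    return {"location": location, "forecast": "unknown"}


# Hypothetical registry; the other tools from this card would be added here.
TOOL_HANDLERS: dict[str, Callable[..., Any]] = {
    "get_weather": get_weather,
}


def execute_tool(name: str, args: dict[str, Any]) -> dict[str, Any]:
    """Run the handler registered for `name` and wrap its result."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return {"result": handler(**args)}
    except TypeError as exc:
        # Raised for missing or unexpected arguments; note that this also
        # catches a TypeError raised inside the handler itself.
        return {"error": f"bad arguments for {name}: {exc}"}


print(execute_tool("get_weather", {"location": "Austin"}))
```

Returning an error dictionary instead of raising lets the failure be passed back to the model as the `tool` message content, so it can produce a spoken apology or a retry.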
+### Transformers (direct)
+```python
+import json
+from transformers import AutoTokenizer, AutoModelForCausalLM
+tokenizer = AutoTokenizer.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools")
+model = AutoModelForCausalLM.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools", device_map="auto")
+messages = [
+    {"role": "system", "content": "You are a casual, hands-free voice assistant..."},
+    {"role": "user", "content": "Set a timer for 5 minutes"},
+]
+tools = json.load(open("tools.json"))
+# Native template handles tool schema injection automatically
+prompt = tokenizer.apply_chat_template(
+    messages,
+    tools=tools,
+    tokenize=False,
+    add_generation_prompt=True,
+    enable_thinking=False,
+)
+inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
+# Enable sampling explicitly so temperature and top_p are applied
+output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
+print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=False))
+```
+### vLLM
+```bash
+vllm serve cowWhySo/qwen3_5_2B_voice_assistant_tools \
+    --max-model-len 2048 \
+    --enable-auto-tool-choice \
+    --tool-call-parser qwen3_coder
+```
+## Training Details
+| Parameter | Value |
+|---|---|
+| Base model | `unsloth/Qwen3.5-2B` |
+| Method | LoRA (r=16, alpha=32) |
+| Precision | bf16 |
+| Max seq length | 2048 |
+| Learning rate | 0.0001 |
+| Effective batch size | 64 |
+| Epochs | 3 |
+| Early stopping | patience=3 (eval every 15 steps) |
+| Thinking | Disabled |
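To illustrate how the early-stopping row in the table plays out, here is a small sketch of patience-based stopping. It is not the training loop used for this model: the card does not say which metric was monitored, so evaluation loss and a strict-improvement rule are assumptions, and `early_stop_step` and the loss values are made up for the example.

```python
def early_stop_step(eval_losses, patience=3, eval_every=15):
    """Return the training step at which stopping would trigger, or None."""
    best = float("inf")
    evals_without_improvement = 0
    for eval_index, loss in enumerate(eval_losses, start=1):
        if loss < best:
            # Strict improvement resets the patience counter.
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return eval_index * eval_every
    return None


# Illustrative losses only: three evaluations in a row fail to beat 0.90.
losses = [1.20, 0.95, 0.90, 0.91, 0.92, 0.93, 0.80]
print(early_stop_step(losses))  # 90
```

With evaluation every 15 steps and a patience of 3, training stops at the earliest 45 steps after the last improvement.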
+## Tools
+`get_weather` · `set_timer` · `create_reminder` · `control_smart_home` · `play_music` · `web_search`
+Full tool schemas are in `tools.json` in this repo.
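For readers who do not have `tools.json` at hand, the sketch below shows what a single entry could look like in the OpenAI function-calling format accepted by the `tools=` argument. This is an assumption, not a copy of the file in this repo: only the tool name `get_weather` and its `location` parameter appear in this card, while the descriptions and the `required` list are invented for illustration.

```python
import json

# Hypothetical schema; consult tools.json in the repo for the real definitions.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a place.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City name, e.g. Austin",
                },
            },
            "required": ["location"],
        },
    },
}

tools = [get_weather_tool]
print(json.dumps(tools, indent=2))
```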
+## Design Decisions
+- **Native Qwen3.5 format:** Training data is formatted with the model's own
+  `chat_template.jinja`, so tool calls use the XML parameter format
+  (`<function=name><parameter=key>value</parameter></function>`) that the
+  inference frameworks listed above parse. No custom prompt engineering is
+  needed at deployment.
+- **Tools always visible:** Every training example (including voice-only) sees
+  tool schemas in the system prompt, teaching the model when NOT to call tools.
+- **Thinking disabled:** `enable_thinking=False` throughout training and
+  inference — avoids reasoning loops on a 2B model and keeps voice responses
+  instant. For Qwen3.5 0.8B/2B/4B/9B, thinking is disabled by default.
+- **Voice-first responses:** All non-tool assistant responses filtered for
+  conciseness (20-400 chars) and conversational tone (no markdown, lists,
+  or code).
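The voice-first filter described in the last bullet could be approximated as follows. This is a rough sketch under stated assumptions, not the filter actually used to build the dataset: the 20-400 character bounds come from this card, but the markdown heuristics, `MARKDOWN_RE`, and `is_voice_friendly` are hypothetical.

```python
import re

# Heuristic markers of markdown, lists, or code (assumed, not from the repo).
MARKDOWN_RE = re.compile(
    r"```"                # fenced code
    r"|`[^`]+`"           # inline code
    r"|^\s*#{1,6}\s"      # headings
    r"|^\s*[-*+]\s"       # bullet lists
    r"|^\s*\d+[.)]\s"     # numbered lists
    r"|\*\*[^*]+\*\*",    # bold text
    re.MULTILINE,
)


def is_voice_friendly(text: str, min_chars: int = 20, max_chars: int = 400) -> bool:
    """Keep responses that are short enough to speak and free of markup."""
    text = text.strip()
    if not (min_chars <= len(text) <= max_chars):
        return False
    return MARKDOWN_RE.search(text) is None


print(is_voice_friendly("Sure thing, your timer is set for five minutes."))  # True
print(is_voice_friendly("Here is your list:\n- milk\n- eggs"))  # False
```

A check for conversational tone would need more than pattern matching; this sketch only covers the length and formatting constraints.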