---
license: apache-2.0
base_model: unsloth/Qwen3.5-2B
tags:
- voice-assistant
- tool-calling
- qwen3.5
- native-tool-format
- lora
- sft
language:
- en
---
# Qwen3.5-2B Voice Assistant (Tool Calling)
A LoRA fine-tune of **unsloth/Qwen3.5-2B** for hands-free voice assistance with
native Qwen3.5 XML tool calling. Trained on 11,044 conversations
(572 with tool calls, 10,472 voice-only).
## Tool Call Format
This model uses the **native Qwen3.5 XML parameter format** — the same format
produced by the model's built-in `chat_template.jinja`. No custom prompt
engineering is needed at inference.
```xml
<tool_call>
<function=get_weather>
<parameter=location>
Austin
</parameter>
</function>
</tool_call>
```
This is parsed automatically by llama.cpp (`--jinja`), vLLM, LM Studio,
and Ollama when using the bundled chat template.
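When working with raw generations (for example, text decoded directly from Transformers), the tool call has to be extracted by hand. The following is a minimal sketch, assuming the call is wrapped as `<tool_call>`, `<function=NAME>`, and `<parameter=KEY>` elements; that layout, the `parse_tool_calls` helper, and the `location` parameter name are illustrative assumptions rather than something this card confirms.

```python
import re

# Assumed layout of a raw tool call (not confirmed by this card):
# <tool_call><function=NAME><parameter=KEY>VALUE</parameter>...</function></tool_call>
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>", re.DOTALL
)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract tool calls from raw model output; argument values stay strings."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        args = {key: value.strip() for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": args})
    return calls


# Hypothetical raw output; the parameter name is illustrative only.
raw = """<tool_call>
<function=get_weather>
<parameter=location>
Austin
</parameter>
</function>
</tool_call>"""
print(parse_tool_calls(raw))
```

Servers that apply the bundled template already do this parsing, so a helper like this is only relevant when you decode tokens yourself.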
## Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools")
tokenizer = AutoTokenizer.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools")
```
## Inference
### llama-server
```bash
./llama.cpp/build/bin/llama-server \
  -m unsloth/Qwen3.5-2B-q4_k_m.gguf \
  --jinja \
  --ctx-size 2048 \
  --temp 0.7 \
  --top-p 0.9 \
  --repeat-penalty 1.0 \
  --host 0.0.0.0 \
  --port 8080
```
> **Important:** Use `--jinja`. It reads the native `chat_template.jinja`
> bundled with the model, which handles tool schema injection and output parsing
> automatically. Keep `--repeat-penalty 1.0` (no penalty): higher values
> penalize the repeated tag tokens and corrupt the XML structure in tool calls.
### OpenAI SDK (via llama-server or vLLM)
```python
from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

# Tool schemas shipped with this repo
with open("tools.json") as f:
    tools = json.load(f)

messages = [
    {"role": "system", "content": "You are a casual, hands-free voice assistant..."},
    {"role": "user", "content": "What's the weather in Austin?"},
]

# First pass: the model either answers directly or requests a tool
response = client.chat.completions.create(
    model="your-model",
    messages=messages,
    tools=tools,
    temperature=0.7,
    top_p=0.9,
)
message = response.choices[0].message

if message.tool_calls:
    tool_call = message.tool_calls[0]
    args = tool_call.function.arguments
    if isinstance(args, str):
        args = json.loads(args)
    # execute_tool is your own dispatcher; it is not defined in this snippet
    tool_result = execute_tool(tool_call.function.name, args)

    # Second pass: return the tool result so the model can phrase a spoken reply
    response2 = client.chat.completions.create(
        model="your-model",
        messages=messages + [
            message,
            {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(tool_result)},
        ],
        tools=tools,
        temperature=0.7,
    )
    spoken = response2.choices[0].message.content
else:
    spoken = message.content
```
> **Known issue ([llama.cpp #20198](https://github.com/ggml-org/llama.cpp/issues/20198)):**
> `arguments` may be returned as a dict instead of a JSON string. The `isinstance`
> check above handles both.
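The snippet above calls `execute_tool` without defining it. One possible shape for that function is sketched below, assuming a simple name-to-handler registry. The handlers, their parameter names (`location`, `duration`), and the error payload format are hypothetical placeholders, not part of this repo.

```python
import json
from typing import Any, Callable


# Hypothetical handlers; real ones would call a weather API, a timer service, etc.
def get_weather(location: str) -> dict:
    return {"location": location, "forecast": "unknown"}


def set_timer(duration: str) -> dict:
    return {"status": "started", "duration": duration}


TOOL_HANDLERS: dict[str, Callable[..., Any]] = {
    "get_weather": get_weather,
    "set_timer": set_timer,
}


def normalize_arguments(args: Any) -> dict:
    """Accept arguments as either a JSON string or an already-parsed dict."""
    return json.loads(args) if isinstance(args, str) else dict(args)


def execute_tool(name: str, args: Any) -> dict:
    """Dispatch a tool call by name; return an error payload instead of raising."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return handler(**normalize_arguments(args))
    except TypeError as exc:  # wrong or missing parameters
        return {"error": str(exc)}
```

Returning an error payload rather than raising lets the second request pass the failure back to the model, which can then apologize or ask the user to rephrase.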
### Transformers (direct)
```python
import json
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools")
model = AutoModelForCausalLM.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools", device_map="auto")

messages = [
    {"role": "system", "content": "You are a casual, hands-free voice assistant..."},
    {"role": "user", "content": "Set a timer for 5 minutes"},
]
with open("tools.json") as f:
    tools = json.load(f)

# Native template handles tool schema injection automatically
prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
# do_sample=True is required for temperature and top_p to take effect
output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
# Keep special tokens so the raw tool-call markup stays visible
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=False))
```
### vLLM
```bash
vllm serve cowWhySo/qwen3_5_2B_voice_assistant_tools \
  --max-model-len 2048 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```
## Training Details
| Parameter | Value |
|---|---|
| Base model | `unsloth/Qwen3.5-2B` |
| Method | LoRA (r=16, alpha=32) |
| Precision | bf16 |
| Max seq length | 2048 |
| Learning rate | 0.0001 |
| Effective batch size | 64 |
| Epochs | 3 |
| Early stopping | patience=3 (eval every 15 steps) |
| Thinking | Disabled |
## Tools
`get_weather` · `set_timer` · `create_reminder` · `control_smart_home` · `play_music` · `web_search`
Full tool schemas are in `tools.json` in this repo.
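Because `tools.json` is passed directly as the `tools` argument of the OpenAI SDK, its entries follow the OpenAI function-calling layout. A small pre-flight check against those schemas can catch malformed calls before a tool runs. The sketch below is illustrative only: the `EXAMPLE_TOOLS` entry and its `duration` parameter are hypothetical, and the real parameter names are whatever `tools.json` defines.

```python
# Hypothetical entry in the OpenAI function-calling layout; the real
# parameter names live in tools.json and may differ.
EXAMPLE_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "set_timer",
            "description": "Start a countdown timer.",
            "parameters": {
                "type": "object",
                "properties": {"duration": {"type": "string"}},
                "required": ["duration"],
            },
        },
    }
]


def check_tool_call(tools: list[dict], name: str, args: dict) -> list[str]:
    """Return a list of problems with a tool call; an empty list means it looks valid."""
    schemas = {t["function"]["name"]: t["function"].get("parameters", {}) for t in tools}
    if name not in schemas:
        return [f"unknown tool: {name}"]
    schema = schemas[name]
    problems = [
        f"missing required argument: {key}"
        for key in schema.get("required", [])
        if key not in args
    ]
    problems += [
        f"unexpected argument: {key}"
        for key in args
        if key not in schema.get("properties", {})
    ]
    return problems


print(check_tool_call(EXAMPLE_TOOLS, "set_timer", {"duration": "5 minutes"}))
```

This checks only names and presence, not value types; a full JSON Schema validator would be the stricter option.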
## Design Decisions
- **Native Qwen3.5 format:** Training data formatted using the model's own
`chat_template.jinja`, so tool calls use the XML parameter format
(`<parameter=key>value</parameter>`) that every
inference framework expects. Zero custom prompt engineering at deployment.
- **Tools always visible:** Every training example (including voice-only) sees
tool schemas in the system prompt, teaching the model when NOT to call tools.
- **Thinking disabled:** `enable_thinking=False` throughout training and
inference. This avoids reasoning loops on a 2B model and keeps response
latency low. For Qwen3.5 0.8B/2B/4B/9B, thinking is disabled by default.
- **Voice-first responses:** All non-tool assistant responses filtered for
conciseness (20-400 chars) and conversational tone (no markdown, lists,
or code).
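The voice-first filter described in the last bullet might look roughly like the sketch below. Only the 20-400 character bounds come from this card; the regular expression used to detect markdown, lists, and code, and the `is_voice_friendly` name, are assumptions, since the actual filtering rules are not published here.

```python
import re

# Assumed markers of non-voice formatting; the actual filter rules are not given.
MARKDOWN_RE = re.compile(
    r"`{3}"                      # fenced code
    r"|`[^`]+`"                  # inline code
    r"|\*\*[^*]+\*\*"            # bold text
    r"|^\s*#{1,6}\s"             # headings
    r"|^\s*(?:[-*+]|\d+\.)\s",   # bullet or numbered list items
    re.MULTILINE,
)


def is_voice_friendly(text: str, min_chars: int = 20, max_chars: int = 400) -> bool:
    """Keep responses that are short enough to speak and free of visual markup."""
    text = text.strip()
    return min_chars <= len(text) <= max_chars and MARKDOWN_RE.search(text) is None


samples = [
    "It's seventy two and sunny in Austin right now.",
    "Here are your options:\n- jazz\n- rock\n- classical music",
]
print([is_voice_friendly(s) for s in samples])
```

A heuristic like this errs on the side of rejecting borderline responses, which suits a dataset where spoken output must never contain visual formatting.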