---
license: apache-2.0
base_model: unsloth/Qwen3.5-2B
tags:
  - voice-assistant
  - tool-calling
  - qwen3.5
  - native-tool-format
  - lora
  - sft
language:
  - en
---

# Qwen3.5-2B Voice Assistant (Tool Calling)

A LoRA fine-tune of **unsloth/Qwen3.5-2B** for hands-free voice assistance with
native Qwen3.5 XML tool calling. Trained on 11,044 conversations
(572 tool-call, 10,472 voice-only).

## Tool Call Format

This model uses the **native Qwen3.5 XML parameter format** — the same format
produced by the model's built-in `chat_template.jinja`. No custom prompt
engineering is needed at inference.
```xml
<tool_call>
<function=get_weather>
<parameter=location>
Austin
</parameter>
</function>
</tool_call>
```

This is parsed automatically by llama.cpp (`--jinja`), vLLM, LM Studio,
and Ollama when using the bundled chat template.
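
If you work with raw generated text instead (for example, the direct Transformers path below), you would need to extract the call yourself. The following is a minimal sketch, assuming the output follows exactly the format shown above. `parse_tool_calls` is a hypothetical helper, not part of any library, and it returns every parameter value as a string.

```python
import re

# Matches one <function=NAME>...</function> body inside a <tool_call> block.
_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
# Matches each <parameter=KEY>VALUE</parameter> pair within a function body.
_PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract tool calls from raw model output as {"name", "arguments"} dicts.

    Parameter values are returned as stripped strings; converting them to
    numbers or booleans is left to the caller.
    """
    calls = []
    for name, body in _CALL_RE.findall(text):
        arguments = {key: value.strip() for key, value in _PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls


raw = """<tool_call>
<function=get_weather>
<parameter=location>
Austin
</parameter>
</function>
</tool_call>"""

print(parse_tool_calls(raw))
# [{'name': 'get_weather', 'arguments': {'location': 'Austin'}}]
```

Plain spoken replies contain no `<tool_call>` block, so the helper returns an empty list for them.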

## Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools")
tokenizer = AutoTokenizer.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools")
```

## Inference

### llama-server
```bash
./llama.cpp/build/bin/llama-server \
    -m unsloth/Qwen3.5-2B-q4_k_m.gguf \
    --jinja \
    --ctx-size 2048 \
    --temp 0.7 \
    --top-p 0.9 \
    --repeat-penalty 1.0 \
    --host 0.0.0.0 \
    --port 8080
```

> **Important:** Use `--jinja` — this reads the native `chat_template.jinja`
> bundled with the model, which handles tool schema injection and output parsing
> automatically. `--repeat-penalty 1.0` is critical — higher values corrupt
> XML structure in tool calls.

### OpenAI SDK (via llama-server or vLLM)
```python
from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
tools = json.load(open("tools.json"))

response = client.chat.completions.create(
    model="your-model",
    messages=[
        {"role": "system", "content": "You are a casual, hands-free voice assistant..."},
        {"role": "user", "content": "What's the weather in Austin?"},
    ],
    tools=tools,
    temperature=0.7,
    top_p=0.9,
)

message = response.choices[0].message

if message.tool_calls:
    tool_call = message.tool_calls[0]
    args = tool_call.function.arguments
    if isinstance(args, str):
        args = json.loads(args)
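    # execute_tool is your own dispatcher (not defined in this card); it should
    # return a JSON-serializable result.
    tool_result = execute_tool(tool_call.function.name, args)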

    response2 = client.chat.completions.create(
        model="your-model",
        messages=[
            {"role": "system", "content": "You are a casual, hands-free voice assistant..."},
            {"role": "user", "content": "What's the weather in Austin?"},
            message,
            {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(tool_result)},
        ],
        tools=tools,
        temperature=0.7,
    )
    spoken = response2.choices[0].message.content
else:
    spoken = message.content
```

> **Known issue ([llama.cpp #20198](https://github.com/ggml-org/llama.cpp/issues/20198)):**
> `arguments` may be returned as a dict instead of a JSON string. The `isinstance`
> check above handles both.
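
`execute_tool` in the example above is not defined in this card. One possible shape for it is a small registry that maps tool names to Python callables. This is a sketch under assumptions: the handler bodies are placeholders, and the `duration` parameter name is hypothetical (check `tools.json` for the real schemas).

```python
from typing import Any, Callable


# Hypothetical handlers: real ones would call a weather API, a timer service, etc.
def get_weather(location: str) -> dict[str, Any]:
    return {"location": location, "forecast": "placeholder"}


def set_timer(duration: str) -> dict[str, Any]:
    return {"status": "started", "duration": duration}


TOOL_REGISTRY: dict[str, Callable[..., dict[str, Any]]] = {
    "get_weather": get_weather,
    "set_timer": set_timer,
}


def execute_tool(name: str, args: dict[str, Any]) -> dict[str, Any]:
    """Run the named tool and return a JSON-serializable result."""
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return handler(**args)
    except TypeError as exc:  # missing or unexpected arguments from the model
        return {"error": f"bad arguments for {name}: {exc}"}
```

Returning an error dict rather than raising lets the second completion turn the failure into a spoken apology instead of crashing the loop.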

### Transformers (direct)
```python
import json

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools")
model = AutoModelForCausalLM.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools", device_map="auto")

messages = [
    {"role": "system", "content": "You are a casual, hands-free voice assistant..."},
    {"role": "user", "content": "Set a timer for 5 minutes"},
]
tools = json.load(open("tools.json"))

# Native template handles tool schema injection automatically
prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
# do_sample=True is required for temperature/top_p to take effect
output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=False))
```

### vLLM
```bash
vllm serve cowWhySo/qwen3_5_2B_voice_assistant_tools \
    --max-model-len 2048 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder
```

## Training Details

| Parameter | Value |
|---|---|
| Base model | `unsloth/Qwen3.5-2B` |
| Method | LoRA (r=16, alpha=32) |
| Precision | bf16 |
| Max seq length | 2048 |
| Learning rate | 0.0001 |
| Effective batch size | 64 |
| Epochs | 3 |
| Early stopping | patience=3 (eval every 15 steps) |
| Thinking | Disabled |
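
As a rough sanity check on the table, the sketch below derives the LoRA scaling factor and an upper bound on optimizer steps. It assumes standard LoRA scaling (`alpha / r`) and that all 11044 conversations are in the training split. The card does not state the train/eval split, so the real step count is likely lower.

```python
import math

# Values taken from the table and the dataset size quoted at the top of the card.
r, alpha = 16, 32
conversations = 11044
effective_batch = 64
epochs = 3
eval_every = 15

scaling = alpha / r  # standard LoRA scaling applied to the adapter update
steps_per_epoch = math.ceil(conversations / effective_batch)
max_steps = steps_per_epoch * epochs  # upper bound; early stopping may end sooner
max_evals = max_steps // eval_every

print(scaling, steps_per_epoch, max_steps, max_evals)
# 2.0 173 519 34
```

With patience=3 and an evaluation every 15 steps, early stopping would trigger after 45 steps without improvement.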

## Tools

`get_weather` · `set_timer` · `create_reminder` · `control_smart_home` · `play_music` · `web_search`

Full tool schemas are in `tools.json` in this repo.
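
For orientation, a tool entry in the OpenAI function-calling style that the `tools=` argument accepts might look like the following. This is illustrative only: the description text and field details are assumptions, and the actual schemas in `tools.json` may differ.

```python
import json

# Illustrative only: the real schema lives in tools.json and may differ.
example_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name, e.g. Austin"},
            },
            "required": ["location"],
        },
    },
}

tools = [example_tool]
print(json.dumps(tools, indent=2))
```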

## Design Decisions

- **Native Qwen3.5 format:** Training data is formatted with the model's own
  `chat_template.jinja`, so tool calls use the XML parameter format
  (`<function=name><parameter=key>value</parameter></function>`) that
  template-aware inference frameworks expect. No custom prompt engineering is
  needed at deployment.
- **Tools always visible:** Every training example (including voice-only) sees
  tool schemas in the system prompt, teaching the model when NOT to call tools.
- **Thinking disabled:** `enable_thinking=False` throughout training and
  inference. This avoids reasoning loops on a 2B model and keeps response
  latency low for voice use. For Qwen3.5 0.8B/2B/4B/9B, thinking is disabled
  by default.
- **Voice-first responses:** All non-tool assistant responses are filtered for
  conciseness (20-400 chars) and conversational tone (no markdown, lists,
  or code).
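
The voice-first filter described in the last bullet could be approximated as below. This is a sketch, not the filter used to build the dataset: the regex patterns are assumptions about what counts as markdown, lists, or code, and the 20-400 bounds are treated as inclusive.

```python
import re

MIN_CHARS, MAX_CHARS = 20, 400

# Heuristic patterns for formatting that does not read well aloud.
_MARKDOWN_PATTERNS = [
    re.compile(r"```"),                                   # fenced code
    re.compile(r"`[^`]+`"),                               # inline code
    re.compile(r"^\s*#{1,6}\s", re.MULTILINE),            # headings
    re.compile(r"^\s*([-*+]|\d+[.)])\s", re.MULTILINE),   # bullet or numbered lists
    re.compile(r"\*\*[^*]+\*\*"),                         # bold
    re.compile(r"\[[^\]]+\]\([^)]+\)"),                   # links
]


def is_voice_friendly(text: str) -> bool:
    """Return True if a response is short enough and free of markup."""
    stripped = text.strip()
    if not MIN_CHARS <= len(stripped) <= MAX_CHARS:
        return False
    return not any(p.search(stripped) for p in _MARKDOWN_PATTERNS)
```

For example, `is_voice_friendly("It's about 75 degrees and sunny in Austin right now.")` passes, while a one-word reply or a bulleted list is rejected.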