---
license: apache-2.0
base_model: unsloth/Qwen3.5-2B
tags:
- voice-assistant
- tool-calling
- qwen3.5
- native-tool-format
- lora
- sft
language:
- en
---

# Qwen3.5-2B Voice Assistant (Tool Calling)

LoRA fine-tuned **unsloth/Qwen3.5-2B** for hands-free voice assistance with native Qwen3.5 XML tool calling. Trained on 11,044 conversations (572 tool-call, 10,472 voice-only).

## Tool Call Format

This model uses the **native Qwen3.5 XML parameter format**, the same format produced by the model's built-in `chat_template.jinja`. No custom prompt engineering is needed at inference.

```xml
Austin
```

This is parsed automatically by llama.cpp (`--jinja`), vLLM, LM Studio, and Ollama when using the bundled chat template.

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools")
tokenizer = AutoTokenizer.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools")
```

## Inference

### llama-server

```bash
./llama.cpp/build/bin/llama-server \
  -m unsloth/Qwen3.5-2B-q4_k_m.gguf \
  --jinja \
  --ctx-size 2048 \
  --temp 0.7 \
  --top-p 0.9 \
  --repeat-penalty 1.0 \
  --host 0.0.0.0 \
  --port 8080
```

> **Important:** Use `--jinja`. It reads the native `chat_template.jinja`
> bundled with the model, which handles tool schema injection and output parsing
> automatically. `--repeat-penalty 1.0` is critical: higher values corrupt
> XML structure in tool calls.

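
When the serving stack does not parse tool calls for you (for example, when decoding raw text with Transformers), the XML has to be extracted by hand. The sketch below is one way to do that. It assumes the Qwen-style tag layout `<tool_call>` / `<function=NAME>` / `<parameter=KEY>`, which this README does not spell out, and the parameter name `location` in the sample is hypothetical; check `chat_template.jinja` and `tools.json` for the real layout and names.

```python
import re

# Assumed tag layout; the parameter name "location" is a placeholder.
SAMPLE = """<tool_call>
<function=get_weather>
<parameter=location>
Austin
</parameter>
</function>
</tool_call>"""

_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
_FUNCTION_RE = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
_PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(text):
    """Return a list of {"name": ..., "arguments": {...}} dicts found in text."""
    calls = []
    for block in _CALL_RE.findall(text):
        match = _FUNCTION_RE.search(block)
        if match is None:
            continue  # malformed block, e.g. a missing closing tag
        name, body = match.groups()
        # Values arrive as strings; cast them according to your tool schema.
        arguments = {key: value.strip() for key, value in _PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls


print(parse_tool_calls(SAMPLE))
```

A block with a missing closing tag is skipped rather than half-parsed, which makes corrupted output (such as the kind a high repeat penalty can cause) easy to detect: the returned list is simply empty.
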
### OpenAI SDK (via llama-server or vLLM)

```python
from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

# Tool schemas shipped with this repo
with open("tools.json") as f:
    tools = json.load(f)

# First turn: the model either answers directly or requests a tool call
response = client.chat.completions.create(
    model="your-model",
    messages=[
        {"role": "system", "content": "You are a casual, hands-free voice assistant..."},
        {"role": "user", "content": "What's the weather in Austin?"},
    ],
    tools=tools,
    temperature=0.7,
    top_p=0.9,
)
message = response.choices[0].message

if message.tool_calls:
    tool_call = message.tool_calls[0]
    args = tool_call.function.arguments
    # Arguments may arrive as a JSON string or as a dict (see the note below)
    if isinstance(args, str):
        args = json.loads(args)
    # execute_tool is your own dispatcher; it is not defined in this snippet
    tool_result = execute_tool(tool_call.function.name, args)

    # Second turn: send the tool result back so the model can phrase a reply
    response2 = client.chat.completions.create(
        model="your-model",
        messages=[
            {"role": "system", "content": "You are a casual, hands-free voice assistant..."},
            {"role": "user", "content": "What's the weather in Austin?"},
            message,
            {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(tool_result)},
        ],
        tools=tools,
        temperature=0.7,
    )
    spoken = response2.choices[0].message.content
else:
    spoken = message.content
```

> **Known issue ([llama.cpp #20198](https://github.com/ggml-org/llama.cpp/issues/20198)):**
> `arguments` may be returned as a dict instead of a JSON string. The `isinstance`
> check above handles both.

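
The snippet above calls `execute_tool` without defining it. A minimal sketch of such a dispatcher follows. The handler bodies are placeholders, and the parameter names (`location`, `minutes`) are assumptions for illustration; the real names come from `tools.json`.

```python
import json


def get_weather(location):
    # Placeholder: a real handler would query a weather service.
    return {"location": location, "forecast": "unknown"}


def set_timer(minutes):
    # Placeholder: a real handler would schedule a timer.
    return {"status": "started", "minutes": minutes}


# Map tool names to handlers; extend with the other tools in tools.json.
TOOL_HANDLERS = {"get_weather": get_weather, "set_timer": set_timer}


def normalize_arguments(raw):
    """Accept arguments as a JSON string or as an already-parsed dict."""
    if isinstance(raw, str):
        return json.loads(raw) if raw.strip() else {}
    return dict(raw or {})


def execute_tool(name, raw_arguments):
    """Run the named tool and return a JSON-serializable result."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return handler(**normalize_arguments(raw_arguments))
    except (TypeError, ValueError) as exc:
        # TypeError: wrong or missing parameters; ValueError: invalid JSON
        return {"error": f"bad arguments for {name}: {exc}"}


print(execute_tool("get_weather", '{"location": "Austin"}'))
```

Returning an error dict instead of raising keeps the conversation going: the error can be sent back as the `tool` message so the model can apologize or ask the user to rephrase.
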
### Transformers (direct)

```python
import json

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools")
model = AutoModelForCausalLM.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools", device_map="auto")

messages = [
    {"role": "system", "content": "You are a casual, hands-free voice assistant..."},
    {"role": "user", "content": "Set a timer for 5 minutes"},
]
with open("tools.json") as f:
    tools = json.load(f)

# Native template handles tool schema injection automatically
prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

# do_sample=True is required for temperature and top_p to take effect
output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)

# Decode only the newly generated tokens; keep special tokens so tool-call tags stay visible
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=False))
```

### vLLM

```bash
vllm serve cowWhySo/qwen3_5_2B_voice_assistant_tools \
  --max-model-len 2048 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```

## Training Details

| Parameter | Value |
|---|---|
| Base model | `unsloth/Qwen3.5-2B` |
| Method | LoRA (r=16, alpha=32) |
| Precision | bf16 |
| Max seq length | 2048 |
| Learning rate | 0.0001 |
| Effective batch size | 64 |
| Epochs | 3 |
| Early stopping | patience=3 (eval every 15 steps) |
| Thinking | Disabled |

## Tools

`get_weather` · `set_timer` · `create_reminder` · `control_smart_home` · `play_music` · `web_search`

Full tool schemas are in `tools.json` in this repo.

## Design Decisions

- **Native Qwen3.5 format:** Training data is formatted using the model's own `chat_template.jinja`, so tool calls use the XML parameter format (`value`) that every inference framework expects. Zero custom prompt engineering at deployment.
- **Tools always visible:** Every training example (including voice-only) sees tool schemas in the system prompt, teaching the model when NOT to call tools.
- **Thinking disabled:** `enable_thinking=False` throughout training and inference. This avoids reasoning loops on a 2B model and keeps voice response latency low. For Qwen3.5 0.8B/2B/4B/9B, thinking is disabled by default.
- **Voice-first responses:** All non-tool assistant responses are filtered for conciseness (20 to 400 characters) and conversational tone (no markdown, lists, or code).
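
The exact filter used to build the dataset is not shown in this README. As a rough sketch of the voice-first rule above, the function below applies the 20 to 400 character bound and rejects text containing common markdown, list, or code markers. The regular expressions and the name `is_voice_friendly` are illustrative assumptions, not the author's implementation.

```python
import re

MIN_CHARS, MAX_CHARS = 20, 400

# Heuristic markers for content that does not read well aloud.
_MARKUP_PATTERNS = [
    re.compile(r"```"),                             # fenced code
    re.compile(r"`[^`]+`"),                         # inline code
    re.compile(r"^\s*#{1,6}\s", re.MULTILINE),      # markdown headings
    re.compile(r"^\s*[-*+]\s+", re.MULTILINE),      # bullet lists
    re.compile(r"^\s*\d+[.)]\s+", re.MULTILINE),    # numbered lists
    re.compile(r"\*\*[^*]+\*\*"),                   # bold text
]


def is_voice_friendly(text):
    """Return True if text is short, plain, and free of markup."""
    stripped = text.strip()
    if not MIN_CHARS <= len(stripped) <= MAX_CHARS:
        return False
    return not any(pattern.search(stripped) for pattern in _MARKUP_PATTERNS)


candidates = [
    "Sure, your timer is set for five minutes.",
    "Okay.",
    "Here are your options:\n- jazz\n- rock\n- classical",
]
kept = [text for text in candidates if is_voice_friendly(text)]
print(kept)
```

Only the first candidate survives: the second is under 20 characters and the third contains a bullet list.
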