cowWhySo committed on
Commit
c1a919a
·
verified ·
1 Parent(s): 5c6928e

Upload README.md with huggingface_hub

  1. README.md +173 -13
README.md CHANGED
@@ -1,22 +1,182 @@
1
  ---
 
2
  base_model: unsloth/Qwen3.5-2B
3
  tags:
4
- - text-generation-inference
5
- - transformers
6
- - unsloth
7
- - qwen3_5
8
- - trl
9
- license: apache-2.0
10
  language:
11
- - en
12
  ---
13
 
14
- # Uploaded model
15
 
16
- - **Developed by:** cowWhySo
17
- - **License:** apache-2.0
18
- - **Finetuned from model :** unsloth/Qwen3.5-2B
19
 
20
- This qwen3_5 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth)
21
 
22
- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
1
  ---
2
+ license: apache-2.0
3
  base_model: unsloth/Qwen3.5-2B
4
  tags:
5
+ - voice-assistant
6
+ - tool-calling
7
+ - qwen3.5
8
+ - native-tool-format
9
+ - lora
10
+ - sft
11
  language:
12
+ - en
13
  ---
14
 
15
+ # Qwen3.5-2B Voice Assistant (Tool Calling)
16
+
17
+ A LoRA fine-tune of **unsloth/Qwen3.5-2B** for hands-free voice assistance with
18
+ native Qwen3.5 XML tool calling. Trained on 11,044 conversations
19
+ (572 with tool calls, 10,472 voice-only).
20
+
21
+ ## Tool Call Format
22
+
23
+ This model uses the **native Qwen3.5 XML parameter format** — the same format
24
+ produced by the model's built-in `chat_template.jinja`. No custom prompt
25
+ engineering is needed at inference.
26
+ ```xml
27
+ <tool_call>
28
+ <function=get_weather>
29
+ <parameter=location>
30
+ Austin
31
+ </parameter>
32
+ </function>
33
+ </tool_call>
34
+ ```
35
+
36
+ llama.cpp (with `--jinja`), vLLM (with a tool-call parser enabled), LM Studio,
37
+ and Ollama parse this format automatically when they use the bundled chat template.
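If a runtime hands back the raw completion text instead of structured tool calls, the XML block can be parsed by hand. The following is a minimal sketch, not the parser any of the frameworks above actually use; `parse_tool_calls` is a hypothetical helper, and it returns every parameter value as a string without applying schema types.

```python
import re

# Matches one <tool_call> block and captures the function name and its body.
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>",
    re.DOTALL,
)
# Matches one <parameter=key>value</parameter> pair inside a function body.
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(text):
    """Return a list of {"name", "arguments"} dicts found in raw model output."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        # Values are kept as stripped strings; no type coercion is attempted.
        args = {key: value.strip() for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": args})
    return calls


raw = """<tool_call>
<function=get_weather>
<parameter=location>
Austin
</parameter>
</function>
</tool_call>"""

print(parse_tool_calls(raw))
# [{'name': 'get_weather', 'arguments': {'location': 'Austin'}}]
```

Plain spoken replies contain no `<tool_call>` block, so the helper returns an empty list for them.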
38
+
39
+ ## Quick Start
40
+ ```python
41
+ from transformers import AutoModelForCausalLM, AutoTokenizer
42
+
43
+ model = AutoModelForCausalLM.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools")
44
+ tokenizer = AutoTokenizer.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools")
45
+ ```
46
+
47
+ ## Inference
48
+
49
+ ### llama-server
50
+ ```bash
51
+ ./llama.cpp/build/bin/llama-server \
52
+ -m unsloth/Qwen3.5-2B-q4_k_m.gguf \
53
+ --jinja \
54
+ --ctx-size 2048 \
55
+ --temp 0.7 \
56
+ --top-p 0.9 \
57
+ --repeat-penalty 1.0 \
58
+ --host 0.0.0.0 \
59
+ --port 8080
60
+ ```
61
+
62
+ > **Important:** Use `--jinja`. It makes llama-server read the native `chat_template.jinja`
63
+ > bundled with the model, which handles tool schema injection and output parsing
64
+ > automatically. Keep `--repeat-penalty 1.0` (penalty disabled): higher values penalize the
65
+ > tag tokens that XML necessarily repeats and corrupt the structure of tool calls.
66
+
67
+ ### OpenAI SDK (via llama-server or vLLM)
+ ```python
+ import json
+
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
+ tools = json.load(open("tools.json"))
+
+ # First request: the model either answers directly or asks for a tool.
+ response = client.chat.completions.create(
+     model="your-model",
+     messages=[
+         {"role": "system", "content": "You are a casual, hands-free voice assistant..."},
+         {"role": "user", "content": "What's the weather in Austin?"},
+     ],
+     tools=tools,
+     temperature=0.7,
+     top_p=0.9,
+ )
+
+ message = response.choices[0].message
+
+ if message.tool_calls:
+     tool_call = message.tool_calls[0]
+     args = tool_call.function.arguments
+     # Some servers return a dict instead of a JSON string; accept both.
+     if isinstance(args, str):
+         args = json.loads(args)
+     # execute_tool is your own dispatcher; it is not defined in this README.
+     tool_result = execute_tool(tool_call.function.name, args)
+
+     # Second request: send the tool result back so the model can phrase a spoken reply.
+     response2 = client.chat.completions.create(
+         model="your-model",
+         messages=[
+             {"role": "system", "content": "You are a casual, hands-free voice assistant..."},
+             {"role": "user", "content": "What's the weather in Austin?"},
+             message,
+             {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(tool_result)},
+         ],
+         tools=tools,
+         temperature=0.7,
+     )
+     spoken = response2.choices[0].message.content
+ else:
+     spoken = message.content
+ ```
110
+
111
+ > **Known issue ([llama.cpp #20198](https://github.com/ggml-org/llama.cpp/issues/20198)):**
112
+ > `arguments` may be returned as a dict instead of a JSON string. The `isinstance`
113
+ > check above handles both.
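The OpenAI SDK example calls `execute_tool`, which this README does not define. One way to fill that gap is a small dispatch table, sketched below under stated assumptions: the handler bodies are placeholders, and the `duration` parameter of `set_timer` is a guess, since the real parameter names live in `tools.json`.

```python
import json


def get_weather(location):
    # Placeholder: a real handler would call a weather API here.
    return {"location": location, "forecast": "unknown"}


def set_timer(duration):
    # Placeholder; the parameter name "duration" is assumed, not taken from tools.json.
    return {"status": "started", "duration": duration}


# Hypothetical registry mapping tool names to local handlers.
HANDLERS = {"get_weather": get_weather, "set_timer": set_timer}


def execute_tool(name, args):
    """Run the named tool and return a JSON-serializable result."""
    # Accept both a JSON string and an already-parsed dict.
    if isinstance(args, str):
        args = json.loads(args)
    handler = HANDLERS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return handler(**args)
    except TypeError as exc:
        # Raised when the model supplies missing or unexpected parameters.
        return {"error": f"bad arguments for {name}: {exc}"}
```

Returning an error dict instead of raising lets the follow-up request pass the failure back to the model, which can then apologize or retry in its spoken reply.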
114
+
115
+ ### Transformers (direct)
+ ```python
+ import json
+
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ tokenizer = AutoTokenizer.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools")
+ model = AutoModelForCausalLM.from_pretrained("cowWhySo/qwen3_5_2B_voice_assistant_tools", device_map="auto")
+
+ messages = [
+     {"role": "system", "content": "You are a casual, hands-free voice assistant..."},
+     {"role": "user", "content": "Set a timer for 5 minutes"},
+ ]
+ tools = json.load(open("tools.json"))
+
+ # Native template handles tool schema injection automatically
+ prompt = tokenizer.apply_chat_template(
+     messages,
+     tools=tools,
+     tokenize=False,
+     add_generation_prompt=True,
+     enable_thinking=False,
+ )
+ inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
+ # Set do_sample=True explicitly so temperature and top_p are applied.
+ output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
+ # Decode only the newly generated tokens; special tokens are kept so tool-call tags stay visible.
+ print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=False))
+ ```
140
+
141
+ ### vLLM
142
+ ```bash
143
+ vllm serve cowWhySo/qwen3_5_2B_voice_assistant_tools \
144
+ --max-model-len 2048 \
145
+ --enable-auto-tool-choice \
146
+ --tool-call-parser qwen3_coder
147
+ ```
148
+
149
+ ## Training Details
150
+
151
+ | Parameter | Value |
152
+ |---|---|
153
+ | Base model | `unsloth/Qwen3.5-2B` |
154
+ | Method | LoRA (r=16, alpha=32) |
155
+ | Precision | bf16 |
156
+ | Max seq length | 2048 |
157
+ | Learning rate | 0.0001 |
158
+ | Effective batch size | 64 |
159
+ | Epochs | 3 |
160
+ | Early stopping | patience=3 (eval every 15 steps) |
161
+ | Thinking | Disabled |
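The table implies a rough training budget, worked through below. This is back-of-the-envelope arithmetic, not a reproduction of the training script. It assumes all 11,044 conversations are in the training split and that the last partial batch is kept; since an eval set exists, the real step counts are probably somewhat lower. It also assumes "patience=3" means three consecutive evaluations without improvement.

```python
import math

total_conversations = 11044   # from the model description
tool_call_conversations = 572
effective_batch_size = 64
epochs = 3
eval_every = 15               # steps between evaluations
patience = 3                  # evaluations without improvement before stopping

# Optimizer steps per epoch, keeping the final partial batch.
steps_per_epoch = math.ceil(total_conversations / effective_batch_size)
max_steps = steps_per_epoch * epochs

# Steps the run can go without improvement before early stopping triggers.
steps_before_stop = eval_every * patience

# Share of conversations that contain a tool call, in percent.
tool_share = round(100 * tool_call_conversations / total_conversations, 1)

print(steps_per_epoch, max_steps, steps_before_stop, tool_share)
# 173 519 45 5.2
```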
162
+
163
+ ## Tools
164
+
165
+ `get_weather` · `set_timer` · `create_reminder` · `control_smart_home` · `play_music` · `web_search`
166
 
167
+ Full tool schemas are in `tools.json` in this repo.
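Because the inference examples pass the contents of `tools.json` straight to the `tools` argument, the file presumably holds a list in the OpenAI function-calling schema. The entry below is an illustrative guess at one element, not a copy of the real file: only the `get_weather` name and its `location` parameter appear in this README, and the descriptions are invented for the sketch.

```python
import json

# Hypothetical tools.json content with a single entry; the real file defines six tools.
example_tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name, e.g. Austin"}
                },
                "required": ["location"],
            },
        },
    }
]


def tool_names(tools):
    """List the function names declared in an OpenAI-style tools array."""
    return [t["function"]["name"] for t in tools if t.get("type") == "function"]


print(tool_names(example_tools))
# ['get_weather']
print(json.dumps(example_tools[0]["function"]["parameters"]["required"]))
# ["location"]
```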
 
 
168
 
169
+ ## Design Decisions
170
 
171
+ - **Native Qwen3.5 format:** Training data is formatted with the model's own
172
+ `chat_template.jinja`, so tool calls use the XML parameter format
173
+ (`<function=name><parameter=key>value</parameter></function>`) that the
174
+ inference frameworks above already parse. No custom prompt engineering is needed at deployment.
175
+ - **Tools always visible:** Every training example (including voice-only) sees
176
+ tool schemas in the system prompt, teaching the model when NOT to call tools.
177
+ - **Thinking disabled:** `enable_thinking=False` is used throughout training and
178
+ inference, which avoids reasoning loops on a 2B model and keeps voice responses
179
+ fast. For Qwen3.5 0.8B/2B/4B/9B, thinking is disabled by default.
180
+ - **Voice-first responses:** All non-tool assistant responses are filtered for
181
+ conciseness (20-400 chars) and conversational tone (no markdown, lists,
182
+ or code).
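The voice-first filter in the last bullet can be approximated with a length check plus a markdown heuristic. The sketch below is an assumption about how such a filter might look, not the filter actually used to build the dataset; `is_voice_friendly` and the regular expression are hypothetical.

```python
import re

# Heuristic markers of markdown, lists, or code that do not suit spoken output.
MARKDOWN_RE = re.compile(
    r"```"                      # fenced code
    r"|`[^`]+`"                 # inline code
    r"|^\s*#{1,6}\s"            # headings
    r"|^\s*[-*+]\s"             # bullet lists
    r"|^\s*\d+[.)]\s"           # numbered lists
    r"|\*\*"                    # bold
    r"|\[[^\]]+\]\([^)]+\)",    # links
    re.MULTILINE,
)


def is_voice_friendly(text, min_chars=20, max_chars=400):
    """Return True if a response is short enough and free of markup."""
    text = text.strip()
    if not (min_chars <= len(text) <= max_chars):
        return False
    return MARKDOWN_RE.search(text) is None


print(is_voice_friendly("It's sunny and 75 degrees in Austin right now."))    # True
print(is_voice_friendly("Sure."))                                             # False
print(is_voice_friendly("Here are your options:\n- Jazz\n- Rock\n- Classical"))  # False
```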