Igriscodes committed on
Commit
f91cd5f
·
verified ·
1 Parent(s): fb10d84

Update README.md

Files changed (1)
  1. README.md +75 -0
README.md CHANGED
@@ -1,3 +1,78 @@
1
  ---
2
  license: mpl-2.0
3
  ---
1
  ---
2
  license: mpl-2.0
3
+ base_model: Qwen/Qwen3-1.7B
4
+ tags:
5
+ - tool-use
6
+ - function-calling
7
+ - reinforcement-learning
8
+ - ppo
9
+ - mcp
10
+ - trl
11
+ - peft
12
+ - low-rank-adaptation
13
+ model_creator: Igriscodes
14
+ pipeline_tag: text-generation
15
+ language:
16
+ - en
17
+ metrics:
18
+ - reward
19
  ---
20
+
21
+ # qwen-tool
22
+
23
+ This model is a fine-tuned version of `Qwen/Qwen3-1.7B`, optimized for complex function calling and multi-step tool use via the **Model Context Protocol (MCP)**.
24
+
25
+ The model was aligned using **Proximal Policy Optimization (PPO)** in a closed-loop agentic environment. Execution feedback from an MCP server serves as the reward signal, training the model to reduce hallucinated tool calls, adhere to strict JSON formatting, and self-correct in response to execution errors.
26
+
27
+ ## Model Details
28
+
29
+ - **Developed by:** [Igriscodes](https://github.com/Igriscodes)
30
+ - **Base Model:** `Qwen/Qwen3-1.7B`
31
+ - **License:** Mozilla Public License 2.0 (MPL 2.0)
32
+ - **Training Framework:** Hugging Face `trl` & `peft` (LoRA)
33
+ - **Alignment Method:** PPO (Proximal Policy Optimization) with Execution-Based Reward Guidance
34
+
35
+ ## Intended Uses & Limitations
36
+
37
+ ### Intended Use Cases
38
+ - **Structured Tool Calling:** Interfacing natively with Model Context Protocol (MCP) servers.
39
+ - **Multi-step Agentic Tasks:** Iterative problem-solving across math, web search, database queries, and data processing.
40
+ - **Error-Resilient Agents:** Handling tool-execution errors gracefully by rewriting the tool-call payload in response to exceptions returned by the environment.
41
+
42
+ ---
43
+
44
+ ## Training Architecture & Alignment Loop
45
+
46
+ The model was trained as the **Policy (Actor)** within a custom `gymnasium` environment (`MCPGymEnv`). The environment runs an execution loop in which the model's textual outputs are parsed as tool calls and executed against a mock MCP server backend, and the results are fed back to the model.
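
The loop described above can be pictured with a small, dependency-free sketch. This is not the author's `MCPGymEnv`: the class name `MockMCPEnv`, the `{"name": ..., "arguments": ...}` call shape, the status strings, and the `reward_fn` hook are all assumptions made for illustration. It only mirrors the `reset`/`step` return shapes of the `gymnasium` API without importing the library.

```python
import json
from typing import Any, Callable, Dict, Optional, Tuple


class MockMCPEnv:
    """Hypothetical gymnasium-style text environment over an in-process tool table."""

    def __init__(self, tools: Dict[str, Callable[..., Any]], task: str, expected: Any,
                 reward_fn: Optional[Callable[[str], float]] = None, max_steps: int = 4):
        self.tools = tools          # stands in for the mock MCP server
        self.task = task            # natural-language task shown to the policy
        self.expected = expected    # result that marks the task as solved
        self.reward_fn = reward_fn or (lambda status: 0.0)  # reward shaping left out here
        self.max_steps = max_steps
        self.steps = 0

    def reset(self, seed: Optional[int] = None, options: Optional[dict] = None):
        self.steps = 0
        return self.task, {}

    def step(self, action: str) -> Tuple[str, float, bool, bool, dict]:
        self.steps += 1
        terminated = False
        try:
            call = json.loads(action)  # the policy's text is expected to be a JSON tool call
            result = self.tools[call["name"]](**call["arguments"])
        except Exception as exc:
            # The error text becomes the next observation so the policy can retry.
            status = "error"
            observation = json.dumps({"error": f"{type(exc).__name__}: {exc}"})
        else:
            terminated = result == self.expected
            status = "success" if terminated else "tool_execution"
            observation = json.dumps({"result": result})
        truncated = not terminated and self.steps >= self.max_steps
        return observation, self.reward_fn(status), terminated, truncated, {"status": status}


# Scripted rollout: a call with a missing argument, then a corrected call.
env = MockMCPEnv({"add": lambda a, b: a + b}, task="Add 2 and 3.", expected=5)
prompt, _ = env.reset()
scripted_actions = [
    '{"name": "add", "arguments": {"a": 2}}',
    '{"name": "add", "arguments": {"a": 2, "b": 3}}',
]
transcript = []
for action in scripted_actions:
    observation, reward, terminated, truncated, info = env.step(action)
    transcript.append((info["status"], observation))
    if terminated or truncated:
        break
```

In a real training run the scripted actions would be replaced by text sampled from the policy, with each observation appended to the prompt for the next turn.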
47
+
48
+
49
+ ### Reward Specification Matrix
50
+
51
+ The PPO agent was optimized against a dense reward scheme driven by execution feedback:
52
+
53
+ | Trigger Status | Reward | Evaluation Logic |
54
+ | :--- | :--- | :--- |
55
+ | **Success** | `+10.0` | Tool executed cleanly; returned data matches the expected task state. |
56
+ | **Tool Execution** | `0.0` | Tool ran successfully, but the overarching objective is incomplete. |
57
+ | **Tool Error** | `-0.5` | Target tool was invoked, but threw a runtime exception (e.g., bad arguments). |
58
+ | **Invalid JSON** | `-0.8` | Failed to output a syntactically valid JSON tool-call schema. |
59
+ | **Structural Fail** | `-1.0` | Severe divergence from agentic system instructions or tool hallucination. |
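
As a rough illustration of how the table could translate into code, the sketch below classifies one model output and looks up its reward. The reward values come from the table; everything else is an assumption: the function name `score_tool_call`, the `{"name": ..., "arguments": ...}` call shape, the status keys, and the rule that an unknown tool or a non-object payload counts as a structural failure. The actual scoring logic used in training is not shown in this README.

```python
import json
from typing import Any, Callable, Dict, Tuple

# Reward values taken from the table above; the key names are illustrative.
REWARDS = {
    "success": 10.0,
    "tool_execution": 0.0,
    "tool_error": -0.5,
    "invalid_json": -0.8,
    "structural_fail": -1.0,
}


def score_tool_call(raw_output: str, tools: Dict[str, Callable[..., Any]],
                    expected: Any) -> Tuple[str, float]:
    """Classify a single model output and return (status, reward)."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return "invalid_json", REWARDS["invalid_json"]

    # Assumption: valid JSON that is not a call to a known tool is a structural failure.
    if not isinstance(call, dict) or call.get("name") not in tools:
        return "structural_fail", REWARDS["structural_fail"]
    arguments = call.get("arguments", {})
    if not isinstance(arguments, dict):
        return "structural_fail", REWARDS["structural_fail"]

    try:
        result = tools[call["name"]](**arguments)
    except Exception:
        # The tool was reached but raised, e.g. because of bad arguments.
        return "tool_error", REWARDS["tool_error"]

    status = "success" if result == expected else "tool_execution"
    return status, REWARDS[status]


tools = {"add": lambda a, b: a + b}
good = score_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}', tools, expected=5)
partial = score_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}', tools, expected=6)
bad_args = score_tool_call('{"name": "add", "arguments": {"a": 2}}', tools, expected=5)
not_json = score_tool_call("add(2, 3)", tools, expected=5)
unknown = score_tool_call('{"name": "delete_all", "arguments": {}}', tools, expected=5)
```

The ordering of the checks matters: JSON validity is tested before tool existence, so a malformed payload is never penalized as a hallucinated tool.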
60
+
61
+ ### Hyperparameters & Efficiency Stack
62
+ - **Quantization:** 4-bit NormalFloat (NF4) via `bitsandbytes` (for base model loading).
63
+ - **PEFT Adaptation:** LoRA targeted all linear layers (`q_proj`, `v_proj`, `k_proj`, `o_proj`, etc.).
64
+ - **Memory Optimization:** 8-bit Paged AdamW optimizer and gradient checkpointing to contain the memory footprint of the Actor-Critic-Reference model triplet, alongside parallel rollout sampling.
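
For orientation, the settings listed above map onto a handful of keyword arguments in the Hugging Face stack. The sketch below collects them as plain dictionaries so it runs without any library installed; the grouping and variable names are illustrative, and values the README does not state (LoRA rank, alpha, learning rate, batch sizes) are deliberately left out rather than guessed.

```python
# Keyword arguments for transformers.BitsAndBytesConfig (4-bit NF4 base-model loading).
quantization_kwargs = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
}

# Keyword arguments for peft.LoraConfig; "all-linear" is peft's shortcut for
# targeting every linear layer (q_proj, k_proj, v_proj, o_proj, and the rest).
lora_kwargs = {
    "target_modules": "all-linear",
}

# Keyword arguments for transformers.TrainingArguments-style training configs.
training_kwargs = {
    "optim": "paged_adamw_8bit",
    "gradient_checkpointing": True,
}

# Assumed usage, given that transformers and peft are installed:
#   BitsAndBytesConfig(**quantization_kwargs)
#   LoraConfig(**lora_kwargs)
#   TrainingArguments(output_dir="out", **training_kwargs)
```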
65
+
66
+ ---
67
+
68
+ ## Acknowledgements
69
+
70
+ We express our gratitude to the following organizations, communities, and tools that made this project possible:
71
+
72
+ * **[Qwen (Alibaba Cloud)](https://github.com/QwenLM/Qwen)** - For providing the foundational **Qwen3** model weights and architecture.
73
+ * **[Hugging Face](https://huggingface.co/)** - For the incredible ecosystem and libraries used to load, manage, and train the model.
74
+ * **[PyTorch](https://pytorch.org/)** - For the robust deep learning framework that powered the underlying tensor computations and GPU acceleration during fine-tuning.
75
+ * **[Google Gemini 3](https://geminicli.com/)** - For providing assistance in optimizing and debugging the fine-tuning scripts.
76
+
77
+ ## License
78
+ [Mozilla Public License Version 2.0](https://github.com/Igriscodes/qwen-tool/blob/main/LICENSE) - Feel free to use and modify.