---
license: mpl-2.0
base_model: Qwen/Qwen3-1.7B
tags:
- tool-use
- function-calling
- reinforcement-learning
- ppo
- mcp
- trl
- peft
- low-rank-adaptation
model_creator: Igriscodes
pipeline_tag: text-generation
language:
- en
metrics:
- reward
---

# qwen-tool

This model is a fine-tuned version of `Qwen/Qwen3-1.7B`, optimized for complex function calling and multi-step tool use via the **Model Context Protocol (MCP)**.

The model was aligned using **Proximal Policy Optimization (PPO)** in a closed-loop agentic environment. Training uses execution-based feedback from an MCP server, with the aim of reducing tool hallucinations, enforcing strict JSON formatting, and teaching the model to self-correct from execution error states.

## Model Details

- **Developed by:** [Igriscodes](https://github.com/Igriscodes)
- **Base Model:** `Qwen/Qwen3-1.7B`
- **License:** Mozilla Public License 2.0 (MPL 2.0)
- **Training Framework:** Hugging Face `trl` & `peft` (LoRA)
- **Alignment Method:** PPO (Proximal Policy Optimization) with Execution-Based Reward Guidance

## Intended Uses & Limitations

### Intended Use Cases

- **Structured Tool Calling:** Interfacing natively with Model Context Protocol (MCP) servers.
- **Multi-step Agentic Tasks:** Iterative problem-solving across math, web searching, database queries, and data processing.
- **Error-Resilient Agents:** Handling tool-execution errors gracefully by rewriting the call payload in response to exceptions raised by the environment.

---

## Training Architecture & Alignment Loop

The model was trained as the **Policy (Actor)** within a custom `gymnasium` environment (`MCPGymEnv`). The environment runs an execution loop between the model's textual outputs and a backend mock MCP server.

### Reward Specification Matrix

The PPO agent was optimized against a dense, execution-driven reward signal:

| Trigger Status | Reward | Evaluation Logic |
| :--- | :--- | :--- |
| **Success** | `+10.0` | Tool executed cleanly; returned data matches the expected task state. |
| **Tool Execution** | `0.0` | Tool ran successfully, but the overarching objective is incomplete. |
| **Tool Error** | `-0.5` | The target tool was invoked but raised a runtime exception (e.g., bad arguments). |
| **Invalid JSON** | `-0.8` | The model failed to output a syntactically valid JSON tool call. |
| **Structural Fail** | `-1.0` | Severe divergence from the agentic system instructions, or tool hallucination. |

### Hyperparameters & Efficiency Stack

- **Quantization:** 4-bit NormalFloat (NF4) via `bitsandbytes` (for base model loading).
- **PEFT Adaptation:** LoRA applied to all linear layers (`q_proj`, `v_proj`, `k_proj`, `o_proj`, etc.).
- **Memory Optimization:** 8-bit Paged AdamW optimizer, gradient checkpointing, and parallel rollout sampling, used to keep the combined footprint of the actor, critic, and reference models manageable.

---

## Acknowledgements

We express our gratitude to the following organizations, communities, and tools that made this project possible:

* **[Qwen (Alibaba Cloud)](https://github.com/QwenLM/Qwen)** - For providing the foundational **Qwen3** model weights and architecture.
* **[Hugging Face](https://huggingface.co/)** - For the ecosystem and libraries used to load, manage, and train the model.
* **[PyTorch](https://pytorch.org/)** - For the deep learning framework that powered the underlying tensor computations and GPU acceleration during fine-tuning.
* **[Google Gemini 3](https://geminicli.com/)** - For assistance in optimizing and debugging the fine-tuning scripts.

## License

[Mozilla Public License Version 2.0](https://github.com/Igriscodes/qwen-tool/blob/main/LICENSE) - Feel free to use and modify.
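
## Appendix: Reward Mapping Sketch

As an illustration of the Reward Specification Matrix, the following is a minimal sketch of how a single model output might be mapped to one of the five reward values. It is not the actual `MCPGymEnv` implementation: the `score_step` function, the `{"name": ..., "arguments": ...}` tool-call shape, the plain-callable tool registry, and the order of the checks are all assumptions made for this example.

```python
import json
from typing import Any, Callable, Dict

# Reward values taken from the Reward Specification Matrix.
REWARD_SUCCESS = 10.0
REWARD_TOOL_EXECUTION = 0.0
REWARD_TOOL_ERROR = -0.5
REWARD_INVALID_JSON = -0.8
REWARD_STRUCTURAL_FAIL = -1.0


def score_step(
    model_output: str,
    tools: Dict[str, Callable[..., Any]],
    expected: Any,
) -> float:
    """Map one model output to a reward (hypothetical helper, not the real env)."""
    # Invalid JSON: the output cannot be parsed at all.
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return REWARD_INVALID_JSON

    # Structural fail: valid JSON, but not a tool call in the assumed shape,
    # or a call to a tool that does not exist (tool hallucination).
    if not isinstance(call, dict):
        return REWARD_STRUCTURAL_FAIL
    name = call.get("name")
    args = call.get("arguments", {})
    if not isinstance(name, str) or name not in tools or not isinstance(args, dict):
        return REWARD_STRUCTURAL_FAIL

    # Tool error: the tool was reached but raised at runtime (e.g. bad arguments).
    try:
        result = tools[name](**args)
    except Exception:
        return REWARD_TOOL_ERROR

    # Success if the result matches the expected task state; otherwise the tool
    # ran cleanly but the objective is still incomplete.
    return REWARD_SUCCESS if result == expected else REWARD_TOOL_EXECUTION


if __name__ == "__main__":
    # Stand-in tool registry; a real setup would dispatch to an MCP server.
    demo_tools = {"add": lambda a, b: a + b}
    good_call = '{"name": "add", "arguments": {"a": 2, "b": 3}}'
    print(score_step(good_call, demo_tools, expected=5))
    print(score_step("not json", demo_tools, expected=5))
```

In this sketch the checks run from least to most specific (parse, then structure, then execution, then task state), so each output receives exactly one reward. Whether the real environment orders or combines the checks this way is not stated in this card.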