---
license: mpl-2.0
base_model: Qwen/Qwen3-1.7B
tags:
- tool-use
- function-calling
- reinforcement-learning
- ppo
- mcp
- trl
- peft
- low-rank-adaptation
model_creator: Igriscodes
pipeline_tag: text-generation
language:
- en
metrics:
- reward
---

# qwen-tool

This model is a fine-tuned version of `Qwen/Qwen3-1.7B`, optimized for complex function calling and multi-step tool use via the **Model Context Protocol (MCP)**.

The model was aligned using **Proximal Policy Optimization (PPO)** in a closed-loop agentic environment. Execution-based feedback from an MCP server serves as the training signal, with the goal of reducing tool hallucinations, enforcing strict JSON formatting, and teaching the model to self-correct after execution errors.

## Model Details

- **Developed by:** [Igriscodes](https://github.com/Igriscodes)
- **Base Model:** `Qwen/Qwen3-1.7B`
- **License:** Mozilla Public License 2.0 (MPL 2.0)
- **Training Framework:** Hugging Face `trl` & `peft` (LoRA)
- **Alignment Method:** PPO (Proximal Policy Optimization) with Execution-Based Reward Guidance

## Intended Uses & Limitations

### Intended Use Cases
- **Structured Tool Calling:** Interfacing natively with Model Context Protocol (MCP) servers.
- **Multi-step Agentic Tasks:** Iterative problem-solving across math, web searching, database queries, and data processing.
- **Error-Resilient Agents:** Handling tool-execution errors gracefully by rewriting the tool-call payload based on the exception returned by the environment.

---

## Training Architecture & Alignment Loop

The model was trained as the **Policy (Actor)** within a custom `gymnasium` environment (`MCPGymEnv`). The environment runs an execution loop between the model's text outputs and a mock MCP server backend.
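
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual `MCPGymEnv`: the class name `MockMCPEnv`, the `{"name": ..., "arguments": ...}` call format, the three coarse status strings, and the injected `score` callable are all assumptions made for the example. It mirrors the `gymnasium` `reset`/`step` return shapes in plain Python so it runs without extra dependencies.

```python
import json
from typing import Any, Callable, Dict, Tuple


class MockMCPEnv:
    """Gymnasium-style episode loop (hypothetical stand-in, not the real MCPGymEnv)."""

    def __init__(self, task: str, tools: Dict[str, Callable[..., Any]],
                 expected: Any, score: Callable[[str], float], max_steps: int = 4):
        self.task, self.tools, self.expected = task, tools, expected
        self.score, self.max_steps = score, max_steps
        self.steps = 0
        self.transcript = []

    def reset(self) -> Tuple[str, dict]:
        # Start a fresh episode; the observation is the transcript so far.
        self.steps = 0
        self.transcript = [f"TASK: {self.task}"]
        return "\n".join(self.transcript), {}

    def step(self, action: str) -> Tuple[str, float, bool, bool, dict]:
        self.steps += 1
        status, feedback = self._execute(action)
        # Feed the tool result or error text back so the policy can self-correct.
        self.transcript += [f"MODEL: {action}", f"ENV: {feedback}"]
        terminated = status == "success"
        truncated = not terminated and self.steps >= self.max_steps
        obs = "\n".join(self.transcript)
        return obs, self.score(status), terminated, truncated, {"status": status}

    def _execute(self, action: str) -> Tuple[str, str]:
        try:
            call = json.loads(action)
            result = self.tools[call["name"]](**call["arguments"])
        except Exception as exc:  # malformed JSON, unknown tool, or bad arguments
            return "error", f"{type(exc).__name__}: {exc}"
        status = "success" if result == self.expected else "incomplete"
        return status, json.dumps(result)


# Toy episode: one failed call, then a corrected one.
demo_env = MockMCPEnv(
    task="Add 2 and 3",
    tools={"add": lambda a, b: a + b},  # stands in for the mock MCP server
    expected=5,
    score={"success": 10.0, "incomplete": 0.0, "error": -0.5}.get,
)
demo_env.reset()
print(demo_env.step('{"name": "add", "arguments": {"a": 2}}')[4])
print(demo_env.step('{"name": "add", "arguments": {"a": 2, "b": 3}}')[4])
```

Returning the error text in the observation is what lets a policy condition its next call on the failure, which is the self-correction behavior the training setup targets.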


### Reward Specification Matrix

The PPO agent was optimized against a dense, execution-based reward function:

| Trigger Status | Reward | Evaluation Logic |
| :--- | :--- | :--- |
| **Success** | `+10.0` | Tool executed cleanly; returned data matches the expected task state. |
| **Tool Execution** | `0.0` | Tool ran successfully, but the overarching objective is incomplete. |
| **Tool Error** | `-0.5` | The target tool was called but raised a runtime exception (e.g., bad arguments). |
| **Invalid JSON** | `-0.8` | Failed to output a syntactically valid JSON tool-call schema. |
| **Structural Fail** | `-1.0` | Severe divergence from agentic system instructions or tool hallucination. |
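
As a rough illustration, the table can be read as a classification step followed by a lookup. The sketch below is one possible interpretation, not the training code: the `{"name": ..., "arguments": ...}` call format, the function names `classify` and `reward`, and the order of the checks are assumptions. Only the five reward values come from the table.

```python
import json
from enum import Enum
from typing import Any, Callable, Dict


class Status(Enum):
    SUCCESS = "success"
    TOOL_EXECUTION = "tool_execution"
    TOOL_ERROR = "tool_error"
    INVALID_JSON = "invalid_json"
    STRUCTURAL_FAIL = "structural_fail"


# Reward values taken from the table above.
REWARDS = {
    Status.SUCCESS: 10.0,
    Status.TOOL_EXECUTION: 0.0,
    Status.TOOL_ERROR: -0.5,
    Status.INVALID_JSON: -0.8,
    Status.STRUCTURAL_FAIL: -1.0,
}


def classify(output: str, tools: Dict[str, Callable[..., Any]], expected: Any) -> Status:
    """Map one model output to a trigger status (assumed check order)."""
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return Status.INVALID_JSON
    # Valid JSON that is not a well-formed call, or names an unknown tool,
    # is treated here as a structural failure (tool hallucination).
    if not isinstance(call, dict):
        return Status.STRUCTURAL_FAIL
    name, args = call.get("name"), call.get("arguments", {})
    if not isinstance(name, str) or name not in tools or not isinstance(args, dict):
        return Status.STRUCTURAL_FAIL
    try:
        result = tools[name](**args)
    except Exception:  # the tool was reached but raised at runtime
        return Status.TOOL_ERROR
    return Status.SUCCESS if result == expected else Status.TOOL_EXECUTION


def reward(output: str, tools: Dict[str, Callable[..., Any]], expected: Any) -> float:
    return REWARDS[classify(output, tools, expected)]
```

For example, with a toy registry `{"add": lambda a, b: a + b}` and an expected result of `5`, a call with a missing argument would land in the Tool Error row, while a call to an unregistered tool would land in the Structural Fail row.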

### Hyperparameters & Efficiency Stack
- **Quantization:** 4-bit NormalFloat (NF4) via `bitsandbytes` (for base model loading).
- **PEFT Adaptation:** LoRA targeted all linear layers (`q_proj`, `v_proj`, `k_proj`, `o_proj`, etc.).
- **Memory Optimization:** 8-bit Paged AdamW optimizer, gradient checkpointing, and parallel rollout sampling, used to keep the combined memory footprint of the actor, critic, and reference models manageable.
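
A configuration along these lines would match the stack listed above. It is a sketch only: the LoRA rank, alpha, and dropout, the compute dtype, and the use of double quantization are assumed values that this card does not state, and `target_modules="all-linear"` is the `peft` shorthand used here in place of an explicit module list.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization for loading the base model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: not stated in this card
    bnb_4bit_use_double_quant=True,         # assumption: not stated in this card
)

# LoRA adapters on all linear layers (q_proj, k_proj, v_proj, o_proj, etc.).
lora_config = LoraConfig(
    r=16,               # assumption: rank is not stated in this card
    lora_alpha=32,      # assumption
    lora_dropout=0.05,  # assumption
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

# The remaining memory savings would typically be enabled at training time,
# e.g. model.gradient_checkpointing_enable() and a paged 8-bit AdamW optimizer
# such as bitsandbytes.optim.PagedAdamW8bit.
```

`bnb_config` would be passed as `quantization_config` to `AutoModelForCausalLM.from_pretrained`, and `lora_config` to `peft.get_peft_model` or to the `trl` trainer.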

---

## Acknowledgements

We express our gratitude to the following organizations, communities, and tools that made this project possible:

*   **[Qwen (Alibaba Cloud)](https://github.com/QwenLM/Qwen)** - For providing the foundational **Qwen3** model weights and architecture.
*   **[Hugging Face](https://huggingface.co/)** - For the incredible ecosystem and libraries used to load, manage, and train the model.
*   **[PyTorch](https://pytorch.org/)** - For the robust deep learning framework that powered the underlying tensor computations and GPU acceleration during fine-tuning.
*   **[Google Gemini 3](https://geminicli.com/)** - For providing assistance in optimizing and debugging the fine-tuning scripts.

## License
[Mozilla Public License Version 2.0](https://github.com/Igriscodes/qwen-tool/blob/main/LICENSE) - Feel free to use and modify.