Text Generation
PEFT
Safetensors
English
qwen3
tool-use
function-calling
reinforcement-learning
ppo
mcp
trl
low-rank-adaptation
Update README.md
README.md
CHANGED
@@ -1,3 +1,78 @@
---
license: mpl-2.0
base_model: Qwen/Qwen3-1.7B
tags:
- tool-use
- function-calling
- reinforcement-learning
- ppo
- mcp
- trl
- peft
- low-rank-adaptation
model_creator: Igriscodes
pipeline_tag: text-generation
language:
- en
metrics:
- reward
---

# qwen-tool

This model is a fine-tuned version of `Qwen/Qwen3-1.7B`, optimized for complex function calling and multi-step tool use via the **Model Context Protocol (MCP)**.

The model was aligned using **Proximal Policy Optimization (PPO)** in a closed-loop agentic environment. It uses execution-based feedback from an MCP server to reduce tool hallucinations, adhere to strict JSON formatting, and self-correct based on execution error states.

## Model Details

- **Developed by:** [Igriscodes](https://github.com/Igriscodes)
- **Base Model:** `Qwen/Qwen3-1.7B`
- **License:** Mozilla Public License 2.0 (MPL 2.0)
- **Training Framework:** Hugging Face `trl` and `peft` (LoRA)
- **Alignment Method:** PPO (Proximal Policy Optimization) with Execution-Based Reward Guidance

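
The card does not include a usage snippet. The following is a minimal sketch of loading the LoRA adapter on top of the base model with `peft`. It assumes the adapter is published as `Igriscodes/qwen3-1.7b-tool` (the repository id shown on the Hub page) and that `transformers` and `peft` are installed; the helper name `load_tool_model` is hypothetical.

```python
BASE_MODEL_ID = "Qwen/Qwen3-1.7B"          # base model named in the card
ADAPTER_ID = "Igriscodes/qwen3-1.7b-tool"  # assumption: Hub repo holding the LoRA adapter


def load_tool_model(base_id=BASE_MODEL_ID, adapter_id=ADAPTER_ID):
    """Load the base model, then attach the LoRA adapter for inference."""
    # Imports are deferred so this module can be read without the heavy dependencies.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base_model = AutoModelForCausalLM.from_pretrained(base_id)
    model = PeftModel.from_pretrained(base_model, adapter_id)
    model.eval()  # inference only: disables dropout
    return tokenizer, model
```

Calling `tokenizer, model = load_tool_model()` downloads both repositories on first use, so it needs network access and enough memory for the 1.7B base model.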

## Intended Uses & Limitations

### Intended Use Cases

- **Structured Tool Calling:** Interfacing natively with Model Context Protocol (MCP) servers.
- **Multi-step Agentic Tasks:** Iterative problem-solving across math, web search, database queries, and data processing.
- **Error-Resilient Agents:** Recovering from tool-execution errors by rewriting the tool-call payload based on the exception the environment returns.

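
The error-resilient use case can be sketched as a retry loop that feeds the execution error back into the next prompt. This is an illustration only: `generate` stands in for the model, `execute` for an MCP client, and the `name`/`arguments` keys are an assumed tool-call layout that the card does not specify.

```python
import json


def call_with_retries(generate, execute, task, max_attempts=3):
    """Ask for a tool call, run it, and retry with the error appended on failure.

    generate(prompt) -> str        hypothetical stand-in for the model
    execute(name, arguments) -> any  hypothetical stand-in for an MCP client
    Returns (result, attempts_used).
    """
    prompt = task
    for attempt in range(1, max_attempts + 1):
        output = generate(prompt)
        try:
            call = json.loads(output)
            return execute(call["name"], call["arguments"]), attempt
        except Exception as exc:  # malformed JSON, missing keys, or a tool failure
            # Show the model its previous call and the error so it can correct the payload.
            prompt = (
                f"{prompt}\nPrevious call: {output}\n"
                f"Error: {type(exc).__name__}: {exc}"
            )
    raise RuntimeError(f"no successful tool call after {max_attempts} attempts")
```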

---

## Training Architecture & Alignment Loop

The model was trained as the **Policy (Actor)** within a custom `gymnasium` environment (`MCPGymEnv`). The environment runs an execution loop between the model's text outputs and a mock MCP server backend.

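
The internals of `MCPGymEnv` are not published in this card, so the following is only a toy sketch of such a loop. It mirrors the Gymnasium `reset`/`step` return shapes without importing `gymnasium`; the class names, the single `add` tool, and the placeholder reward are all hypothetical.

```python
import json


class MockMCPServer:
    """Hypothetical stand-in for the mock MCP backend; exposes one tool, `add`."""

    def call(self, name, arguments):
        if name != "add":
            raise KeyError(f"unknown tool: {name}")
        # Missing or unexpected arguments raise TypeError, like a failed tool run.
        return self._add(**arguments)

    @staticmethod
    def _add(a, b):
        return a + b


class ToyMCPEnv:
    """Gymnasium-style environment: actions are the policy's raw text outputs."""

    def __init__(self, server, target, max_steps=4, reward_fn=None):
        self.server = server
        self.target = target
        self.max_steps = max_steps
        # Placeholder reward; a real setup would plug in its own scheme here.
        self.reward_fn = reward_fn or (lambda status: 1.0 if status == "success" else 0.0)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return f"Task: reach {self.target} with the `add` tool.", {}

    def step(self, action_text):
        self.steps += 1
        try:
            call = json.loads(action_text)
            result = self.server.call(call["name"], call["arguments"])
            status = "success" if result == self.target else "tool_execution"
            observation = f"result: {result}"
        except Exception as exc:
            # The error text becomes the next observation so the policy can self-correct.
            status = "error"
            observation = f"error: {type(exc).__name__}: {exc}"
        terminated = status == "success"
        truncated = not terminated and self.steps >= self.max_steps
        return observation, self.reward_fn(status), terminated, truncated, {"status": status}
```

Each `step` returns the tool result or the error message as the observation, which is what lets the next model turn react to a failed call.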


### Reward Specification Matrix

The PPO agent was optimized against a dense, execution-based reward scheme:

| Trigger Status | Reward | Evaluation Logic |
| :--- | :--- | :--- |
| **Success** | `+10.0` | Tool executed cleanly; returned data matches the expected task state. |
| **Tool Execution** | `0.0` | Tool ran successfully, but the overall objective is not yet complete. |
| **Tool Error** | `-0.5` | The target tool was called but threw a runtime exception (e.g., bad arguments). |
| **Invalid JSON** | `-0.8` | Output was not a syntactically valid JSON tool call. |
| **Structural Fail** | `-1.0` | Severe divergence from the agentic system instructions, or a hallucinated tool. |

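
One way to read the table is as a classifier over each model output followed by a lookup. The sketch below uses the reward values from the table; the classification rules (for example, treating an unknown tool name as a structural failure) and the `name`/`arguments` layout are assumptions, not the author's published logic.

```python
import json

# Reward values copied from the table above.
REWARDS = {
    "success": 10.0,
    "tool_execution": 0.0,
    "tool_error": -0.5,
    "invalid_json": -0.8,
    "structural_fail": -1.0,
}


def classify_step(output_text, known_tools, run_tool, goal_reached):
    """Map one model output to a trigger status (assumed rules)."""
    try:
        call = json.loads(output_text)
    except json.JSONDecodeError:
        return "invalid_json"
    # Valid JSON that is not a call to a known tool counts as a structural failure.
    if not isinstance(call, dict):
        return "structural_fail"
    name, arguments = call.get("name"), call.get("arguments")
    if not isinstance(name, str) or name not in known_tools or not isinstance(arguments, dict):
        return "structural_fail"
    try:
        result = run_tool(name, arguments)
    except Exception:
        return "tool_error"
    return "success" if goal_reached(result) else "tool_execution"


def score_step(output_text, known_tools, run_tool, goal_reached):
    """Return the scalar reward for one model output."""
    return REWARDS[classify_step(output_text, known_tools, run_tool, goal_reached)]
```

Under this ordering a malformed output is penalized less than a well-formed call to a nonexistent tool, which matches the table's ranking of `-0.8` versus `-1.0`.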

### Hyperparameters & Efficiency Stack

- **Quantization:** 4-bit NormalFloat (NF4) via `bitsandbytes` (for base model loading).
- **PEFT Adaptation:** LoRA applied to all linear layers (`q_proj`, `v_proj`, `k_proj`, `o_proj`, etc.).
- **Memory Optimization:** 8-bit Paged AdamW optimizer, gradient checkpointing, and parallel rollout sampling to manage the combined memory footprint of the actor, critic, and reference models.

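
The stack above could be expressed roughly as follows with `transformers` and `peft`. The card does not state the LoRA rank, alpha, or compute dtype, so the values below are placeholders, and `build_training_configs` is a hypothetical helper. `target_modules="all-linear"` is PEFT's shorthand for adapting every linear layer; whether the author used it or an explicit module list is not stated.

```python
# Keyword arguments in the style of transformers.TrainingArguments; whether the
# author's PPO configuration accepted exactly these names is an assumption.
TRAINING_KWARGS = {
    "optim": "paged_adamw_8bit",     # 8-bit Paged AdamW
    "gradient_checkpointing": True,  # trade recomputation for activation memory
}


def build_training_configs(lora_rank=16, lora_alpha=32):
    """Return (quantization config, LoRA config). Rank and alpha are placeholders."""
    # Deferred imports keep this module importable without the GPU stack installed.
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: compute dtype not stated
    )
    lora_config = LoraConfig(
        r=lora_rank,
        lora_alpha=lora_alpha,
        target_modules="all-linear",
        task_type="CAUSAL_LM",
    )
    return quant_config, lora_config
```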

---

## Acknowledgements

We express our gratitude to the following organizations, communities, and tools that made this project possible:

* **[Qwen (Alibaba Cloud)](https://github.com/QwenLM/Qwen)** - For providing the foundational **Qwen3** model weights and architecture.
* **[Hugging Face](https://huggingface.co/)** - For the ecosystem and libraries used to load, manage, and train the model.
* **[PyTorch](https://pytorch.org/)** - For the deep learning framework that powered the underlying tensor computations and GPU acceleration during fine-tuning.
* **[Google Gemini 3](https://geminicli.com/)** - For assistance in optimizing and debugging the fine-tuning scripts.

## License

[Mozilla Public License Version 2.0](https://github.com/Igriscodes/qwen-tool/blob/main/LICENSE) - Feel free to use and modify.