---
base_model:
- Qwen/Qwen2.5-7B-Instruct
---
Reproduces the core idea of [AgentFlow](https://arxiv.org/abs/2510.05592): extending single-step LLM inference into a multi-turn **Planner → Executor → Verifier** agent loop, applying RL signals (GRPO) to the Planner's generation trajectory. This allows the model to improve its tool-use and reasoning capabilities without requiring manually annotated intermediate steps.
Code repository:
https://github.com/LMIS-ORG/slime-agentic?tab=readme-ov-file
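The GRPO signal mentioned above scores each rollout relative to the other rollouts sampled for the same question. The snippet below is a minimal sketch of that group-relative normalization only, not the training code from the repository; the function name `group_relative_advantages` and the `eps` parameter are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group.

    `rewards` holds one scalar per rollout sampled for the same question.
    Hypothetical helper for illustration; not taken from the repository.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps keeps the division finite when every rollout gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one question; the judge marked two of them as correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct rollouts receive a positive advantage and incorrect ones a negative advantage, and a group in which every rollout earns the same reward contributes zero advantage.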
#### Architecture
```
Input question
    │
    ▼
Planner.plan()                            ← Analyze the problem and devise a solution strategy (loss_mask=1)
    │
    └─► for step in range(max_steps):
          │
          ├─ Planner.generate_next_step()       ← Select next tool and sub-goal (loss_mask=1)
          ├─ Executor.generate_tool_command()
          │    + execute_command()              ← Invoke tool (excluded from sequence)
          ├─ Verifier.verificate_context()      ← Decide whether to continue (excluded)
          └─ Memory.add_action()                ← Record execution result
    │
    ▼
Planner.generate_final_output()           ← Summarize results and produce final answer (loss_mask=0)
    │
    ▼
Rewarder.compute_reward()                 ← LLM-as-Judge: compare model answer with ground truth
```
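The control flow in the diagram can be sketched in a few lines of Python. This is an illustrative skeleton under stated assumptions, not the repository's implementation: the method names follow the diagram, but the `Stub*` classes, the `Segment` container, the argument lists, and the exact-match `compute_reward` (standing in for the LLM-as-Judge) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    text: str
    loss_mask: int  # 1 = counted in the RL loss, 0 = kept in the sequence but ignored


@dataclass
class Memory:
    actions: list = field(default_factory=list)

    def add_action(self, step, tool, result):
        self.actions.append({"step": step, "tool": tool, "result": result})


class StubPlanner:
    """Hypothetical stand-in; a real Planner would call the policy LLM."""

    def plan(self, question):
        return f"Plan: solve '{question}' with python_coder."

    def generate_next_step(self, question, memory):
        return "Tool: python_coder | Sub-goal: evaluate the expression"

    def generate_final_output(self, question, memory):
        return str(memory.actions[-1]["result"])


class StubExecutor:
    def generate_tool_command(self, next_step):
        return "2 + 3"

    def execute_command(self, command):
        # Toy tool: integer addition only, so no eval() is needed.
        return sum(int(tok) for tok in command.split("+"))


class StubVerifier:
    def verificate_context(self, question, memory, result):
        # True means "keep going"; this stub stops once any result exists.
        return result is None


def rollout(question, planner, executor, verifier, max_steps=3):
    memory = Memory()
    segments = [Segment(planner.plan(question), loss_mask=1)]
    for step in range(max_steps):
        next_step = planner.generate_next_step(question, memory)
        segments.append(Segment(next_step, loss_mask=1))
        # Executor and Verifier outputs are not appended to the sequence.
        command = executor.generate_tool_command(next_step)
        result = executor.execute_command(command)
        should_continue = verifier.verificate_context(question, memory, result)
        memory.add_action(step, "python_coder", result)
        if not should_continue:
            break
    answer = planner.generate_final_output(question, memory)
    segments.append(Segment(answer, loss_mask=0))
    return answer, segments


def compute_reward(answer, ground_truth):
    # Exact match used here as a stand-in for the LLM-as-Judge comparison.
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0


answer, segments = rollout("What is 2 + 3?", StubPlanner(), StubExecutor(), StubVerifier())
reward = compute_reward(answer, "5")
print(answer, [s.loss_mask for s in segments], reward)
```

Only the Planner's planning and step-selection text carries `loss_mask=1`, so the reward from the final comparison is credited to those tokens alone.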
#### Tools (`tools/`)
| Tool | Description |
|---|---|
| `base_generator` | General-purpose text generation tool; answers sub-tasks directly via LLM |
| `python_coder` | Python code generation and execution tool for math computation and algorithmic problem solving |
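To make the `python_coder` row concrete, here is a rough sketch of the "execute" half of such a tool: it runs a generated snippet in a fresh interpreter and captures the output. The function name `run_python_snippet`, the `timeout_s` parameter, and the returned dictionary layout are assumptions for illustration and do not describe the actual tool in `tools/`. A subprocess with a timeout is not a security sandbox.

```python
import subprocess
import sys


def run_python_snippet(code, timeout_s=10.0):
    """Run generated code in a separate interpreter and capture its output.

    Hypothetical helper for illustration; not the repository's tool code.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": f"timed out after {timeout_s}s"}
    return {
        "ok": proc.returncode == 0,
        "stdout": proc.stdout.strip(),
        "stderr": proc.stderr.strip(),
    }


# Example sub-task: sum of the squares 1^2 + ... + 10^2.
result = run_python_snippet("print(sum(k * k for k in range(1, 11)))")
print(result["stdout"])
```

Returning both `stdout` and `stderr` lets a caller record failed executions as well as successful ones.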
#### Results
| Model | Dataset | Baseline | AgentFlow (Ours) | Improvement (absolute) |
|---|---|---|---|---|
| Qwen2.5-7B-Instruct | AIME 2024 | 10.0% | 30.0% | +20.0% |
> **Note:** Due to limited training resources, the AgentFlow model was only trained for 100 steps.