---
base_model:
- Qwen/Qwen2.5-7B-Instruct
---

This model reproduces the core idea of [AgentFlow](https://arxiv.org/abs/2510.05592): it extends single-step LLM inference into a multi-turn **Planner → Executor → Verifier** agent loop and applies an RL signal (GRPO) to the Planner's generation trajectory. The model can then improve its tool use and reasoning without manually annotated intermediate steps.

Code: https://github.com/LMIS-ORG/slime-agentic?tab=readme-ov-file

#### Architecture

```
Input question
    │
    ▼
Planner.plan()                    ← Analyze the problem and devise a solution strategy (loss_mask=1)
    │
    └─► for step in range(max_steps):
            │
            ├─ Planner.generate_next_step()   ← Select next tool and sub-goal (loss_mask=1)
            ├─ Executor.generate_tool_command()
            │       + execute_command()        ← Invoke tool (excluded from sequence)
            ├─ Verifier.verificate_context()   ← Decide whether to continue (excluded)
            └─ Memory.add_action()             ← Record execution result
    │
    ▼
Planner.generate_final_output()   ← Summarize results and produce final answer (loss_mask=0)
    │
    ▼
Rewarder.compute_reward()         ← LLM-as-Judge: compare model answer with ground truth
```

#### Tools (`tools/`)

| Tool | Description |
|---|---|
| `base_generator` | General-purpose text generation tool; answers sub-tasks directly via an LLM |
| `python_coder` | Python code generation and execution tool for math computation and algorithmic problem solving |

#### Results

| Model | Dataset | Baseline | AgentFlow (Ours) | Improvement |
|---|---|---|---|---|
| Qwen2.5-7B-Instruct | AIME 2024 | 10.0% | 30.0% | +20.0 pp |

> **Note:** Due to limited training resources, the AgentFlow model was trained for only 100 steps.
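#### Example: loss-mask assembly (illustrative)

The loss-mask convention in the Architecture diagram can be illustrated with a small, self-contained sketch. It is not the code from the repository: `Trajectory`, `rollout`, and the stub callables are hypothetical names introduced here, and token ids are plain integers. It assumes only what the diagram states: Planner planning and next-step outputs are appended with `loss_mask=1`, Executor and Verifier outputs are kept out of the training sequence, and the final output is appended with `loss_mask=0`.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    """Token ids of the training sequence plus a parallel 0/1 loss mask."""
    tokens: List[int] = field(default_factory=list)
    loss_mask: List[int] = field(default_factory=list)

    def append(self, token_ids: List[int], trainable: bool) -> None:
        # Every appended token gets the same mask value.
        self.tokens.extend(token_ids)
        self.loss_mask.extend([1 if trainable else 0] * len(token_ids))


def rollout(
    question: str,
    plan: Callable[[str], List[int]],
    next_step: Callable[[str, List[str]], List[int]],
    run_tool: Callable[[List[int]], str],
    should_stop: Callable[[List[str], str], bool],
    final_output: Callable[[str, List[str]], List[int]],
    max_steps: int = 3,
) -> Tuple[Trajectory, List[str]]:
    """Hypothetical Planner -> Executor -> Verifier loop (all callables are stubs)."""
    traj = Trajectory()
    memory: List[str] = []

    # Planner.plan(): trained on (loss_mask=1).
    traj.append(plan(question), trainable=True)

    for _ in range(max_steps):
        # Planner.generate_next_step(): trained on (loss_mask=1).
        step_tokens = next_step(question, memory)
        traj.append(step_tokens, trainable=True)

        # Executor output is NOT appended to the training sequence.
        result = run_tool(step_tokens)
        # Verifier decision is NOT appended either.
        stop = should_stop(memory, result)
        # Memory records the execution result for later steps.
        memory.append(result)
        if stop:
            break

    # Final answer is part of the sequence but masked out (loss_mask=0).
    traj.append(final_output(question, memory), trainable=False)
    return traj, memory


# Toy usage with stub components; the integers stand in for token ids.
traj, memory = rollout(
    question="What is 2 + 2?",
    plan=lambda q: [1, 2],
    next_step=lambda q, mem: [3],
    run_tool=lambda step: "r",
    should_stop=lambda mem, res: len(mem) + 1 >= 2,  # stop after two tool calls
    final_output=lambda q, mem: [9, 9],
)
print(traj.tokens)     # [1, 2, 3, 3, 9, 9]
print(traj.loss_mask)  # [1, 1, 1, 1, 0, 0]
```

In this sketch the tool results influence later Planner steps only through `memory`, so they shape the trajectory without contributing tokens to the loss.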