---
base_model:
- Qwen/Qwen2.5-7B-Instruct
---
This model reproduces the core idea of [AgentFlow](https://arxiv.org/abs/2510.05592): extending single-step LLM inference into a multi-turn **Planner → Executor → Verifier** agent loop and applying RL signals (GRPO) to the Planner's generation trajectory. This lets the model improve its tool use and reasoning without manually annotated intermediate steps.
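To make the GRPO signal concrete, the snippet below is a minimal sketch of the group-relative advantage that GRPO is generally built on: several rollouts are sampled for the same question, and each rollout's reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` constant, and the binary reward values are illustrative assumptions; the training code in the repository may normalize differently.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    Hypothetical helper for illustration; not taken from the repository.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical rollouts for one question:
# 1.0 = judged correct, 0.0 = judged incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Correct rollouts get a positive advantage, incorrect ones a negative one.
for reward, adv in zip([1.0, 0.0, 0.0, 1.0], advantages):
    print(f"reward={reward:.1f} advantage={adv:+.3f}")
```

When every rollout in a group receives the same reward, all advantages are zero, so that group contributes no gradient signal.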

Code: https://github.com/LMIS-ORG/slime-agentic?tab=readme-ov-file

#### Architecture

```
Input question
  │
  ▼
Planner.plan()              ← Analyze the problem and devise a solution strategy (loss_mask=1)
  │
  └─► for step in range(max_steps):
        │
        ├─ Planner.generate_next_step()              ← Select next tool and sub-goal (loss_mask=1)
        ├─ Executor.generate_tool_command()
        │  + execute_command()                       ← Invoke tool (excluded from sequence)
        ├─ Verifier.verificate_context()             ← Decide whether to continue (excluded)
        └─ Memory.add_action()                       ← Record execution result
  │
  ▼
Planner.generate_final_output()   ← Summarize results and produce final answer (loss_mask=0)
  │
  ▼
Rewarder.compute_reward()         ← LLM-as-Judge: compare model answer with ground truth
```
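The loop in the diagram can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the repository's implementation: the method names are taken from the diagram, but their signatures, the `Trajectory` container, the stub classes, and the stopping rule are all hypothetical. The point is only to show which segments enter the training sequence and with which `loss_mask`.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Training sequence: only Planner text is recorded, each with a mask."""
    segments: list = field(default_factory=list)  # (text, loss_mask) pairs

    def add(self, text, loss_mask):
        self.segments.append((text, loss_mask))


def run_episode(question, planner, executor, verifier, max_steps=3):
    traj, memory = Trajectory(), []
    traj.add(planner.plan(question), loss_mask=1)
    for _ in range(max_steps):
        step = planner.generate_next_step(question, memory)
        traj.add(step, loss_mask=1)
        # Tool call and verifier output are excluded from the sequence.
        command = executor.generate_tool_command(step)
        result = executor.execute_command(command)
        memory.append((step, result))  # stands in for Memory.add_action()
        if verifier.verificate_context(question, memory):
            break
    answer = planner.generate_final_output(question, memory)
    traj.add(answer, loss_mask=0)  # final answer is in the sequence but masked
    return traj, answer


# Stub components (assumptions) so the sketch runs without a model.
class StubPlanner:
    def plan(self, question):
        return f"Plan for: {question}"

    def generate_next_step(self, question, memory):
        return f"step {len(memory) + 1}: use python_coder"

    def generate_final_output(self, question, memory):
        return f"answer after {len(memory)} step(s)"


class StubExecutor:
    def generate_tool_command(self, step):
        return "print(1 + 1)"

    def execute_command(self, command):
        return "2"


class StubVerifier:
    def verificate_context(self, question, memory):
        return len(memory) >= 2  # stop once two actions are recorded


traj, answer = run_episode("What is 1 + 1?", StubPlanner(), StubExecutor(), StubVerifier())
masks = [mask for _, mask in traj.segments]
print(masks, answer)
```

With these stubs the episode stops after two tool steps, so the trajectory holds the plan, two next-step segments, and the final output.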

#### Tools (`tools/`)

| Tool | Description |
|---|---|
| `base_generator` | General-purpose text generation tool; answers sub-tasks directly via LLM |
| `python_coder` | Python code generation and execution tool for math computation and algorithmic problem solving |

#### Results



| Model | Dataset | Baseline | AgentFlow (Ours) | Improvement |
|---|---|---|---|---|
| Qwen2.5-7B-Instruct | AIME 2024 | 10.0% | 30.0% | +20.0 pp |

> **Note:** Due to limited training resources, the AgentFlow model was trained for only 100 steps.