# ToolOrchestratorEnv — Research Document

**Authors:** Andrew Lara (Franklin and Marshall College); Yashaswi Sharma, Defu Cao, Muyan Weng (University of Southern California)
**Built on:** [SearchEconomicsEnv](https://github.com/sharma-yash01/SearchEconomicsEnv)
**Live environment:** https://huggingface.co/spaces/landrew9/ToolOrchestratorEnv

**Submission blog:** https://huggingface.co/spaces/landrew9/ToolOrchestratorEnv-Blog

**GitHub:** https://github.com/laraandrew/ToolOrchestratorEnv

---

## Table of Contents

1. [The Problem We're Solving](#1-the-problem-were-solving)
2. [Why Reinforcement Learning?](#2-why-reinforcement-learning)
3. [The Environment: How It Works](#3-the-environment-how-it-works)
4. [The Six Tools](#4-the-six-tools)
5. [The Reward Formula — Deep Dive](#5-the-reward-formula--deep-dive)
6. [The Datasets](#6-the-datasets)
7. [Answer Grading](#7-answer-grading)
8. [The Baselines](#8-the-baselines)
9. [Where This Came From: SearchEconomicsEnv](#9-where-this-came-from-searcheconomicsenv)
10. [What a Trained Agent Should Learn](#10-what-a-trained-agent-should-learn)
11. [File-by-File Reference](#11-file-by-file-reference)

---

## 1. The Problem We're Solving

Modern AI agents have access to tools — search engines, calculators, code runners, databases. Every real-world deployment charges for these tools in some way: API fees, latency, rate limits, compute time. But almost every existing RL benchmark treats tools as **free and unlimited**.

This creates a gap between research and reality. An agent trained on "use whatever tools you want" will behave terribly in production where every call costs money.

**ToolOrchestratorEnv closes that gap.**

The agent is given a fixed **budget** (default: 50 cost units) to spend across 10 questions. Every tool call deducts from that budget. The agent must decide:

- *Which tool is worth calling for this question?*
- *How many times should I call tools before just committing an answer?*
- *Is it worth spending 2.0 on an LLM call, or can a 0.1 calculator solve this?*

This is the same tradeoff a human researcher faces every day.

### Current Research Contribution

This submission contributes a complete, deployed OpenEnv-compatible environment rather than a claimed converged policy. The completed work includes:

- A FastAPI/OpenEnv environment with reset, step, health, tool manifest, browser demo, and concurrent session support.
- Six implemented tools with explicit heterogeneous costs: Ceramic search, Wikipedia lookup, calculator, Python executor, LLM reasoning, and commit.
- A deterministic reward implementation with step costs, Exact Match / token F1 answer grading, a quality-gated efficiency bonus, and a shared 10-question episode budget.
- Three reference baselines: random tool selection, cheapest-first routing, and a domain-aware oracle.
- Unit tests covering API behavior, tool behavior, sandbox restrictions, and core integration paths.

We do **not** claim a converged GRPO checkpoint in the current submission. The environment is built so that GRPO training can now test whether learned policies beat the domain oracle on cost-adjusted reward.

We also do **not** include training logs from Env Factory yet. The submission environment requires repeated structured multi-tool calls across each episode, and we were unable to make that multi-tool action flow reliable inside the current Env Factory integration path before the deadline. The planned next step is to continue experimentation through post-training once Env Factory stabilizes for this interaction pattern and more compatible model series are available.

---

## 2. Why Reinforcement Learning?

RL is the right framework here for three reasons:

**Delayed rewards.** You don't know if a tool call was helpful until you commit your answer at the end of a question. The agent must learn to assign credit backwards — "that search 3 steps ago is why I got this right."

**Exploration.** The agent must try different tool combinations to discover which work best per domain. Supervised learning can't teach this because there's no labeled "correct tool sequence" — there are many valid strategies.

**Multi-step planning.** Each episode has 10 questions and a shared budget. A good agent doesn't just optimize one question — it plans across the whole episode, knowing that spending too much early leaves nothing for later.

---

## 3. The Environment: How It Works

An **episode** works like this:

```
START EPISODE
  Budget = 50.0 units
  Draw 10 questions (mix of domains)

  FOR each question:
    Show agent: question text, domain, remaining budget, context window

    LOOP:
      Agent picks a tool and sends a query
      Environment runs the tool, charges the cost, returns results
      Results are added to the agent's context window for this question

      IF agent calls "commit":
        Grade the answer (Exact Match + Token F1)
        Calculate reward
        Clear context window
        Move to next question
        BREAK

      IF steps_on_this_question >= 8:
        Auto-advance (no commit reward bonus)
        BREAK

      IF budget_spent >= total_budget:
        Episode ends immediately

END EPISODE
```

The agent sees an **observation** at every step containing:
- The question text and domain tag
- How much budget is left and what fraction of the total that is
- What tools it already called on this question and what they returned
- How many questions remain
- Running accuracy so far

---

## 4. The Six Tools

Each tool is a Python function that takes the agent's action and returns a result. Here's exactly what each one does, technically:

---

### `ceramic_search` — Cost: **1.0**

**What it does:** Sends a POST request to the [Ceramic AI](https://ceramic.ai) search API (`https://api.ceramic.ai/search`) with the agent's query. Returns up to 5 web results with title, URL, and a description snippet.

**Best for:** HotpotQA questions that require finding factual information spread across multiple web sources.

**Technical note:** No pagination parameter is supported — Ceramic returns up to 10 results per call, we slice to the top 5. The API requires only `{"query": "..."}` in the POST body.

**Fallback:** If no `CERAMIC_API_KEY` is set, `FallbackCeramicClient` generates deterministic fake results using SHA-256 hashing — so tests are reproducible offline.

```python
# What the tool receives:
action.query = "Who was the first person to walk on the moon?"

# What it sends to Ceramic:
POST https://api.ceramic.ai/search
{"query": "Who was the first person to walk on the moon?"}

# What the agent gets back (in context_window):
"[ceramic_search] **NASA Moon Landing** (score: 892.3)
Neil Armstrong became the first human to step onto the Moon...

**Apollo 11 Mission** (score: 741.1)
On July 20, 1969, Neil Armstrong and Buzz Aldrin landed..."
```

---

### `wiki_lookup` — Cost: **0.5**

**What it does:** Hits the Wikipedia REST API (`https://en.wikipedia.org/api/rest_v1/page/summary/{title}`) and returns the first paragraph of the article for the queried topic. No API key required — Wikipedia is free.

**Best for:** Factual entity lookups where you know the subject (e.g., "Albert Einstein", "World War II"). Cheaper than search and more reliable for well-known topics.

**Technical note:** The query string becomes the Wikipedia article title (spaces replaced with underscores). Returns a 404 error result if the article doesn't exist — the agent can then try a different query.

```python
action.query = "William Shakespeare"
# Hits: https://en.wikipedia.org/api/rest_v1/page/summary/William_Shakespeare
# Returns: "William Shakespeare was an English playwright, poet and actor..."
```

---

### `calculator` — Cost: **0.1**

**What it does:** Evaluates a math expression safely using Python's `ast` (Abstract Syntax Tree) module. The expression is parsed into a tree structure, and only pre-approved operations are allowed (addition, subtraction, multiplication, division, power, modulo, comparisons, and common math functions like `sqrt`, `log`, `sin`, `cos`).

**Why not just use `eval()`?** Because `eval("__import__('os').system('rm -rf /')")` would delete your hard drive. The AST approach means the code is never executed — it's parsed into a data structure and we only compute what we explicitly allow.

**Best for:** MATH competition problems, any arithmetic, symbolic computations.

```python
action.expression = "sqrt(144) + 3 * 7"
# Returns: "23.0"

action.expression = "2 ** 10"
# Returns: "1024"

action.expression = "import os"  # BLOCKED — not a valid math expression
# Returns: "[Calc error: Unsupported AST node: Import]"
```

---

### `code_executor` — Cost: **0.3**

**What it does:** Runs Python code in a sandboxed `exec()` environment for intended coding tasks. Captures whatever is printed to stdout and returns it as the result.

**Security model:** Blocks import statements, dangerous builtin names such as `open`, `eval`, `exec`, `globals`, and obvious object-graph escape paths such as dunder attribute traversal. Only a curated builtin/module surface is exposed.

**Best for:** HumanEval coding tasks where the agent needs to actually run code to verify correctness.

```python
action.code_snippet = """
def fibonacci(n):
    if n <= 1: return n
    return fibonacci(n-1) + fibonacci(n-2)

print(fibonacci(10))
"""
# Returns: "55"
```

---

### `llm_reason` — Cost: **2.0**

**What it does:** Sends the query to a large language model via the [Together AI](https://together.ai) API (default model: `meta-llama/Llama-3-8b-chat-hf`) and returns up to 512 tokens of chain-of-thought reasoning.

**Why so expensive?** It reflects reality — calling a hosted LLM costs real money per token. The 2.0 cost means the agent burns through 4% of its total episode budget on a single LLM call. It should only do this when genuinely necessary.

**Best for:** GPQA graduate-level science problems where factual retrieval isn't enough and actual reasoning is required.

**Graceful fallback:** If `TOGETHER_API_KEY` is not set, returns a clear error message instead of crashing. The agent learns to avoid this tool when it's unavailable.

**Tool routing note:** The environment exposes the canonical tool manifest at `GET /tools`, and tool dispatch normalizes missing-tool and tool-crash cases into explicit `ToolResult` errors. That keeps the OpenEnv-style contract stable even when a backing service is missing.

---

### `commit` — Cost: **0.0**

**What it does:** Submits the agent's answer for grading. This is always free — there's no penalty for committing, only for being wrong.

When commit is called:
1. The answer is extracted from `action.answer`
2. It's graded against the ground truth (see Section 7)
3. A reward is computed (see Section 5)
4. The context window is cleared
5. The episode advances to the next question

**Strategic note:** The agent should commit as soon as it's confident. Every extra tool call after reaching sufficient confidence is wasted budget.

---

## 5. The Reward Formula — Deep Dive

This is the core intellectual contribution of the environment. The reward has two components.

### Part 1: Step Reward (every tool call)

```
R_step = -tool_cost
```

Every time the agent calls a tool (including commit, which has cost 0), it gets a negative reward equal to the tool's cost. This creates constant pressure to be efficient.

| Tool | Step Reward |
|---|---|
| `ceramic_search` | -1.0 |
| `wiki_lookup` | -0.5 |
| `calculator` | -0.1 |
| `code_executor` | -0.3 |
| `llm_reason` | -2.0 |
| `commit` | 0.0 |

### Part 2: Commit Reward (on submit)

```
R_commit = base + bonus

base  = incorrect_reward + quality × (correct_reward − incorrect_reward)
      = -0.5 + quality × 1.5

bonus = η × γ × budget_remaining_ratio
      η = 1  if quality ≥ 0.5, else 0
      γ = 0.1  (efficiency weight)
      budget_remaining_ratio = remaining_budget / total_budget
```

Let's walk through what this means with real examples.

---

**Example A: Correct answer, lots of budget left**

The agent uses one `calculator` call (cost 0.1) and commits the right answer with 49.9 budget remaining out of 50.

```
quality = 1.0  (exact match)
base    = -0.5 + 1.0 × 1.5 = 1.0
η       = 1  (quality 1.0 ≥ threshold 0.5)
bonus   = 1 × 0.1 × (49.9/50) = 0.0998
R_step  = -0.1  (from the calculator call)

Total reward this question = -0.1 + 1.0 + 0.0998 = +0.9998
```

**Example B: Correct answer, lots of budget spent**

The agent uses three `ceramic_search` calls (cost 3.0 total) and commits right.

```
quality = 1.0
base    = 1.0
η       = 1
bonus   = 1 × 0.1 × (47/50) = 0.094
R_steps = -3.0  (three search calls)

Total = -3.0 + 1.0 + 0.094 = -1.906
```

Same correct answer, but much worse reward because of wasted tool calls.

**Example C: Wrong answer**

```
quality = 0.0  (wrong)
base    = -0.5 + 0.0 × 1.5 = -0.5
η       = 0  (quality 0.0 < threshold 0.5)  — no efficiency bonus
bonus   = 0

Total = R_steps + (-0.5)
```

**Example D: Partially correct answer (F1 = 0.6)**

```
quality = 0.6  (F1 score, close but not exact)
base    = -0.5 + 0.6 × 1.5 = 0.4
η       = 1  (0.6 ≥ 0.5)
bonus   = 1 × 0.1 × budget_ratio
```

Partial credit exists. This encourages the agent to try even when uncertain rather than just passing.

---

### Why this formula shape?

**The efficiency bonus gate (η):** The bonus only applies when quality ≥ 0.5. This prevents a degenerate strategy where the agent commits immediately with a random guess, earns a small efficiency bonus for having tons of budget left, and never actually tries to answer correctly.

**The linear quality scaling:** Rather than binary right/wrong, the agent gets gradual signal. Answering "Neil Armstrong" when the answer is "Neil Armstrong, Buzz Aldrin" gets partial credit. This makes learning easier because there's always a gradient to follow.

**The budget remaining ratio:** As budget drains, each correct answer is worth slightly less in efficiency bonus. This pushes the agent to be consistently frugal across all questions, not just on the last one.

---

## 6. The Datasets

Questions are drawn from four HuggingFace datasets, mixed according to `domain_mix` (default: 40/30/20/10):

### HotpotQA (40% of questions)
- **What it is:** Multi-hop Wikipedia questions that require connecting information from two or more sources.
- **Example:** *"What government position was held by the woman who portrayed Corliss Archer in the radio series?"*
- **Why it's hard:** A single search won't answer it. You need to find who played Corliss Archer, then look up that person's government role.
- **Best tool:** `ceramic_search` or `wiki_lookup` (multiple calls)
- **HuggingFace:** `hotpotqa/hotpot_qa`

### MATH (30% of questions)
- **What it is:** Competition mathematics problems at difficulty levels 3-5 (out of 5). These are AMC/AIME-style problems.
- **Example:** *"Find all real solutions to x³ - 6x² + 11x - 6 = 0"*
- **Why it's hard:** Requires algebraic reasoning, not just lookup. Level 5 problems stump most college students.
- **Best tool:** `calculator` for arithmetic steps, `code_executor` for complex algebra, `llm_reason` for symbolic reasoning
- **HuggingFace:** `DigitalLearningGmbH/MATH-lighteval`

### GPQA (20% of questions)
- **What it is:** Graduate-level questions in biology, chemistry, and physics, written by PhD students and vetted by domain experts.
- **Example:** *"Which of the following correctly describes the role of the TATA-binding protein in eukaryotic transcription initiation?"*
- **Why it's hard:** Even searching the exact question won't help — you need to understand the underlying science. This is where `llm_reason` earns its 2.0 cost.
- **Best tool:** `llm_reason`, possibly with a `ceramic_search` for context
- **HuggingFace:** `Idavidrein/gpqa` (gated — requires HF_TOKEN)

### HumanEval (10% of questions)
- **What it is:** Python programming tasks with a function signature and docstring. The agent must complete the function.
- **Example:** *"def is_palindrome(string: str) -> bool: \n    \"\"\"Test if a string is a palindrome.\"\"\""*
- **Why it's hard:** The agent needs to produce syntactically correct, logically correct code — and ideally run it to verify.
- **Best tool:** `code_executor` (write code, run it, verify output), `llm_reason` (generate code via LLM)
- **HuggingFace:** `openai/openai_humaneval`

---

## 7. Answer Grading

Grading happens in `env/answer_grading.py` when the agent calls `commit`.

### Step 1: Answer Extraction

The agent's answer string often contains extra text. We extract the actual answer using this priority chain:

1. **Strip markdown code fences** — remove ` ``` ` blocks
2. **Try JSON parsing** — if the answer looks like `{"answer": "Paris"}`, extract the `answer` field
3. **Prefix matching** — look for patterns like `"Answer: Paris"` or `"Final answer: Paris"`
4. **Last line fallback** — take the last non-empty line of the response

### Step 2: Normalization

Both the extracted answer and the ground truth are normalized the same way:
1. Lowercase everything
2. Remove articles (`a`, `an`, `the`)
3. Remove all punctuation
4. Split into tokens (words)

*Example:* `"The United States of America."` → `["united", "states", "america"]`

### Step 3: Exact Match (EM)

Are the normalized token lists identical?

```
predicted: "neil armstrong"    → ["neil", "armstrong"]
gold:      "Neil Armstrong"    → ["neil", "armstrong"]
EM = True ✓

predicted: "armstrong"         → ["armstrong"]
gold:      "Neil Armstrong"    → ["neil", "armstrong"]
EM = False ✗
```

### Step 4: Token F1

Token F1 measures overlap. It counts how many tokens appear in both the prediction and the ground truth, then computes precision and recall.

```
F1 = 2 × precision × recall / (precision + recall)

precision = common_tokens / predicted_tokens
recall    = common_tokens / gold_tokens
```

*Example:*
```
predicted: ["neil", "armstrong", "astronaut"]
gold:      ["neil", "armstrong"]

common  = ["neil", "armstrong"]  (2 tokens)
precision = 2/3 = 0.667
recall    = 2/2 = 1.0
F1 = 2 × 0.667 × 1.0 / (0.667 + 1.0) = 0.8
```

### Step 5: Quality Score

```python
quality = 1.0  if exact_match
quality = f1   otherwise
```

The quality score feeds directly into the reward formula. A perfect answer gets quality=1.0, a partial answer might get 0.4-0.8, and a completely wrong answer gets 0.0.

---

## 8. The Baselines

Three baseline policies ship with the environment. They don't learn — they follow fixed rules. Their purpose is to establish performance floors that a trained RL agent should beat.

### Random Tool (`baselines/random_tool.py`)

Picks a tool uniformly at random from the 5 non-commit tools each step. Commits after 3 steps with "I don't know."

**Expected behavior:** Wastes budget on expensive tools (llm_reason, ceramic_search) even for simple math problems. Gets ~20-30% accuracy by luck. Terrible budget efficiency.

**Why it exists:** Sets the absolute floor. Any RL agent that can't beat random is broken.

---

### Cheapest First (`baselines/cheapest_first.py`)

Calls tools in order of ascending cost: calculator (0.1) → code_executor (0.3) → wiki_lookup (0.5) → ceramic_search (1.0) → llm_reason (2.0). Commits after exhausting its call budget.

**Expected behavior:** Great budget efficiency. Terrible accuracy on HotpotQA and GPQA because it always tries the calculator first even on factual questions.

**Why it exists:** Shows that frugality alone isn't the answer. You need to route by domain, not just price.

---

### Domain Oracle (`baselines/oracle.py`)

Uses a hardcoded domain-to-tool mapping:
- HotpotQA → `ceramic_search`, then `wiki_lookup`
- MATH → `calculator`, then `llm_reason`
- GPQA → `llm_reason`, then `ceramic_search`
- HumanEval → `code_executor`, then `llm_reason`

Commits after exhausting its domain-specific sequence.

**Expected behavior:** Best accuracy of the three baselines. Still suboptimal because it doesn't adapt within a domain or manage budget across questions.

**Why it exists:** The performance ceiling for non-learning approaches. A trained RL agent should eventually exceed this by learning subtler patterns.

---

## 9. Where This Came From: SearchEconomicsEnv

ToolOrchestratorEnv is a direct generalization of [SearchEconomicsEnv](https://github.com/sharma-yash01/SearchEconomicsEnv), built by Yashaswi Sharma (University of Southern California) and Ceramic AI.

SearchEconomicsEnv posed a simpler version of the same question: given a fixed number of **search calls**, can an RL agent learn to answer HotpotQA questions efficiently? It used one tool (search), one dataset (HotpotQA), and Weitzman-style budget penalties.

The insight from that work was that agents could learn non-trivial search strategies — knowing when one search was enough versus when multiple hops were needed.

ToolOrchestratorEnv asks the harder question: **can the same principle scale to multiple tools and multiple domains?** Instead of "how many searches do I make?", the question becomes "which tool do I pick for this type of question, at this point in my budget?"

| | SearchEconomicsEnv | ToolOrchestratorEnv |
|---|---|---|
| Tools | 1 (search) | 6 (search, wiki, calc, code, LLM, commit) |
| Datasets | HotpotQA only | HotpotQA + MATH + GPQA + HumanEval |
| Budget unit | # of search calls | cost units per tool |
| Core challenge | How many searches? | Which tool, when? |
| Retrieval backend | Ceramic AI | Ceramic AI (shared) |

---

## 10. What a Trained Agent Should Learn

A well-trained agent should exhibit these behaviors:

**Domain routing.** When the question is math, skip search and go straight to calculator. When it's factual multi-hop, start with search. When it's graduate science, bite the bullet on llm_reason.

**Confidence-based committing.** If the calculator returned a clean number and the question was arithmetic, commit immediately. Don't spend another 0.5 on a Wikipedia lookup you don't need.

**Budget awareness.** In early questions with plenty of budget, it's okay to use ceramic_search. By question 8, with only 5 units left and 3 questions remaining, switch to calculator-only even for non-math questions.

**Failure recovery.** If the first tool call returns garbage (wrong article, irrelevant search results), try a different tool rather than committing a bad answer.

**These are the behaviors that baselines can't exhibit** — they require learning from feedback across thousands of episodes, which is exactly what RL provides.

---

## 11. File-by-File Reference

```
ToolOrchestratorEnv/
│
├── app.py
│   The FastAPI web server. Handles /reset, /step, /health, /tools, /web.
│   Multi-session: each /reset returns a session_id used in /step.
│   Lazily loads the dataset and exposes the canonical tool manifest.
│
├── openenv.yaml
│   Deployment spec for the OpenEnv competition framework.
│
├── Dockerfile
│   Builds the HuggingFace Space container. Runs uvicorn on port 8000.
│
├── env/
│   ├── environment.py
│   │   ToolOrchestratorEnvironment class. The main game loop.
│   │   reset() — starts episode, samples questions, zeroes budget.
│   │   step()  — validates action, dispatches tool, charges cost,
│   │             handles commit (grade + reward), manages episode end.
│   │   _make_obs() — assembles the observation dict from current state.
│   │   _sample_questions() — stratified sampling from dataset by domain.
│   │
│   ├── models.py
│   │   Pydantic types that define the agent-environment interface:
│   │   OrchestratorAction, ToolResult, OrchestratorObservation, OrchestratorState
│   │   TOOL_IDS — the canonical list of valid tool names.
│   │
│   ├── config.py
│   │   EnvConfig dataclass. All tuneable parameters:
│   │   total_budget, num_questions, tool_costs, domain_mix, reward weights.
│   │
│   ├── answer_grading.py
│   │   grade(predicted, gold) → (exact_match, f1, quality)
│   │   normalize_answer(), exact_match(), token_f1(), extract_answer()
│   │
│   └── reward.py
│       step_reward(tool_id, config) → negative cost
│       commit_reward(quality, budget_ratio, config) → composite score
│
├── ceramic/
│   └── client.py
│       CeramicClient — live API calls to https://api.ceramic.ai/search
│       FallbackCeramicClient — deterministic offline fake results
│       get_ceramic_client() — reads CERAMIC_API_KEY env var, returns right client
│
├── data/
│   └── loader.py
│       load_all(split, max_per_domain) — loads from HuggingFace datasets.
│       Falls back to synthetic questions if a dataset is unavailable.
│       Returns flat List[Dict] with 'domain' key on each item.
│
├── tools/
│   ├── runtime.py        Tool catalog, validation, and explicit dispatch
│   ├── __init__.py       build_tool_registry() + tool manifest helpers
│   ├── ceramic_search.py make_search_tool() factory wrapping CeramicClient
│   ├── wiki_lookup.py    Wikipedia REST API, first paragraph
│   ├── calculator.py     Safe AST-based math eval with comparisons
│   ├── code_executor.py  Sandboxed exec with blocked imports and dunder escapes
│   ├── llm_reason.py     Together AI API, graceful fallback
│   └── commit.py         Pass-through; grading is in environment.py
│
└── baselines/
    ├── random_tool.py    Uniform random tool selection
    ├── cheapest_first.py Always picks cheapest tool first
    └── oracle.py         Domain-aware heuristic routing
```

---

*This document describes the environment as implemented. For the blog post draft (with academic citations and related work), see `BLOG_PROMPT.md`.*