Spaces: landrew9/ToolOrchestratorEnv

Andrew Lara Claude Sonnet 4.6 committed on Apr 13

Commit

8ca3a35

0 Parent(s):

Initial implementation of ToolOrchestratorEnv

Multi-tool cost-aware RL environment built on SearchEconomicsEnv.
Agent selects from 6 tools (ceramic_search, wiki_lookup, calculator,
code_executor, llm_reason, commit) across 4 QA domains under a shared
budget constraint. Weitzman-style composite reward (quality + efficiency
bonus). OpenEnv-compatible FastAPI server, Dockerfile for HF Spaces.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Files changed (31)

.env.example +11 -0
.gitignore +20 -0
BLOG_PROMPT.md +115 -0
Dockerfile +29 -0
README.md +213 -0
app.py +208 -0
baselines/__init__.py +0 -0
baselines/cheapest_first.py +69 -0
baselines/oracle.py +76 -0
baselines/random_tool.py +67 -0
ceramic/__init__.py +3 -0
ceramic/client.py +133 -0
client.py +123 -0
data/__init__.py +0 -0
data/loader.py +270 -0
env/__init__.py +0 -0
env/answer_grading.py +91 -0
env/config.py +53 -0
env/environment.py +299 -0
env/models.py +111 -0
env/reward.py +57 -0
environment.py +307 -0
openenv.yaml +6 -0
requirements.txt +8 -0
tools/__init__.py +29 -0
tools/calculator.py +74 -0
tools/ceramic_search.py +34 -0
tools/code_executor.py +63 -0
tools/commit.py +12 -0
tools/llm_reason.py +43 -0
tools/wiki_lookup.py +38 -0

.env.example ADDED

	@@ -0,0 +1,11 @@

+# Copy this file to .env and fill in your values.
+# On HuggingFace Spaces, set these as Space Secrets (Settings → Variables and secrets).
+# Required: Ceramic AI search key (sign up at https://ceramic.ai)
+CERAMIC_API_KEY=cer_sk_live_...
+# Optional: Together AI key for the llm_reason tool (https://api.together.xyz)
+TOGETHER_API_KEY=
+# HuggingFace token — only needed if you load gated datasets (e.g. GPQA)
+HF_TOKEN=

.gitignore ADDED

	@@ -0,0 +1,20 @@

+# Never commit real secrets
+.env
+# Python
+__pycache__/
+*.pyc
+*.pyo
+.venv/
+dist/
+*.egg-info/
+# Data / cache
+data/*.jsonl
+data/*.npy
+data/*.json
+# Editor
+.DS_Store
+.idea/
+.vscode/

BLOG_PROMPT.md ADDED

	@@ -0,0 +1,115 @@

+# Blog Writing Prompt
+Drop this entire block into a fresh Claude conversation to generate the research blog post.
+---
+## PROMPT
+You are writing a research blog post about a reinforcement learning environment called **ToolOrchestratorEnv**. The authors are Andrew and Yash Sharma (USC), building on Yash's prior work on SearchEconomicsEnv. The target audience is ML researchers and practitioners who read papers like those at NeurIPS, ICLR, and the Hugging Face blog — people who understand RL and LLMs but are not experts in tool-use or search economics.
+The post should be **1,500–2,000 words**, well-structured with section headers, and written in a direct, confident academic-blog tone (think: The Gradient, Hugging Face blog, or a good arXiv blog post). Avoid hype. Let the ideas do the work.
+---
+### What this research is and why it matters
+**The core problem:** Large language model agents are increasingly given access to tools — search engines, calculators, code interpreters, databases. In real deployments, every tool call costs something: API fees, latency, rate limits, or compute. Current agent frameworks treat tools as free: call whatever you want, as many times as you want. This is unrealistic and economically wasteful.
+**The research gap:** Most RL environments for tool-using agents either (a) focus on a single tool (e.g. search-only retrieval agents), or (b) ignore cost entirely and measure only answer quality. There is no standard RL training ground where the agent must *choose between tools with different price/quality tradeoffs* under a shared budget constraint.
+**What we built:** ToolOrchestratorEnv — an OpenEnv-compatible RL environment that puts cost-aware tool selection at the center of the learning objective. The agent picks from six tools per step (web search, Wikipedia, calculator, Python executor, LLM reasoning, or commit) across four question domains (HotpotQA, MATH, GPQA, HumanEval), with a shared budget that depletes as tools are called.
+**Why this is novel:**
+- It extends the "search economics" framing from a single tool to a heterogeneous tool portfolio
+- It tests transfer: can an agent learn that calculators are cheap and LLMs expensive, and route accordingly?
+- The multi-domain setup forces the agent to learn *domain → tool* mappings (search for factual QA, calculator for math) rather than one-size-fits-all policies
+- The Weitzman-style reward (efficiency bonus only on correct + frugal commits) creates a richer credit assignment problem than binary success/failure
+---
+### Background sections to write (with sources to find and cite)
+**1. The tool-use agent landscape**
+Explain why tool use is now central to LLM agents. Cite and discuss:
+- The ReAct paper (Yao et al., 2022) — introduced interleaving reasoning and tool calls
+- Toolformer (Schick et al., 2023) — self-supervised tool learning
+- ToolBench / API-Bank — benchmarks for tool-using LLMs
+- Find at least one recent paper (2024 or 2025) showing that tool-calling agents outperform tool-free baselines on knowledge-intensive tasks. Look at arXiv, ACL Anthology, or the Hugging Face papers page.
+**2. Search economics and the budget constraint**
+Explain the economic analogy: information has a cost, and rational agents should not search more than their expected marginal gain from search. Cite:
+- Weitzman (1979) "Optimal Search for the Best Alternative" — the foundational search economics paper
+- SearchEconomicsEnv by Yash Sharma / USC (https://github.com/sharma-yash01/SearchEconomicsEnv, https://huggingface.co/spaces/yashu2000/search-economics-env) — the direct predecessor that built this RL environment for search-budget-constrained HotpotQA
+- Look for any recent work on "budgeted retrieval" or "adaptive retrieval" in RAG systems (2024-2025) that shows that unconstrained retrieval hurts performance or cost-effectiveness. Papers like FLARE, IterRetGen, or similar might be relevant.
+**3. The multi-domain challenge**
+Explain why testing across HotpotQA, MATH, GPQA, and HumanEval matters — these domains need fundamentally different tools (search for factual, calculator for symbolic, code for algorithmic, LLM for graduate-level). Find and cite:
+- The MATH benchmark paper (Hendrycks et al., 2021)
+- HotpotQA paper (Yang et al., 2018)
+- GPQA paper (Rein et al., 2023)
+- HumanEval paper (Chen et al., 2021)
+- Any paper showing that tool specialisation helps across domains (e.g., PAL, PoT, or similar)
+**4. Reinforcement learning for tool selection**
+Explain why RL (not just prompting or supervised learning) is the right frame for this problem: the agent must explore, face delayed rewards (only know if an answer was right after commit), and learn multi-step strategies. Cite:
+- Any recent paper using RL for LLM agent training (e.g., RLHF extensions, agent-specific RL work, or OpenEnv/AgentBench)
+- The OpenEnv competition framework (Berkeley RDI, AgentX) — explain what OpenEnv is and why standardised RL environments matter for reproducibility
+- Look for "process reward models" or "step-level reward" papers in the agent RL space
+---
+### Key sections for the post
+1. **The problem with free tools** — hook paragraph. Real API calls cost money. Agents don't know this. Set up the gap.
+2. **Search economics, briefly** — one paragraph on Weitzman, one on SearchEconomicsEnv. The framing: information retrieval as a market with prices.
+3. **ToolOrchestratorEnv: the environment** — describe the setup clearly:
+   - 6 tools, 4 datasets, shared budget
+   - The action-observation loop (what the agent sees, what it decides)
+   - The reward formula (explain it intuitively: you pay for every call, you earn back on correct commits, and get a bonus for answering correctly without blowing your budget)
+   - The Ceramic AI integration for live web retrieval
+4. **Why this is hard** — explain the credit assignment problem (you don't know a tool call was wasted until you commit), the domain-routing challenge, and the exploration-exploitation tradeoff under budget pressure.
+5. **Baselines and what they tell us** — describe the three baselines (random, cheapest-first, domain-oracle) and what their expected performance reveals about the structure of the problem.
+6. **What we're building toward** — the research agenda: train a PPO or DQN agent on this environment, show it beats baselines, and study what routing policies it learns. Can it learn that LLM reasoning is worth 2x the cost for GPQA but wasteful for simple arithmetic?
+7. **Conclusion** — the broader point: as AI systems become more agentic, cost-aware tool selection will be as important as answer quality. We need RL environments that take this seriously. This is one.
+---
+### Tone and style guidelines
+- **Cite real papers** — do not make up citations. For any claim about related work, search arXiv, Semantic Scholar, or ACL Anthology and use the actual paper. Format citations inline as (Author et al., Year) with a references section at the end.
+- **Be specific** — don't say "researchers have shown" without naming the paper.
+- **Write for skeptics** — assume your reader will ask "why does this matter" and "what's actually new." Answer those questions directly in the text.
+- **Avoid marketing language** — no "revolutionary," "groundbreaking," or "state-of-the-art." Just describe what was built and why it's useful.
+- **Include the reward formula** — write it out mathematically and then explain it in plain English. Researchers appreciate seeing the actual math.
+- **Link to the HF Space** — mention that the predecessor, SearchEconomicsEnv, is live at https://huggingface.co/spaces/yashu2000/search-economics-env and that ToolOrchestratorEnv will be deployed alongside it.
+---
+### What NOT to do
+- Do not fabricate benchmark numbers — we don't have trained agent results yet, only baseline results. Say so honestly.
+- Do not claim this is the first RL environment for tool use — be accurate about prior work.
+- Do not skip the related work — proving the gap is real requires engaging with existing papers.
+- Do not make the reward formula paragraph too short — this is a key technical contribution; spend time on it.
+---
+### Final checklist before finishing the post
+- [ ] Every citation is real and can be found on arXiv or a peer-reviewed venue
+- [ ] The reward formula is written out and explained in plain English
+- [ ] The post explains what OpenEnv is and why deploying on HF Spaces matters
+- [ ] The post mentions Ceramic AI and explains why live web retrieval matters (vs. static knowledge)
+- [ ] The baseline section sets up what "winning" looks like for a trained RL agent
+- [ ] A references section is included at the end with full citations

Dockerfile ADDED

	@@ -0,0 +1,29 @@

+# HuggingFace Spaces — ToolOrchestratorEnv
+# Builds a FastAPI server that exposes the OpenEnv endpoints.
+#
+# Required secret (set in Space Settings → Variables and secrets):
+#   CERAMIC_API_KEY=cer_sk_live_...
+#
+# Optional:
+#   TOGETHER_API_KEY=...  (enables the llm_reason tool)
+#   HF_TOKEN=...          (enables gated datasets like GPQA)
+FROM python:3.11-slim
+WORKDIR /app
+# Install dependencies first so Docker layer-caches them
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+# Copy source
+COPY . .
+# HuggingFace Spaces runs as a non-root user
+RUN useradd -m -u 1000 appuser && chown -R appuser /app
+USER appuser
+EXPOSE 8000
+# The Space README sets base_path: /web so the demo UI loads on open
+CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

README.md ADDED

	@@ -0,0 +1,213 @@

+---
+title: Tool Orchestrator Environment
+emoji: 🔧
+colorFrom: blue
+colorTo: purple
+sdk: docker
+pinned: false
+app_port: 8000
+base_path: /web
+tags:
+  - openenv
+  - reinforcement-learning
+  - tool-use
+  - cost-aware
+---
+# ToolOrchestratorEnv
+**An OpenEnv-compatible reinforcement learning environment for multi-tool, cost-aware question answering.**
+Built on top of [SearchEconomicsEnv](https://huggingface.co/spaces/yashu2000/search-economics-env) (Yash Sharma, USC / Ceramic AI), this environment generalises the single-tool (search-only) formulation to a full **tool-selection problem**: the agent must choose *which* of six tools to call at each step, managing a shared cost budget across a multi-domain question set (HotpotQA, MATH, GPQA, HumanEval).
+The core research question: **can an RL agent learn a cost-aware tool routing policy that outperforms simple heuristics like "always search" or "always use the cheapest tool"?**
+---
+## What the agent learns
+Each episode the agent receives 10 questions sampled across four domains. At every step it sees:
+- The current **question** and its **domain** tag
+- Its **remaining budget** (shared across all questions)
+- The **context window** — concatenated outputs from prior tool calls on this question
+It picks one action from six tools:
+| Tool | `tool_id` | Cost | Best for |
+|---|---|---|---|
+| Ceramic web search | `ceramic_search` | 1.0 | Multi-hop factual QA |
+| Wikipedia lookup | `wiki_lookup` | 0.5 | Entity facts, definitions |
+| Calculator | `calculator` | 0.1 | Arithmetic, symbolic math |
+| Python executor | `code_executor` | 0.3 | HumanEval code tasks |
+| LLM reasoning | `llm_reason` | 2.0 | Graduate-level GPQA problems |
+| Commit answer | `commit` | 0.0 | Submit and move to next question |
+**The RL objective:** maximise accuracy across all questions while staying within the total budget — learning *which tool to call*, in *which order*, and *when to stop and commit*.
+---
+## Reward formula
+```
+On tool call:   R = -tool_cost
+On commit:      R = base + η · γ · budget_remaining_ratio
+  base     = incorrect_reward + quality · (correct_reward − incorrect_reward)
+  quality  = max(ExactMatch, TokenF1)
+  η        = 1  if quality ≥ efficiency_bonus_threshold, else 0
+  γ        = efficiency_bonus_weight
+```
+The efficiency bonus applies only when answer quality reaches `efficiency_bonus_threshold`, and it scales with the fraction of budget still unspent, so the reward favours both accuracy and frugality.
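The formula above can be sketched in a few lines of Python. This is an illustration only: the numeric defaults below are placeholders picked for the example (the real values live in `env/config.py`, which is not shown in this part of the diff), and the function signatures are assumed rather than copied from `env/reward.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardParams:
    # Placeholder values for illustration; the real defaults are in env/config.py.
    correct_reward: float = 1.0
    incorrect_reward: float = -0.5
    efficiency_bonus_threshold: float = 0.8
    efficiency_bonus_weight: float = 0.5


def step_reward(tool_cost: float) -> float:
    """Reward for a non-commit tool call: the negated tool cost."""
    return -tool_cost


def commit_reward(exact_match: float, token_f1: float,
                  budget_remaining_ratio: float,
                  p: RewardParams = RewardParams()) -> float:
    """Reward on commit: interpolated base plus a gated efficiency bonus."""
    quality = max(exact_match, token_f1)
    base = p.incorrect_reward + quality * (p.correct_reward - p.incorrect_reward)
    # eta gates the bonus: it is paid only when quality clears the threshold.
    eta = 1.0 if quality >= p.efficiency_bonus_threshold else 0.0
    return base + eta * p.efficiency_bonus_weight * budget_remaining_ratio
```

With these placeholder values, a fully correct answer committed with half the budget left earns the base reward plus half of the bonus weight, while a partially correct answer below the threshold earns only the interpolated base.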
+---
+## Quickstart (local)
+```bash
+# 1. Clone and install
+git clone <this-repo>
+cd claude_toolOrchestrator
+pip install -r requirements.txt
+# 2. Configure keys (copy the example and fill in values)
+cp .env.example .env
+# Set CERAMIC_API_KEY — sign up free at https://ceramic.ai
+# 3. Start the server
+uvicorn app:app --port 8000
+# 4. Try the interactive demo UI
+open http://localhost:8000/web
+# or browse the full OpenAPI spec at
+open http://localhost:8000/docs
+```
+---
+## HTTP API
+### `POST /reset`
+Start a new episode. Returns `session_id`, initial `observation`, and `state`.
+```json
+{ "seed": 42, "config_overrides": { "total_budget": 30.0, "num_questions": 5 } }
+```
+### `POST /step?session_id=<id>`
+Execute one tool call. Pass `session_id` (from `/reset`) as a query param to support parallel agents.
+```json
+{ "tool_id": "ceramic_search", "query": "When was the Eiffel Tower built?" }
+{ "tool_id": "calculator",     "expression": "sqrt(144) + 3" }
+{ "tool_id": "code_executor",  "code_snippet": "print(2 ** 10)" }
+{ "tool_id": "commit",         "answer": "1889" }
+```
+### `GET /health`
+Returns `{"status": "ok"}`.
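A minimal client sketch for the endpoints above, using only the standard library. The helper names are hypothetical, and sending `query` for `wiki_lookup` and `llm_reason` follows what the demo UI in `app.py` does rather than a documented contract. The search text and the committed answer are illustrative; real questions come from the dataset.

```python
import json
import urllib.request

# Which request field carries the input for each tool (mirrors the /step examples).
_FIELD_FOR_TOOL = {
    "calculator": "expression",
    "code_executor": "code_snippet",
    "commit": "answer",
}


def build_step_payload(tool_id: str, text: str) -> dict:
    """Build a /step body; tools not listed above take their input in `query`."""
    return {"tool_id": tool_id, _FIELD_FOR_TOOL.get(tool_id, "query"): text}


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON body and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_one_question(base_url: str = "http://localhost:8000") -> dict:
    """Reset, make one search call, then commit. Requires a running server."""
    reset = post_json(f"{base_url}/reset", {"seed": 42})
    step_url = f"{base_url}/step?session_id={reset['session_id']}"
    post_json(step_url, build_step_payload("ceramic_search", "When was the Eiffel Tower built?"))
    return post_json(step_url, build_step_payload("commit", "1889"))
```

Keeping the payload builder separate from the network call makes the field mapping easy to check without a server.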
+---
+## Project layout
+```
+claude_toolOrchestrator/
+│
+├── app.py                  # FastAPI server — multi-session, OpenAPI, demo UI
+├── openenv.yaml            # OpenEnv deployment spec
+├── requirements.txt        # Python dependencies
+├── .env.example            # Key template (copy → .env, never commit .env)
+│
+├── env/                    # ── Core RL environment ──────────────────────────
+│   ├── environment.py      # ToolOrchestratorEnvironment: reset() + step()
+│   ├── models.py           # Pydantic types: Action, Observation, State, ToolResult
+│   ├── config.py           # EnvConfig dataclass: budget, costs, reward weights
+│   ├── answer_grading.py   # grade() → (exact_match, f1, quality)
+│   └── reward.py           # step_reward() + commit_reward()
+│
+├── ceramic/                # ── Retrieval backend ────────────────────────────
+│   └── client.py           # CeramicClient (live) + FallbackCeramicClient (offline)
+│
+├── data/                   # ── Dataset loading ──────────────────────────────
+│   └── loader.py           # load_all() → flat list from 4 HF datasets
+│
+├── tools/                  # ── Six tool implementations ─────────────────────
+│   ├── ceramic_search.py   # Web search (Ceramic AI API)
+│   ├── wiki_lookup.py      # Wikipedia REST API, first paragraph
+│   ├── calculator.py       # Safe AST-based math evaluator (no exec)
+│   ├── code_executor.py    # Sandboxed Python exec (blocks os/sys/subprocess)
+│   ├── llm_reason.py       # Together AI chain-of-thought (graceful fallback)
+│   └── commit.py           # Answer pass-through; grading runs in environment
+│
+└── baselines/              # ── Reference policies ───────────────────────────
+    ├── random_tool.py      # Uniform random tool selection
+    ├── cheapest_first.py   # Always picks cheapest non-commit tool first
+    └── oracle.py           # Domain-aware heuristic (search for QA, calc for math)
+```
+---
+## Environment variables
+| Variable | Required | Description |
+|---|---|---|
+| `CERAMIC_API_KEY` | Yes (for live search) | Ceramic AI key — `POST /search` endpoint |
+| `SEE_CERAMIC_API_KEY` | Alternative | HF Spaces alias used by SearchEconomicsEnv |
+| `TOGETHER_API_KEY` | Optional | Enables the `llm_reason` tool via Together AI |
+| `HF_TOKEN` | Optional | Required only to load gated datasets (GPQA) |
+If no Ceramic key is set, `ceramic_search` falls back to deterministic offline results; all other tools work without any key.
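The key lookup described in the table could look roughly like the sketch below. The function name is hypothetical and the precedence (primary variable first, then the alias) is an assumption; the actual logic is in `ceramic/client.py`, which is not shown in this part of the diff.

```python
import os
from typing import Mapping, Optional


def resolve_ceramic_key(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first non-empty Ceramic key, or None to signal offline fallback."""
    # Assumed order: the primary variable wins over the HF Spaces alias.
    for name in ("CERAMIC_API_KEY", "SEE_CERAMIC_API_KEY"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None
```

A `None` result would correspond to the offline fallback path for `ceramic_search`.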
+---
+## Running baselines
+```bash
+# From inside claude_toolOrchestrator/
+python -m baselines.random_tool
+python -m baselines.cheapest_first
+python -m baselines.oracle
+```
+---
+## Relation to SearchEconomicsEnv
+| | [SearchEconomicsEnv](https://github.com/sharma-yash01/SearchEconomicsEnv) | ToolOrchestratorEnv |
+|---|---|---|
+| Tools available | 1 (search only) | 6 (search, wiki, calc, code, LLM, commit) |
+| Datasets | HotpotQA | HotpotQA + MATH + GPQA + HumanEval |
+| Budget unit | # of search calls | cost units per tool (tool-specific) |
+| Reward shape | Weitzman search penalty | Same formula, extended to tool costs |
+| Core RL challenge | *How many* searches to do | *Which* tool to call, in which order |
+| Retrieval backend | Ceramic AI | Ceramic AI (shared) |
+---
+## Docker (HuggingFace Spaces)
+```bash
+docker build -t tool-orchestrator-env:latest .
+docker run -p 8000:8000 -e CERAMIC_API_KEY=cer_sk_live_... tool-orchestrator-env:latest
+```
+---
+## Datasets
+- **HotpotQA** — Yang et al., 2018. Multi-hop reasoning over Wikipedia.
+- **MATH** — Hendrycks et al., 2021. Competition math levels 3–5.
+- **GPQA** — Rein et al., 2023. Graduate-level science QA.
+- **HumanEval** — Chen et al., 2021. Python programming tasks.
+---
+## About
+ToolOrchestratorEnv extends SearchEconomicsEnv to a multi-tool setting, framing cost-aware tool selection as the core RL objective. Built for the OpenEnv competition track at AgentX (Berkeley RDI). Ceramic AI search API powers live web retrieval.

app.py ADDED

	@@ -0,0 +1,208 @@

+"""FastAPI server for ToolOrchestratorEnv.
+Exposes the OpenEnv standard endpoints:
+  POST /reset          → OrchestratorObservation + OrchestratorState
+  POST /step           → OrchestratorObservation + reward + done + state
+  GET  /health         → {"status": "ok"}
+  GET  /web            → simple demo UI
+  GET  /docs           → OpenAPI (automatic)
+"""
+from __future__ import annotations
+import os
+import uuid
+from contextlib import asynccontextmanager
+from typing import Any, Dict, Optional
+from fastapi import FastAPI, HTTPException
+from fastapi.responses import HTMLResponse
+from pydantic import BaseModel
+from data.loader import load_all
+from env.config import EnvConfig
+from env.environment import ToolOrchestratorEnvironment
+from env.models import OrchestratorAction
+from tools import build_tool_registry
+# ---------------------------------------------------------------------------
+# Request / response wrappers
+# ---------------------------------------------------------------------------
+class ResetRequest(BaseModel):
+    seed: Optional[int] = None
+    config_overrides: Optional[Dict[str, Any]] = None
+class StepRequest(BaseModel):
+    tool_id: str
+    query:        Optional[str] = None
+    expression:   Optional[str] = None
+    code_snippet: Optional[str] = None
+    answer:       Optional[str] = None
+    metadata:     Optional[Dict[str, Any]] = None
+# ---------------------------------------------------------------------------
+# App factory
+# ---------------------------------------------------------------------------
+def create_app() -> FastAPI:
+    config  = EnvConfig()
+    tools   = build_tool_registry(config)
+    dataset = load_all(split=config.data_split, max_per_domain=200)
+    # Multi-session state: session_id → ToolOrchestratorEnvironment
+    sessions: Dict[str, ToolOrchestratorEnvironment] = {}
+    # Default shared environment for single-session usage (no session_id)
+    default_env = ToolOrchestratorEnvironment(config=config, tools=tools, dataset=dataset)
+    @asynccontextmanager
+    async def lifespan(app: FastAPI):
+        yield
+    app = FastAPI(
+        title="ToolOrchestratorEnv",
+        description="Multi-tool cost-aware RL environment (OpenEnv / AgentX)",
+        version="0.1.0",
+        lifespan=lifespan,
+        root_path=os.environ.get("ROOT_PATH", ""),
+    )
+    @app.get("/health")
+    def health():
+        return {"status": "ok"}
+    @app.post("/reset")
+    def reset(req: ResetRequest):
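+        cfg = EnvConfig()
+        # Apply only overrides that name an existing EnvConfig field;
+        # unknown keys are silently ignored.
+        if req.config_overrides:
+            for k, v in req.config_overrides.items():
+                if hasattr(cfg, k):
+                    setattr(cfg, k, v)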
+        env = ToolOrchestratorEnvironment(config=cfg, tools=tools, dataset=dataset)
+        obs, state = env.reset(seed=req.seed)
+        session_id = str(uuid.uuid4())
+        sessions[session_id] = env
+        return {
+            "session_id":  session_id,
+            "observation": obs.model_dump(),
+            "state":       state.model_dump(),
+        }
+    @app.post("/step")
+    def step(req: StepRequest, session_id: Optional[str] = None):
+        # A missing or unknown session_id falls back to the shared default environment.
+        env = sessions.get(session_id or "", default_env)
+        action = OrchestratorAction(
+            tool_id=req.tool_id,
+            query=req.query or "",
+            expression=req.expression or "",
+            code_snippet=req.code_snippet or "",
+            answer=req.answer or "",
+            metadata=req.metadata,
+        )
+        try:
+            obs, reward, done, state = env.step(action)
+        except RuntimeError as exc:
+            raise HTTPException(status_code=400, detail=str(exc))
+        except ValueError as exc:
+            raise HTTPException(status_code=422, detail=str(exc))
+        # Clean up finished sessions
+        if done and session_id and session_id in sessions:
+            del sessions[session_id]
+        return {
+            "observation": obs.model_dump(),
+            "reward":      reward,
+            "done":        done,
+            "state":       state.model_dump(),
+        }
+    @app.get("/web", response_class=HTMLResponse)
+    def web_ui():
+        return _DEMO_HTML
+    return app
+app = create_app()
+# ---------------------------------------------------------------------------
+# Demo UI
+# ---------------------------------------------------------------------------
+_DEMO_HTML = """<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="UTF-8">
+<title>ToolOrchestratorEnv</title>
+<style>
+  body { font-family: monospace; max-width: 860px; margin: 40px auto; padding: 0 20px; }
+  h1   { color: #333; }
+  pre  { background: #f4f4f4; padding: 12px; border-radius: 6px; overflow-x: auto; }
+  button { padding: 8px 16px; margin: 4px; cursor: pointer; }
+  input, select, textarea { width: 100%; padding: 6px; margin: 4px 0; box-sizing: border-box; }
+  label { font-weight: bold; }
+  .tool-btn { background: #e8f0fe; border: 1px solid #4a90e2; border-radius: 4px; }
+  .tool-btn:hover { background: #cfe1ff; }
+  #log { max-height: 480px; overflow-y: auto; }
+</style>
+</head>
+<body>
+<h1>ToolOrchestratorEnv</h1>
+<p>Multi-tool cost-aware RL environment — AgentX / OpenEnv</p>
+<button onclick="doReset()">Reset Episode</button>
+<hr>
+<label>Tool:</label>
+<select id="tool">
+  <option value="ceramic_search">ceramic_search (cost 1.0) — Web retrieval</option>
+  <option value="wiki_lookup">wiki_lookup (cost 0.5) — Wikipedia</option>
+  <option value="calculator">calculator (cost 0.1) — Arithmetic / math</option>
+  <option value="code_executor">code_executor (cost 0.3) — Python execution</option>
+  <option value="llm_reason">llm_reason (cost 2.0) — LLM chain-of-thought</option>
+  <option value="commit">commit (cost 0.0) — Submit answer</option>
+</select>
+<label>Query / Expression / Code / Answer:</label>
+<textarea id="query" rows="3" placeholder="Enter query or answer..."></textarea>
+<button class="tool-btn" onclick="doStep()">Step</button>
+<hr>
+<pre id="log">Click "Reset Episode" to start.</pre>
+<script>
+const log = document.getElementById('log');
+let sessionId = null;
+function append(text) { log.textContent += text + '\\n---\\n'; log.scrollTop = log.scrollHeight; }
+async function doReset() {
+  log.textContent = '';
+  const res = await fetch('/reset', { method: 'POST', headers: {'Content-Type':'application/json'}, body: JSON.stringify({seed: 42}) });
+  const data = await res.json();
+  sessionId = data.session_id || null;
+  append('RESET session=' + sessionId + '\\n' + JSON.stringify(data, null, 2));
+}
+async function doStep() {
+  const tool_id = document.getElementById('tool').value;
+  const input   = document.getElementById('query').value;
+  const body    = { tool_id };
+  if (tool_id === 'commit')         body.answer = input;
+  else if (tool_id === 'calculator') body.expression = input;
+  else if (tool_id === 'code_executor') body.code_snippet = input;
+  else                              body.query = input;
+  const url = sessionId ? '/step?session_id=' + encodeURIComponent(sessionId) : '/step';
+  const res = await fetch(url, { method: 'POST', headers: {'Content-Type':'application/json'}, body: JSON.stringify(body) });
+  const data = await res.json();
+  append('STEP tool_id=' + tool_id + '\\n' + JSON.stringify(data, null, 2));
+}
+</script>
+</body>
+</html>
+"""

baselines/__init__.py ADDED

File without changes

baselines/cheapest_first.py ADDED

	@@ -0,0 +1,69 @@

+"""Cheapest-first baseline — always calls the cheapest available tool first."""
+from __future__ import annotations
+import sys
+import os
+sys.path.insert(0, os.path.dirname(os.path.dirname(__file__)))
+from data.loader import load_all
+from env.config import EnvConfig
+from env.environment import ToolOrchestratorEnvironment
+from env.models import OrchestratorAction
+from tools import build_tool_registry
+class CheapestFirstBaseline:
+    """Calls tools in ascending cost order; commits after max_steps_per_question - 1 tool calls."""
+    def __init__(self, config: EnvConfig):
+        # Sort non-commit tools by cost
+        self._order = sorted(
+            [t for t in config.tool_costs if t != "commit"],
+            key=lambda t: config.tool_costs[t],
+        )
+        self._commit_after = config.max_steps_per_question - 1
+        self._steps_on_q = 0
+    def get_action(self, obs) -> OrchestratorAction:
+        if self._steps_on_q >= self._commit_after:
+            self._steps_on_q = 0
+            return OrchestratorAction(tool_id="commit", answer="I don't know")
+        tool = self._order[self._steps_on_q % len(self._order)]
+        self._steps_on_q += 1
+        query = obs.question[:100] if hasattr(obs, "question") else ""
+        return OrchestratorAction(tool_id=tool, query=query)
+    def reset(self):
+        self._steps_on_q = 0
+def run_episode(seed: int = 0) -> dict:
+    config = EnvConfig(num_questions=5, total_budget=30.0, seed=seed)
+    tools = build_tool_registry(config)
+    dataset = load_all(max_per_domain=20)
+    env = ToolOrchestratorEnvironment(config=config, tools=tools, dataset=dataset)
+    agent = CheapestFirstBaseline(config)
+    obs, state = env.reset(seed=seed)
+    agent.reset()
+    total_reward = 0.0
+    done = False
+    while not done:
+        action = agent.get_action(obs)
+        if action.tool_id == "commit":
+            agent.reset()
+        obs, reward, done, state = env.step(action)
+        total_reward += reward
+    return {
+        "total_reward": total_reward,
+        "accuracy": state.current_accuracy,
+        "budget_used": state.budget_spent,
+        "questions_answered": state.questions_answered,
+    }
+if __name__ == "__main__":
+    result = run_episode(seed=42)
+    print("CheapestFirstBaseline:", result)
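As a worked example of the ordering this baseline produces, the sketch below replays its per-question logic without the environment. The costs are taken from the README table and `MAX_STEPS_PER_QUESTION = 4` is an assumed value, since `env/config.py` is not shown in this part of the diff.

```python
# Tool costs as listed in the README table; env/config.py may differ.
TOOL_COSTS = {
    "ceramic_search": 1.0,
    "wiki_lookup": 0.5,
    "calculator": 0.1,
    "code_executor": 0.3,
    "llm_reason": 2.0,
    "commit": 0.0,
}

MAX_STEPS_PER_QUESTION = 4  # assumed value, for illustration only


def cheapest_first_plan(costs: dict, max_steps: int) -> list:
    """Tool ids the baseline would emit for one question, ending with commit."""
    order = sorted((t for t in costs if t != "commit"), key=costs.get)
    commit_after = max_steps - 1
    calls = [order[i % len(order)] for i in range(commit_after)]
    return calls + ["commit"]


plan = cheapest_first_plan(TOOL_COSTS, MAX_STEPS_PER_QUESTION)
spend = sum(TOOL_COSTS[t] for t in plan)
print(plan, round(spend, 2))
```

Under these assumptions the plan is `calculator`, `code_executor`, `wiki_lookup`, then `commit`, for a spend of 0.9 per question. Because the baseline as committed always submits "I don't know", it mainly measures the cost side of the reward.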

baselines/oracle.py ADDED

	@@ -0,0 +1,76 @@

+"""Domain-aware oracle baseline — picks the best tool per domain heuristically."""
+from __future__ import annotations
+import sys
+import os
+sys.path.insert(0, os.path.dirname(os.path.dirname(__file__)))
+from data.loader import load_all
+from env.config import EnvConfig
+from env.environment import ToolOrchestratorEnvironment
+from env.models import OrchestratorAction
+from tools import build_tool_registry
+# Heuristic: which tool to try at each step index for each domain
+_DOMAIN_STRATEGY = {
+    "hotpotqa":  ["ceramic_search", "wiki_lookup", "ceramic_search"],
+    "math":      ["calculator",     "llm_reason",  "calculator"],
+    "gpqa":      ["llm_reason",     "ceramic_search", "wiki_lookup"],
+    "humaneval": ["code_executor",  "llm_reason",  "code_executor"],
+}
+_DEFAULT_STRATEGY = ["ceramic_search", "wiki_lookup", "llm_reason"]
+class OracleBaseline:
+    """Uses domain knowledge to pick the best tool per step."""
+    def __init__(self, config: EnvConfig):
+        self._commit_after = config.max_steps_per_question - 1
+        self._steps_on_q = 0
+    def get_action(self, obs) -> OrchestratorAction:
+        if self._steps_on_q >= self._commit_after:
+            self._steps_on_q = 0
+            return OrchestratorAction(tool_id="commit", answer="I don't know")
+        domain = obs.domain if hasattr(obs, "domain") else "hotpotqa"
+        strategy = _DOMAIN_STRATEGY.get(domain, _DEFAULT_STRATEGY)
+        tool = strategy[self._steps_on_q % len(strategy)]
+        self._steps_on_q += 1
+        query = obs.question[:100] if hasattr(obs, "question") else ""
+        return OrchestratorAction(tool_id=tool, query=query)
+    def reset(self):
+        self._steps_on_q = 0
+def run_episode(seed: int = 0) -> dict:
+    config = EnvConfig(num_questions=5, total_budget=30.0, seed=seed)
+    tools = build_tool_registry(config)
+    dataset = load_all(max_per_domain=20)
+    env = ToolOrchestratorEnvironment(config=config, tools=tools, dataset=dataset)
+    agent = OracleBaseline(config)
+    obs, state = env.reset(seed=seed)
+    agent.reset()
+    total_reward = 0.0
+    done = False
+    while not done:
+        action = agent.get_action(obs)
+        if action.tool_id == "commit":
+            agent.reset()
+        obs, reward, done, state = env.step(action)
+        total_reward += reward
+    return {
+        "total_reward": total_reward,
+        "accuracy": state.current_accuracy,
+        "budget_used": state.budget_spent,
+        "questions_answered": state.questions_answered,
+    }
+if __name__ == "__main__":
+    result = run_episode(seed=42)
+    print("OracleBaseline:", result)
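To see how the domain strategies translate into spend, the sketch below prices each strategy per question. The strategies are copied from `_DOMAIN_STRATEGY`, the costs come from the README table, and three tool calls per question is an assumption (it corresponds to `max_steps_per_question = 4`, which is not confirmed by this part of the diff).

```python
# Strategies copied from _DOMAIN_STRATEGY; costs from the README table.
DOMAIN_STRATEGY = {
    "hotpotqa":  ["ceramic_search", "wiki_lookup", "ceramic_search"],
    "math":      ["calculator", "llm_reason", "calculator"],
    "gpqa":      ["llm_reason", "ceramic_search", "wiki_lookup"],
    "humaneval": ["code_executor", "llm_reason", "code_executor"],
}
TOOL_COSTS = {
    "ceramic_search": 1.0,
    "wiki_lookup": 0.5,
    "calculator": 0.1,
    "code_executor": 0.3,
    "llm_reason": 2.0,
}


def cost_per_question(domain: str, tool_calls: int = 3) -> float:
    """Total tool cost before commit; tool_calls=3 is an assumed step limit."""
    strategy = DOMAIN_STRATEGY[domain]
    total = sum(TOOL_COSTS[strategy[i % len(strategy)]] for i in range(tool_calls))
    return round(total, 2)


costs_by_domain = {d: cost_per_question(d) for d in DOMAIN_STRATEGY}
worst_case_episode = round(5 * max(costs_by_domain.values()), 2)  # 5 questions, as in run_episode
print(costs_by_domain, worst_case_episode)
```

If the step limit really is three tool calls, even five GPQA questions would cost 17.5, well under the 30.0 budget used in `run_episode`, so the budget would not bind for this baseline in that configuration.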

baselines/random_tool.py ADDED

	@@ -0,0 +1,67 @@

+"""Random tool baseline — picks uniformly from available tools each step."""
+from __future__ import annotations
+import random
+import sys
+import os
+sys.path.insert(0, os.path.dirname(os.path.dirname(__file__)))
+from data.loader import load_all
+from env.config import EnvConfig
+from env.environment import ToolOrchestratorEnvironment
+from env.models import TOOL_IDS, OrchestratorAction
+from tools import build_tool_registry
+_NON_COMMIT = [t for t in TOOL_IDS if t != "commit"]
+class RandomToolBaseline:
+    """Selects a random non-commit tool each step; commits after `commit_after` tool calls."""
+    def __init__(self, commit_after: int = 3):
+        self.commit_after = commit_after
+        self._steps_on_q = 0
+    def get_action(self, obs) -> OrchestratorAction:
+        if self._steps_on_q >= self.commit_after:
+            self._steps_on_q = 0
+            return OrchestratorAction(tool_id="commit", answer="I don't know")
+        self._steps_on_q += 1
+        tool = random.choice(_NON_COMMIT)
+        query = obs.question[:100] if hasattr(obs, "question") else ""
+        return OrchestratorAction(tool_id=tool, query=query)
+    def reset(self):
+        self._steps_on_q = 0
+def run_episode(seed: int = 0) -> dict:
+    config = EnvConfig(num_questions=5, total_budget=30.0, seed=seed)
+    tools = build_tool_registry(config)
+    dataset = load_all(max_per_domain=20)
+    env = ToolOrchestratorEnvironment(config=config, tools=tools, dataset=dataset)
+    agent = RandomToolBaseline(commit_after=config.max_steps_per_question - 1)
+    obs, state = env.reset(seed=seed)
+    agent.reset()
+    total_reward = 0.0
+    done = False
+    while not done:
+        action = agent.get_action(obs)
+        if action.tool_id == "commit":
+            agent.reset()
+        obs, reward, done, state = env.step(action)
+        total_reward += reward
+    return {
+        "total_reward": total_reward,
+        "accuracy": state.current_accuracy,
+        "budget_used": state.budget_spent,
+        "questions_answered": state.questions_answered,
+    }
+if __name__ == "__main__":
+    result = run_episode(seed=42)
+    print("RandomToolBaseline:", result)

ceramic/__init__.py ADDED Viewed

	@@ -0,0 +1,3 @@


+from .client import CeramicClient, FallbackCeramicClient, SearchResult, get_ceramic_client
+__all__ = ["CeramicClient", "FallbackCeramicClient", "SearchResult", "get_ceramic_client"]

ceramic/client.py ADDED Viewed

	@@ -0,0 +1,133 @@

+"""Ceramic AI search client.
+Matches the interface used by SearchEconomicsEnv so both environments
+share the same retrieval backend.
+API key priority:
+  1. CERAMIC_API_KEY  env var
+  2. SEE_CERAMIC_API_KEY env var  (HF Spaces compatibility with SearchEcon)
+  3. Falls back to FallbackCeramicClient  (offline / CI, fully deterministic)
+Ceramic API notes (verified 2025):
+  - Endpoint : POST https://api.ceramic.ai/search
+  - Body     : {"query": "<string>"}   (no pagination params supported)
+  - Response : {"requestId": "...", "result": {"results": [...], "totalResults": N}}
+  - Each result has: title, url, description, score
+  - Always returns up to 10 results per call
+"""
+from __future__ import annotations
+import hashlib
+import os
+from dataclasses import dataclass
+from typing import List
+import httpx
+# ---------------------------------------------------------------------------
+# Result model
+# ---------------------------------------------------------------------------
+@dataclass
+class SearchResult:
+    title: str
+    url: str
+    description: str
+    score: float = 0.0
+# ---------------------------------------------------------------------------
+# Live client
+# ---------------------------------------------------------------------------
+class CeramicClient:
+    """Thin wrapper around the Ceramic search API."""
+    BASE_URL = "https://api.ceramic.ai"
+    def __init__(self, api_key: str):
+        self._key = api_key
+        self._client = httpx.Client(
+            headers={"Authorization": f"Bearer {api_key}"},
+            timeout=10.0,
+        )
+    def search(self, query: str, top_k: int = 5) -> List[SearchResult]:
+        """Search Ceramic and return up to top_k results (max 10)."""
+        if not query.strip():
+            return []
+        resp = self._client.post(
+            f"{self.BASE_URL}/search",
+            json={"query": query},
+        )
+        resp.raise_for_status()
+        data = resp.json()
+        raw = data.get("result", {}).get("results", [])
+        results = []
+        for item in raw[:top_k]:
+            results.append(SearchResult(
+                title=item.get("title", ""),
+                url=item.get("url", ""),
+                description=item.get("description", ""),
+                score=float(item.get("score", 0.0)),
+            ))
+        return results
+    def close(self):
+        self._client.close()
+    def __enter__(self):
+        return self
+    def __exit__(self, *args):
+        self.close()
+# ---------------------------------------------------------------------------
+# Offline fallback
+# ---------------------------------------------------------------------------
+class FallbackCeramicClient:
+    """Deterministic offline client used when no API key is available.
+    Generates reproducible fake results via SHA-256 hashing so tests
+    and CI runs are stable without network access.
+    """
+    def search(self, query: str, top_k: int = 5) -> List[SearchResult]:
+        h = int(hashlib.sha256(query.encode()).hexdigest(), 16)
+        results = []
+        for i in range(min(top_k, 3)):
+            seed = (h + i) % 10_000
+            results.append(SearchResult(
+                title=f"Result {seed}: {query[:40]}",
+                url=f"https://fallback.example.com/doc/{seed}",
+                description=f"Offline fallback result #{i+1} for query: {query}",
+                score=round(0.9 - i * 0.15, 3),
+            ))
+        return results
+    def close(self):
+        pass
+    def __enter__(self):
+        return self
+    def __exit__(self, *args):
+        pass
+# ---------------------------------------------------------------------------
+# Factory
+# ---------------------------------------------------------------------------
+def get_ceramic_client() -> CeramicClient | FallbackCeramicClient:
+    """Return a live CeramicClient if a key is set, otherwise FallbackCeramicClient."""
+    key = (
+        os.environ.get("CERAMIC_API_KEY")
+        or os.environ.get("SEE_CERAMIC_API_KEY")
+    )
+    if key:
+        return CeramicClient(api_key=key)
+    return FallbackCeramicClient()
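The response shape described in the module docstring nests the hits under `result.results`. The sketch below isolates that parsing step so it can be exercised without network access. It is an illustration only: `parse_search_response` is a hypothetical helper name and `SAMPLE` is a made-up payload that follows the documented shape, not real API output.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class SearchResult:
    title: str
    url: str
    description: str
    score: float = 0.0


def parse_search_response(data: Dict[str, Any], top_k: int = 5) -> List[SearchResult]:
    """Turn a decoded JSON body into at most top_k SearchResult objects."""
    # Missing keys degrade to an empty list instead of raising.
    raw = data.get("result", {}).get("results", [])
    return [
        SearchResult(
            title=item.get("title", ""),
            url=item.get("url", ""),
            description=item.get("description", ""),
            score=float(item.get("score", 0.0)),
        )
        for item in raw[:top_k]
    ]


# Made-up payload in the documented shape (assumption, not real output).
SAMPLE = {
    "requestId": "req-0",
    "result": {
        "results": [
            {"title": "A", "url": "https://example.com/a", "description": "first", "score": 0.91},
            {"title": "B", "url": "https://example.com/b", "description": "second", "score": 0.74},
            {"title": "C", "url": "https://example.com/c", "description": "third"},
        ],
        "totalResults": 3,
    },
}

if __name__ == "__main__":
    for hit in parse_search_response(SAMPLE, top_k=2):
        print(hit.title, hit.score)
```

Because the endpoint is documented as taking no pagination parameters, truncating to `top_k` on the client side, as `CeramicClient.search` does, is the only way to limit the number of results.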

client.py ADDED Viewed

	@@ -0,0 +1,123 @@

+"""Compatibility shim — real code lives in ceramic/client.py.
+Mirrors the SearchEconomicsEnv CeramicClient interface so the two
+environments share the same retrieval backend.
+Priority for the API key:
+  1. CERAMIC_API_KEY env var
+  2. SEE_CERAMIC_API_KEY env var (HF Spaces compatibility)
+  3. Falls back to FallbackCeramicClient (offline, deterministic)
+"""
+from __future__ import annotations
+import hashlib
+import os
+import time
+from dataclasses import dataclass, field
+from typing import List, Optional
+import httpx
+# ---------------------------------------------------------------------------
+# Result model
+# ---------------------------------------------------------------------------
+@dataclass
+class SearchResult:
+    title: str
+    url: str
+    description: str
+    score: float = 0.0
+# ---------------------------------------------------------------------------
+# Live client
+# ---------------------------------------------------------------------------
+class CeramicClient:
+    """Thin wrapper around the Ceramic search API."""
+    BASE_URL = "https://api.ceramic.ai/v1"
+    def __init__(self, api_key: str):
+        self._key = api_key
+        self._client = httpx.Client(
+            headers={"Authorization": f"Bearer {api_key}"},
+            timeout=10.0,
+        )
+    def search(self, query: str, top_k: int = 5) -> List[SearchResult]:
+        resp = self._client.post(
+            f"{self.BASE_URL}/search",
+            json={"query": query, "top_k": top_k},
+        )
+        resp.raise_for_status()
+        data = resp.json()
+        results = []
+        for item in data.get("results", []):
+            results.append(SearchResult(
+                title=item.get("title", ""),
+                url=item.get("url", ""),
+                description=item.get("description", ""),
+                score=float(item.get("score", 0.0)),
+            ))
+        return results
+    def close(self):
+        self._client.close()
+    def __enter__(self):
+        return self
+    def __exit__(self, *args):
+        self.close()
+# ---------------------------------------------------------------------------
+# Offline fallback
+# ---------------------------------------------------------------------------
+class FallbackCeramicClient:
+    """Deterministic offline client — used when no API key is set."""
+    def search(self, query: str, top_k: int = 5) -> List[SearchResult]:
+        # Stable hash → reproducible fake results per query
+        h = int(hashlib.sha256(query.encode()).hexdigest(), 16)
+        results = []
+        for i in range(min(top_k, 3)):
+            seed = (h + i) % 10_000
+            results.append(SearchResult(
+                title=f"Result {seed}: {query[:40]}",
+                url=f"https://fallback.example.com/doc/{seed}",
+                description=f"Offline fallback result #{i+1} for query: {query}",
+                score=round(0.9 - i * 0.15, 3),
+            ))
+        return results
+    def close(self):
+        pass
+    def __enter__(self):
+        return self
+    def __exit__(self, *args):
+        pass
+# ---------------------------------------------------------------------------
+# Factory
+# ---------------------------------------------------------------------------
+def get_ceramic_client() -> CeramicClient | FallbackCeramicClient:
+    """Return a live CeramicClient if a key is set, otherwise FallbackCeramicClient."""
+    # No hard-coded default key: credentials come from the environment only,
+    # so the offline fallback stays reachable when no key is configured.
+    key = (
+        os.environ.get("CERAMIC_API_KEY")
+        or os.environ.get("SEE_CERAMIC_API_KEY")
+    )
+    if key:
+        return CeramicClient(api_key=key)
+    return FallbackCeramicClient()

data/__init__.py ADDED Viewed

File without changes

data/loader.py ADDED Viewed

	@@ -0,0 +1,270 @@

+"""Multi-domain dataset loader for ToolOrchestratorEnv.
+Returns a flat list of question dicts, each with a 'domain' key.
+Adapted from CostAwareToolEnv/scripts/process_datasets.py.
+"""
+from __future__ import annotations
+import random
+import re
+import string
+from typing import Any, Dict, List, Optional
+# ---------------------------------------------------------------------------
+# HuggingFace loader helper
+# ---------------------------------------------------------------------------
+def _hf_load(repo_id: str, config: Optional[str], split: str):
+    import datasets as hf
+    kwargs: Dict[str, Any] = {"split": split, "trust_remote_code": True}
+    if config:
+        kwargs["name"] = config
+    return hf.load_dataset(repo_id, **kwargs)
+# ---------------------------------------------------------------------------
+# MATH (levels 3-5)
+# ---------------------------------------------------------------------------
+def _extract_boxed(solution: str):
+    for cmd in ("boxed", "fbox"):
+        marker = f"\\{cmd}" + "{"
+        start = solution.rfind(marker)
+        if start == -1:
+            continue
+        idx = start + len(marker) - 1
+        depth = 0
+        for i in range(idx, len(solution)):
+            if solution[i] == "{":
+                depth += 1
+            elif solution[i] == "}":
+                depth -= 1
+                if depth == 0:
+                    return solution[idx + 1:i].strip()
+    # fallback: last non-empty line
+    lines = [l.strip() for l in solution.splitlines() if l.strip()]
+    return lines[-1] if lines else ""
+def _load_math(split: str, max_rows: int) -> List[Dict]:
+    candidates = [
+        ("DigitalLearningGmbH/MATH-lighteval", "default", "train"),
+        ("lighteval/MATH-Hard", "default", "train"),
+        ("hendrycks/competition_math", None, "train"),
+    ]
+    dataset = None
+    for repo_id, cfg, spl in candidates:
+        try:
+            dataset = _hf_load(repo_id, cfg, spl)
+            break
+        except Exception:
+            continue
+    if dataset is None:
+        return []
+    rows = []
+    for ex in dataset:
+        level_text = str(ex.get("level", ""))
+        m = re.search(r"(\d+)", level_text)
+        if not m or int(m.group(1)) not in (3, 4, 5):
+            continue
+        answer = _extract_boxed(str(ex.get("solution", "")))
+        rows.append({
+            "question": str(ex.get("problem", "")).strip(),
+            "answer": answer,
+            "domain": "math",
+            "difficulty": m.group(1),
+            "subject": str(ex.get("type", "")),
+            "source": "math",
+        })
+        if len(rows) >= max_rows:
+            break
+    return rows
+# ---------------------------------------------------------------------------
+# HotpotQA
+# ---------------------------------------------------------------------------
+def _load_hotpotqa(split: str, max_rows: int) -> List[Dict]:
+    hf_split = "train" if split in ("train", "validation") else split
+    dataset = None
+    for cfg in ("distractor", "fullwiki"):
+        try:
+            dataset = _hf_load("hotpotqa/hotpot_qa", cfg, hf_split)
+            break
+        except Exception:
+            continue
+    if dataset is None:
+        return []
+    subset = dataset.shuffle(seed=42).select(range(min(max_rows, len(dataset))))
+    rows = []
+    for ex in subset:
+        rows.append({
+            "question": str(ex.get("question", "")).strip(),
+            "answer": str(ex.get("answer", "")).strip(),
+            "domain": "hotpotqa",
+            "difficulty": str(ex.get("level", "")),
+            "type": str(ex.get("type", "")),
+            "source": "hotpotqa",
+        })
+    return rows
+# ---------------------------------------------------------------------------
+# GPQA
+# ---------------------------------------------------------------------------
+def _resolve_gpqa_answer(ex: Dict) -> str:
+    val = str(ex.get("Correct Answer", "")).strip()
+    if val.upper() in {"A", "B", "C", "D"}:
+        mapping = {
+            "A": str(ex.get("Answer A", "")),
+            "B": str(ex.get("Answer B", "")),
+            "C": str(ex.get("Answer C", "")),
+            "D": str(ex.get("Answer D", "")),
+        }
+        return mapping.get(val.upper(), val).strip()
+    return val
+def _load_gpqa(split: str, max_rows: int) -> List[Dict]:
+    dataset = None
+    for repo in ("Idavidrein/gpqa", "Wanfq/gpqa"):
+        for cfg in ("gpqa_diamond", "gpqa_main"):
+            try:
+                dataset = _hf_load(repo, cfg, "train")
+                break
+            except Exception:
+                continue
+        if dataset is not None:
+            break
+    if dataset is None:
+        return []
+    rows = []
+    for ex in dataset:
+        answer = _resolve_gpqa_answer(ex)
+        rows.append({
+            "question": str(ex.get("Question", "")).strip(),
+            "answer": answer,
+            "domain": "gpqa",
+            "difficulty": "graduate",
+            "source": "gpqa",
+        })
+        if len(rows) >= max_rows:
+            break
+    return rows
+# ---------------------------------------------------------------------------
+# HumanEval
+# ---------------------------------------------------------------------------
+def _load_humaneval(split: str, max_rows: int) -> List[Dict]:
+    dataset = None
+    for repo in ("openai/openai_humaneval", "openai/human-eval"):
+        try:
+            dataset = _hf_load(repo, None, "test")
+            break
+        except Exception:
+            continue
+    if dataset is None:
+        return []
+    rows = []
+    for ex in dataset:
+        rows.append({
+            "question": str(ex.get("prompt", "")).strip(),
+            "answer": str(ex.get("canonical_solution", "")).strip(),
+            "domain": "humaneval",
+            "difficulty": "code",
+            "task_id": str(ex.get("task_id", "")),
+            "test": str(ex.get("test", "")),
+            "entry_point": str(ex.get("entry_point", "")),
+            "source": "humaneval",
+        })
+        if len(rows) >= max_rows:
+            break
+    return rows
+# ---------------------------------------------------------------------------
+# Synthetic fallback (offline / CI)
+# ---------------------------------------------------------------------------
+_SYNTHETIC_TEMPLATES = [
+    ("What is {a} + {b}?", "{c}", "math"),
+    ("Who wrote {work}?", "{author}", "hotpotqa"),
+    ("Solve for x: {a}x + {b} = {c}", "{x}", "math"),
+    ("What is the capital of {country}?", "{capital}", "hotpotqa"),
+]
+_SYNTHETIC_DATA = [
+    {"a": 12, "b": 7, "c": 19, "work": "Hamlet", "author": "Shakespeare",
+     "country": "France", "capital": "Paris", "x": 3},
+    {"a": 25, "b": 13, "c": 38, "work": "1984", "author": "George Orwell",
+     "country": "Germany", "capital": "Berlin", "x": 5},
+    {"a": 100, "b": 44, "c": 144, "work": "The Odyssey", "author": "Homer",
+     "country": "Japan", "capital": "Tokyo", "x": 7},
+]
+def _synthetic_questions(n: int) -> List[Dict]:
+    rows = []
+    for i in range(n):
+        tmpl, ans_tmpl, domain = _SYNTHETIC_TEMPLATES[i % len(_SYNTHETIC_TEMPLATES)]
+        data = _SYNTHETIC_DATA[i % len(_SYNTHETIC_DATA)]
+        try:
+            question = tmpl.format(**data)
+            answer = ans_tmpl.format(**data)
+        except KeyError:
+            question = f"Synthetic question {i}"
+            answer = f"answer_{i}"
+        rows.append({
+            "question": question,
+            "answer": str(answer),
+            "domain": domain,
+            "difficulty": "easy",
+            "source": "synthetic",
+        })
+    return rows
+# ---------------------------------------------------------------------------
+# Public API
+# ---------------------------------------------------------------------------
+_LOADERS = {
+    "hotpotqa": _load_hotpotqa,
+    "math":     _load_math,
+    "gpqa":     _load_gpqa,
+    "humaneval": _load_humaneval,
+}
+def load_all(split: str = "validation", max_per_domain: int = 200) -> List[Dict]:
+    """Load all four domains and return a flat list with 'domain' keys.
+    Falls back to synthetic questions if a domain is unavailable.
+    """
+    all_questions: List[Dict] = []
+    for domain, loader_fn in _LOADERS.items():
+        try:
+            rows = loader_fn(split, max_per_domain)
+            if rows:
+                all_questions.extend(rows)
+                print(f"[loader] {domain}: {len(rows)} questions")
+            else:
+                raise ValueError("empty")
+        except Exception as exc:
+            print(f"[loader] {domain} unavailable ({exc}), using synthetic fallback")
+            synth = _synthetic_questions(max(5, max_per_domain // 10))
+            for q in synth:
+                q["domain"] = domain
+            all_questions.extend(synth)
+    random.shuffle(all_questions)
+    return all_questions
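The brace matching in `_extract_boxed` is easier to follow on a concrete string. The sketch below is a simplified, standalone variant written for illustration. It handles only `\boxed{...}` and returns an empty string when nothing is found, whereas the loader also tries `\fbox` and falls back to the last non-empty line. The function name `extract_boxed` is hypothetical.

```python
def extract_boxed(solution: str) -> str:
    """Return the contents of the last \\boxed{...}, honouring nested braces."""
    marker = "\\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return ""
    open_idx = start + len(marker) - 1  # index of the opening "{"
    depth = 0
    for i in range(open_idx, len(solution)):
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Everything strictly between the matched braces.
                return solution[open_idx + 1:i].strip()
    return ""  # unbalanced braces


if __name__ == "__main__":
    print(extract_boxed(r"Thus the probability is $\boxed{\frac{3}{4}}$."))
```

Tracking depth rather than searching for the first `}` is what keeps nested groups such as `\frac{3}{4}` intact.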

env/__init__.py ADDED Viewed

File without changes

env/answer_grading.py ADDED Viewed

	@@ -0,0 +1,91 @@

+"""Answer grading utilities: exact match + token F1.
+Ported from SearchEconomicsEnv/env/answer_grading.py and adapted for
+multi-domain use (HotpotQA-style EM/F1 + code/math fallback).
+"""
+from __future__ import annotations
+import json
+import re
+import string
+from collections import Counter
+from typing import Tuple
+# ---------------------------------------------------------------------------
+# Normalisation
+# ---------------------------------------------------------------------------
+def normalize_answer(text: str) -> list[str]:
+    """Lowercase, strip articles/punctuation, tokenise."""
+    text = text.lower().strip()
+    # Remove articles
+    text = re.sub(r"\b(a|an|the)\b", " ", text)
+    # Remove punctuation
+    text = text.translate(str.maketrans("", "", string.punctuation))
+    return text.split()
+# ---------------------------------------------------------------------------
+# Metrics
+# ---------------------------------------------------------------------------
+def exact_match(pred: str, gold: str) -> bool:
+    return normalize_answer(pred) == normalize_answer(gold)
+def token_f1(pred: str, gold: str) -> float:
+    pred_tokens = normalize_answer(pred)
+    gold_tokens = normalize_answer(gold)
+    if not pred_tokens or not gold_tokens:
+        return float(pred_tokens == gold_tokens)
+    common = Counter(pred_tokens) & Counter(gold_tokens)
+    num_common = sum(common.values())
+    if num_common == 0:
+        return 0.0
+    precision = num_common / len(pred_tokens)
+    recall    = num_common / len(gold_tokens)
+    return 2 * precision * recall / (precision + recall)
+# ---------------------------------------------------------------------------
+# Answer extraction
+# ---------------------------------------------------------------------------
+def extract_answer(raw: str) -> str:
+    """Pull the answer string out of various agent output formats."""
+    # Strip markdown fences
+    raw = re.sub(r"```[a-z]*\n?", "", raw).strip()
+    # Try JSON {"answer": ...}
+    try:
+        parsed = json.loads(raw)
+        if isinstance(parsed, dict):
+            for key in ("answer", "Answer", "result", "Result"):
+                if key in parsed:
+                    return str(parsed[key]).strip()
+    except (json.JSONDecodeError, ValueError):
+        pass
+    # Prefix patterns
+    for prefix in ("Answer:", "Final answer:", "Result:", "Output:"):
+        idx = raw.lower().find(prefix.lower())
+        if idx != -1:
+            return raw[idx + len(prefix):].strip().split("\n")[0].strip()
+    # Last non-empty line
+    lines = [line.strip() for line in raw.splitlines() if line.strip()]
+    return lines[-1] if lines else raw.strip()
+# ---------------------------------------------------------------------------
+# Public entry point
+# ---------------------------------------------------------------------------
+def grade(predicted: str, ground_truth: str) -> Tuple[bool, float, float]:
+    """Return (exact_match, f1, quality) where quality ∈ [0, 1]."""
+    extracted = extract_answer(predicted)
+    em = exact_match(extracted, ground_truth)
+    f1 = token_f1(extracted, ground_truth)
+    quality = 1.0 if em else f1
+    return em, f1, quality
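A worked example helps show how exact match and token F1 interact. The sketch below re-declares `normalize_answer` and `token_f1` with the same logic as above so that it runs on its own. The sample prediction and gold answer are made up for illustration.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> list:
    """Lowercase, drop articles and punctuation, split on whitespace."""
    text = text.lower().strip()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()


def token_f1(pred: str, gold: str) -> float:
    pred_tokens = normalize_answer(pred)
    gold_tokens = normalize_answer(gold)
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_common = sum(common.values())
    if num_common == 0:
        return 0.0
    precision = num_common / len(pred_tokens)
    recall = num_common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    pred, gold = "The Eiffel Tower in Paris", "Eiffel Tower"
    # Prediction tokens: eiffel, tower, in, paris. Gold tokens: eiffel, tower.
    # precision = 2/4, recall = 2/2, so F1 = 2 * 0.5 * 1.0 / 1.5 = 2/3.
    print(normalize_answer(pred) == normalize_answer(gold), round(token_f1(pred, gold), 3))
```

Here the prediction is not an exact match, so `grade` would report a quality of about 0.667. With the default `grade_count_correct_mode="em_or_f1"` and `f1_count_threshold=0.5`, that answer would still count as correct.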

env/config.py ADDED Viewed

	@@ -0,0 +1,53 @@

+"""Configuration for ToolOrchestratorEnv.
+All tuneable parameters live here so training scripts, the server, and
+baselines all read from a single source of truth.  Override individual
+fields in /reset via config_overrides, or subclass for experiment sweeps.
+"""
+from __future__ import annotations
+from dataclasses import dataclass, field
+from typing import Dict, Optional
+@dataclass
+class EnvConfig:
+    # ── Episode structure ────────────────────────────────────────────────────
+    total_budget: float = 50.0          # Total cost units for the whole episode
+    num_questions: int = 10             # Questions drawn per episode
+    max_steps_per_question: int = 8     # Auto-commit after this many tool calls
+    data_split: str = "validation"      # HuggingFace dataset split to load
+    seed: Optional[int] = None          # Global RNG seed (None = random)
+    shuffle_questions: bool = True      # Shuffle sampled questions each episode
+    # ── Domain mix ──────────────────────────────────────────────────────────
+    # Fraction of questions drawn from each dataset. Must sum to ~1.0.
+    domain_mix: Dict[str, float] = field(default_factory=lambda: {
+        "hotpotqa":  0.4,   # Multi-hop factual QA
+        "math":      0.3,   # Competition math (levels 3-5)
+        "gpqa":      0.2,   # Graduate-level science
+        "humaneval": 0.1,   # Python programming tasks
+    })
+    # ── Tool costs ──────────────────────────────────────────────────────────
+    # Budget units consumed per tool call.  Commit is always free.
+    tool_costs: Dict[str, float] = field(default_factory=lambda: {
+        "ceramic_search": 1.0,
+        "wiki_lookup":    0.5,
+        "calculator":     0.1,
+        "code_executor":  0.3,
+        "llm_reason":     2.0,
+        "commit":         0.0,
+    })
+    # ── Reward shaping ───────────────────────────────────────────────────────
+    correct_reward: float = 1.0             # Base reward for a correct commit
+    incorrect_reward: float = -0.5          # Base reward for a wrong commit
+    efficiency_bonus_weight: float = 0.1    # γ: scales the efficiency bonus
+    efficiency_bonus_threshold: float = 0.5 # Minimum quality to earn the bonus
+    # ── Grading ─────────────────────────────────────────────────────────────
+    # "em_only"  → only exact match counts as correct
+    # "em_or_f1" → token F1 ≥ f1_count_threshold also counts as correct
+    grade_count_correct_mode: str = "em_or_f1"
+    f1_count_threshold: float = 0.5
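The requirement that `domain_mix` sums to roughly 1.0 matters as soon as the fractions are turned into whole question counts. The sketch below shows one way to do that with largest-remainder rounding. It is an assumption for illustration only: the environment's own `_sample_questions` is not shown in this section and may work differently, and `allocate_counts` is a hypothetical helper.

```python
from typing import Dict


def allocate_counts(domain_mix: Dict[str, float], num_questions: int) -> Dict[str, int]:
    """Split num_questions across domains in proportion to domain_mix."""
    total = sum(domain_mix.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"domain_mix must sum to 1.0, got {total:.4f}")
    exact = {d: w * num_questions for d, w in domain_mix.items()}
    counts = {d: int(v) for d, v in exact.items()}  # floor of each share
    leftover = num_questions - sum(counts.values())
    # Hand the remaining questions to the domains with the largest remainders.
    by_remainder = sorted(exact, key=lambda d: exact[d] - counts[d], reverse=True)
    for d in by_remainder[:leftover]:
        counts[d] += 1
    return counts


if __name__ == "__main__":
    mix = {"hotpotqa": 0.4, "math": 0.3, "gpqa": 0.2, "humaneval": 0.1}
    print(allocate_counts(mix, 10))
    print(allocate_counts(mix, 5))
```

With the default mix and `num_questions=10` the split is exact. With the five-question episodes used by the baselines at least one domain has to be rounded, which is why small episodes may contain no humaneval questions at all under a scheme like this.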

env/environment.py ADDED Viewed

	@@ -0,0 +1,299 @@

+"""Core RL environment: ToolOrchestratorEnvironment.
+Step logic:
+  - Agent receives an OrchestratorObservation with the current question,
+    budget, context, and available tools.
+  - Agent picks a tool_id and optional query / code_snippet / answer.
+  - Environment dispatches to the appropriate tool, charges cost, appends
+    result to context_window, and returns the next observation + reward.
+  - Episode ends when budget is exhausted OR all questions are answered.
+"""
+from __future__ import annotations
+import time
+import uuid
+from typing import Any, Dict, List, Optional, Tuple
+from .answer_grading import grade
+from .config import EnvConfig
+from .models import (
+    OrchestratorAction,
+    OrchestratorObservation,
+    OrchestratorState,
+    ToolResult,
+    TOOL_IDS,
+)
+from .reward import commit_reward, step_reward
+class ToolOrchestratorEnvironment:
+    """
+    OpenEnv-compatible RL environment for multi-tool cost-aware QA.
+    Supports external tool injection so the server can wire in live
+    Ceramic, code executor, etc.  Tools are callables with signature:
+        tool_fn(action: OrchestratorAction) -> ToolResult
+    """
+    def __init__(
+        self,
+        config: Optional[EnvConfig] = None,
+        tools: Optional[Dict[str, Any]] = None,
+        dataset: Optional[List[Dict[str, Any]]] = None,
+    ):
+        self.config  = config or EnvConfig()
+        self.tools   = tools or {}      # tool_id -> callable
+        self.dataset = dataset or []    # List of {question, answer, domain, ...}
+        self._state: Optional[OrchestratorState] = None
+        self._questions: List[Dict[str, Any]] = []
+        self._current_q_idx: int = 0
+        self._context_window: List[str] = []
+        self._tools_used_this_q: List[str] = []
+        self._steps_this_q: int = 0
+        self._episode_done: bool = False
+    # -----------------------------------------------------------------------
+    # Reset
+    # -----------------------------------------------------------------------
+    def reset(self, seed: Optional[int] = None) -> Tuple[OrchestratorObservation, OrchestratorState]:
+        import random
+        effective_seed = seed if seed is not None else self.config.seed
+        rng = random.Random(effective_seed)
+        questions = _sample_questions(self.dataset, self.config, rng)
+        self._questions = questions
+        self._current_q_idx = 0
+        self._episode_done = False
+        self._state = OrchestratorState(
+            episode_id=str(uuid.uuid4()),
+            total_budget=self.config.total_budget,
+            budget_spent=0,
+            questions_answered=0,
+            total_correct=0,
+            current_accuracy=0.0,
+            budget_remaining_ratio=1.0,
+            current_question_idx=0,
+            current_question_steps=0,
+            step_count=0,
+        )
+        self._context_window = []
+        self._tools_used_this_q = []
+        self._steps_this_q = 0
+        obs = self._make_obs(reward=None, question_done=False, done=False)
+        return obs, self._state
+    # -----------------------------------------------------------------------
+    # Step
+    # -----------------------------------------------------------------------
+    def step(
+        self, action: OrchestratorAction
+    ) -> Tuple[OrchestratorObservation, float, bool, OrchestratorState]:
+        if self._episode_done:
+            raise RuntimeError("Episode is done. Call reset() first.")
+        tool_id = action.tool_id
+        if tool_id not in TOOL_IDS:
+            raise ValueError(f"Unknown tool_id: {tool_id!r}. Valid: {TOOL_IDS}")
+        state   = self._state
+        config  = self.config
+        if self._current_q_idx >= len(self._questions):
+            self._episode_done = True
+            raise RuntimeError("Episode is done. Call reset() first.")
+        q_entry = self._questions[self._current_q_idx]
+        gold    = q_entry["answer"]
+        # ---- Commit ---------------------------------------------------
+        if tool_id == "commit":
+            raw_pred = action.answer or ""
+            em, f1, quality = grade(raw_pred, gold)
+            r = commit_reward(
+                quality=quality,
+                budget_remaining_ratio=state.budget_remaining_ratio,
+                config=config,
+            )
+            is_correct = (
+                em if config.grade_count_correct_mode == "em_only"
+                else (em or f1 >= config.f1_count_threshold)
+            )
+            state.questions_answered += 1
+            state.total_correct += int(is_correct)
+            state.current_accuracy = state.total_correct / state.questions_answered
+            self._current_q_idx += 1
+            self._context_window = []
+            self._tools_used_this_q = []
+            self._steps_this_q = 0
+            episode_done = (
+                self._current_q_idx >= len(self._questions)
+                or state.budget_spent >= state.total_budget
+            )
+            self._episode_done = episode_done
+            state.current_question_idx = self._current_q_idx
+            state.current_question_steps = 0
+            obs = self._make_obs(
+                reward=r,
+                question_done=True,
+                done=episode_done,
+                last_tool_result=ToolResult(
+                    tool_id="commit", cost=0,
+                    output=f"EM={em} F1={f1:.3f} quality={quality:.3f}"
+                ),
+            )
+            return obs, r, episode_done, state
+        # ---- Tool call ------------------------------------------------
+        cost = config.tool_costs.get(tool_id, 0)
+        budget_after = state.budget_spent + cost
+        if budget_after > state.total_budget:
+            r = config.incorrect_reward
+            self._episode_done = True
+            obs = self._make_obs(reward=r, question_done=True, done=True)
+            return obs, r, True, state
+        t0 = time.perf_counter()
+        tool_fn = self.tools.get(tool_id)
+        if tool_fn is None:
+            tool_result = ToolResult(
+                tool_id=tool_id, cost=cost,
+                output="[Tool not available in this environment]",
+                latency_s=0.0,
+                error="not_available",
+            )
+        else:
+            try:
+                tool_result = tool_fn(action)
+                tool_result.cost = cost
+            except Exception as exc:
+                tool_result = ToolResult(
+                    tool_id=tool_id, cost=cost,
+                    output=f"[Error: {exc}]",
+                    latency_s=time.perf_counter() - t0,
+                    error=str(exc),
+                )
+        tool_result.latency_s = time.perf_counter() - t0
+        state.budget_spent = budget_after
+        state.budget_remaining_ratio = max(
+            0.0, (state.total_budget - state.budget_spent) / state.total_budget
+        )
+        state.step_count += 1
+        state.current_question_steps += 1
+        self._steps_this_q += 1
+        self._tools_used_this_q.append(tool_id)
+        self._context_window.append(f"[{tool_id}] {tool_result.output}")
+        r = step_reward(tool_id, config)
+        question_done = self._steps_this_q >= config.max_steps_per_question
+        episode_done  = (
+            state.budget_spent >= state.total_budget
+            or (question_done and self._current_q_idx + 1 >= len(self._questions))
+        )
+        if question_done and not episode_done:
+            self._current_q_idx += 1
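+            state.questions_answered += 1
+            # Timed-out question counts as answered but incorrect; refresh running accuracy.
+            state.current_accuracy = state.total_correct / state.questions_answered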
+            self._context_window = []
+            self._tools_used_this_q = []
+            self._steps_this_q = 0
+            state.current_question_idx = self._current_q_idx
+            state.current_question_steps = 0
+        self._episode_done = episode_done
+        obs = self._make_obs(
+            reward=r,
+            question_done=question_done,
+            done=episode_done,
+            last_tool_result=tool_result,
+        )
+        return obs, r, episode_done, state
+    # -----------------------------------------------------------------------
+    # Internal helpers
+    # -----------------------------------------------------------------------
+    def _make_obs(
+        self,
+        reward: Optional[float],
+        question_done: bool,
+        done: bool,
+        last_tool_result: Optional[ToolResult] = None,
+    ) -> OrchestratorObservation:
+        state = self._state
+        cfg   = self.config
+        if 0 <= self._current_q_idx < len(self._questions):
+            q_entry = self._questions[self._current_q_idx]
+        elif self._questions:
+            q_entry = self._questions[-1]
+        else:
+            q_entry = {"question": "", "answer": "", "domain": ""}
+        return OrchestratorObservation(
+            question=q_entry.get("question", ""),
+            question_idx=self._current_q_idx,
+            domain=q_entry.get("domain", ""),
+            question_embedding=[],
+            total_budget=cfg.total_budget,
+            budget_spent=state.budget_spent,
+            budget_remaining=state.total_budget - state.budget_spent,
+            budget_remaining_ratio=state.budget_remaining_ratio,
+            tools_used_this_question=list(self._tools_used_this_q),
+            steps_used_this_question=self._steps_this_q,
+            max_steps_per_question=cfg.max_steps_per_question,
+            last_tool_result=last_tool_result,
+            context_window=list(self._context_window),
+            step_idx=state.step_count,
+            questions_remaining=max(0, len(self._questions) - self._current_q_idx - 1),
+            questions_answered=state.questions_answered,
+            accuracy_so_far=state.current_accuracy,
+            question_done=question_done,
+            done=done,
+            reward=reward,
+        )
+# ---------------------------------------------------------------------------
+# Dataset sampling helper
+# ---------------------------------------------------------------------------
+def _sample_questions(
+    dataset: List[Dict[str, Any]],
+    config: EnvConfig,
+    rng: Any,
+) -> List[Dict[str, Any]]:
+    """Sample `config.num_questions` questions according to domain_mix."""
+    by_domain: Dict[str, List[Dict]] = {}
+    for item in dataset:
+        d = item.get("domain", "hotpotqa")
+        by_domain.setdefault(d, []).append(item)
+    selected = []
+    for domain, frac in config.domain_mix.items():
+        n = round(config.num_questions * frac)
+        pool = by_domain.get(domain, [])
+        if pool and n > 0:
+            selected.extend(rng.sample(pool, min(n, len(pool))))
+    if len(selected) < config.num_questions and dataset:
+        remaining = [d for d in dataset if d not in selected]
+        rng.shuffle(remaining)
+        selected.extend(remaining[: config.num_questions - len(selected)])
+    if config.shuffle_questions:
+        rng.shuffle(selected)
+    return selected[: config.num_questions]
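The quota rule `round(config.num_questions * frac)` has a consequence that is easier to see with numbers: Python rounds exact halves to the nearest even integer, so four domains at 0.25 each with 10 questions get a quota of 2 apiece (8 in total), and the fill step supplies the remaining 2. The sketch below re-implements the sampling logic standalone. The dataset contents, the equal 0.25 mix, and the name `sample_by_mix` are illustrative assumptions, not values taken from env/config.py.

```python
import random

def sample_by_mix(dataset, domain_mix, num_questions, rng):
    """Standalone re-implementation of the domain-mix sampling, for illustration."""
    by_domain = {}
    for item in dataset:
        by_domain.setdefault(item["domain"], []).append(item)
    selected = []
    for domain, frac in domain_mix.items():
        n = round(num_questions * frac)  # banker's rounding: round(2.5) == 2
        pool = by_domain.get(domain, [])
        if pool and n > 0:
            selected.extend(rng.sample(pool, min(n, len(pool))))
    # Top up from unused items when the rounded quotas fall short.
    if len(selected) < num_questions:
        remaining = [d for d in dataset if d not in selected]
        rng.shuffle(remaining)
        selected.extend(remaining[: num_questions - len(selected)])
    return selected[:num_questions]

# Hypothetical toy dataset: 5 questions per domain.
domains = ["hotpotqa", "math", "gpqa", "humaneval"]
dataset = [
    {"question": f"{d}-{i}", "answer": "", "domain": d}
    for d in domains
    for i in range(5)
]
mix = {d: 0.25 for d in domains}

quotas = {d: round(10 * frac) for d, frac in mix.items()}
picked = sample_by_mix(dataset, mix, 10, random.Random(0))
print(quotas, len(picked))
```

Because the top-up draws from the whole dataset, the final domain proportions can drift slightly from `domain_mix` whenever the rounded quotas do not sum to `num_questions`.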

env/models.py ADDED

	@@ -0,0 +1,111 @@

+"""Pydantic data models for ToolOrchestratorEnv.
+Three main types flow through the environment:
+  OrchestratorAction      — agent → env  (what tool to call)
+  OrchestratorObservation — env → agent  (what the agent sees)
+  OrchestratorState       — env → server (full bookkeeping snapshot)
+ToolResult is returned by every tool and attached to the observation.
+"""
+from __future__ import annotations
+from typing import Any, Dict, List, Optional
+from pydantic import BaseModel, Field
+# Canonical tool IDs — order matters for UI display; must match config.tool_costs keys.
+TOOL_IDS = [
+    "ceramic_search",   # web retrieval (most useful for HotpotQA)
+    "wiki_lookup",      # Wikipedia summary (good for entity facts)
+    "calculator",       # safe AST math eval (essential for MATH)
+    "code_executor",    # sandboxed Python (HumanEval)
+    "llm_reason",       # LLM chain-of-thought (GPQA)
+    "commit",           # submit answer — always free
+]
+class OrchestratorAction(BaseModel):
+    """One step of the agent's interaction with the environment.
+    Fields used per tool:
+      ceramic_search  → query
+      wiki_lookup     → query
+      calculator      → expression  (falls back to query if blank)
+      code_executor   → code_snippet (falls back to query if blank)
+      llm_reason      → query
+      commit          → answer
+    """
+    tool_id: str
+    query: str = ""
+    expression: str = ""
+    code_snippet: str = ""
+    answer: str = ""
+    metadata: Optional[Dict[str, Any]] = None
+class ToolResult(BaseModel):
+    """Output produced by one tool call.
+    Attached to OrchestratorObservation.last_tool_result and also
+    appended (as a string) to the context_window.
+    """
+    tool_id: str
+    output: str = ""         # Human-readable result text
+    cost: float = 0.0        # Budget units charged (set by environment)
+    latency_s: float = 0.0   # Wall-clock seconds (set by environment)
+    error: Optional[str] = None  # Non-None if the tool call failed
+class OrchestratorObservation(BaseModel):
+    """Everything the agent sees at the start of each step.
+    Designed to be complete: the agent should be able to make an
+    informed tool-selection decision using only this observation.
+    """
+    # ── Current question ────────────────────────────────────────────────────
+    question: str                               # Full question text
+    question_idx: int                           # Position in the episode (0-indexed)
+    domain: str                                 # "hotpotqa" | "math" | "gpqa" | "humaneval"
+    question_embedding: List[float] = Field(default_factory=list)  # Optional embedding vector
+    # ── Budget ──────────────────────────────────────────────────────────────
+    total_budget: float
+    budget_spent: float
+    budget_remaining: float
+    budget_remaining_ratio: float               # budget_remaining / total_budget ∈ [0, 1]
+    # ── Progress on the current question ────────────────────────────────────
+    tools_used_this_question: List[str] = Field(default_factory=list)
+    steps_used_this_question: int = 0
+    max_steps_per_question: int = 8
+    last_tool_result: Optional[ToolResult] = None
+    context_window: List[str] = Field(default_factory=list)  # "[tool_id] output" strings
+    # ── Episode-level progress ───────────────────────────────────────────────
+    step_idx: int = 0                           # Global step counter
+    questions_remaining: int = 0                # Questions not yet started
+    questions_answered: int = 0                 # Questions finished (commit or max_steps)
+    accuracy_so_far: float = 0.0                # Running correctness rate
+    # ── Terminal signals ─────────────────────────────────────────────────────
+    question_done: bool = False                 # This question just ended (commit or max_steps)
+    done: bool = False                          # Episode is over
+    reward: Optional[float] = None              # Reward from the *previous* step
+class OrchestratorState(BaseModel):
+    """Full bookkeeping snapshot — returned alongside observation for logging.
+    Contains all fields needed to reconstruct the episode history without
+    digging into the environment's internal attributes.
+    """
+    episode_id: str
+    total_budget: float
+    budget_spent: float = 0.0
+    questions_answered: int = 0
+    total_correct: int = 0
+    current_accuracy: float = 0.0
+    budget_remaining_ratio: float = 1.0
+    current_question_idx: int = 0
+    current_question_steps: int = 0
+    step_count: int = 0

env/reward.py ADDED

	@@ -0,0 +1,57 @@

+"""Reward functions for ToolOrchestratorEnv.
+Adapted from SearchEconomicsEnv/env/reward.py, generalised to a multi-tool
+action space where each tool has its own cost.
+Two reward signals:
+  step_reward   — small negative penalty charged every time the agent calls
+                  a tool (including commit = 0 cost).  Discourages wasted
+                  calls without forbidding exploration.
+  commit_reward — composite reward awarded when the agent submits an answer.
+                  Balances answer quality against remaining budget (Weitzman
+                  style: you earn a bonus for being both correct *and* frugal).
+"""
+from __future__ import annotations
+from .config import EnvConfig
+def step_reward(tool_id: str, config: EnvConfig) -> float:
+    """Return the (negative) cost of calling tool_id.
+    Example: calculator → -0.1,  llm_reason → -2.0,  commit → 0.0
+    """
+    return -config.tool_costs.get(tool_id, 0.0)
+def commit_reward(
+    quality: float,
+    budget_remaining_ratio: float,
+    config: EnvConfig,
+) -> float:
+    """Composite reward on commit.
+    Formula
+    -------
+        base  = incorrect_reward + quality × (correct_reward − incorrect_reward)
+        η     = 1  if quality ≥ efficiency_bonus_threshold, else 0
+        bonus = η × efficiency_bonus_weight × budget_remaining_ratio
+        R     = base + bonus
+    The efficiency bonus (bonus) is only non-zero when the agent both answers
+    correctly (quality above threshold) *and* conserves budget.  This creates
+    a soft incentive to use cheaper tools and commit early when confident.
+    Parameters
+    ----------
+    quality               : float in [0, 1] — max(ExactMatch, TokenF1)
+    budget_remaining_ratio: float in [0, 1] — fraction of budget still unspent
+    config                : EnvConfig
+    """
+    q     = max(0.0, min(1.0, quality))
+    base  = config.incorrect_reward + q * (config.correct_reward - config.incorrect_reward)
+    eta   = 1.0 if q >= config.efficiency_bonus_threshold else 0.0
+    bonus = eta * config.efficiency_bonus_weight * budget_remaining_ratio
+    return base + bonus
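A worked example may make the commit formula concrete. The sketch below copies the arithmetic of `commit_reward` into a standalone function and evaluates three cases. The `DemoConfig` values are hypothetical, chosen only so the numbers are easy to follow; the real defaults live in env/config.py, which is not shown here.

```python
from dataclasses import dataclass

@dataclass
class DemoConfig:
    # Hypothetical values for illustration only, not the project's defaults.
    correct_reward: float = 1.0
    incorrect_reward: float = -0.5
    efficiency_bonus_threshold: float = 0.8
    efficiency_bonus_weight: float = 0.5

def commit_reward(quality, budget_remaining_ratio, config):
    q = max(0.0, min(1.0, quality))  # clamp quality to [0, 1]
    base = config.incorrect_reward + q * (config.correct_reward - config.incorrect_reward)
    eta = 1.0 if q >= config.efficiency_bonus_threshold else 0.0
    return base + eta * config.efficiency_bonus_weight * budget_remaining_ratio

cfg = DemoConfig()

# Perfect answer with half the budget left: base 1.0 + bonus 0.5 * 0.5.
perfect_frugal = commit_reward(1.0, 0.5, cfg)   # 1.25
# Partial answer below the threshold: interpolated base, no bonus.
partial = commit_reward(0.5, 0.5, cfg)          # 0.25
# Wrong answer: unspent budget earns nothing.
wrong = commit_reward(0.0, 1.0, cfg)            # -0.5

print(perfect_frugal, partial, wrong)
```

Under these assumed values, leftover budget only pays off once quality clears the threshold, so committing early with a wrong answer is never rewarded for frugality.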

environment.py ADDED

	@@ -0,0 +1,307 @@

+"""Compatibility shim — real code lives in env/environment.py.
+Step logic:
+  - Agent receives an OrchestratorObservation with the current question,
+    budget, context, and available tools.
+  - Agent picks a tool_id and optional query / code_snippet / answer.
+  - Environment dispatches to the appropriate tool, charges cost, appends
+    result to context_window, and returns the next observation + reward.
+  - Episode ends when budget is exhausted OR all questions are answered.
+"""
+from __future__ import annotations
+import time
+import uuid
+from typing import Any, Dict, List, Optional, Tuple
+from env.answer_grading import grade
+from env.config import EnvConfig
+from env.models import (
+    OrchestratorAction,
+    OrchestratorObservation,
+    OrchestratorState,
+    ToolResult,
+    TOOL_IDS,
+)
+from env.reward import commit_reward, step_reward
+class ToolOrchestratorEnvironment:
+    """
+    OpenEnv-compatible RL environment for multi-tool cost-aware QA.
+    Supports external tool injection so the server can wire in live
+    Ceramic, code executor, etc.  Tools are callables with signature:
+        tool_fn(action: OrchestratorAction) -> ToolResult
+    """
+    def __init__(
+        self,
+        config: Optional[EnvConfig] = None,
+        tools: Optional[Dict[str, Any]] = None,
+        dataset: Optional[List[Dict[str, Any]]] = None,
+    ):
+        self.config  = config or EnvConfig()
+        self.tools   = tools or {}      # tool_id -> callable
+        self.dataset = dataset or []    # List of {question, answer, domain}
+        self._state: Optional[OrchestratorState] = None
+        self._questions: List[Dict[str, Any]] = []
+        self._current_q_idx: int = 0
+        self._context_window: List[str] = []
+        self._tools_used_this_q: List[str] = []
+        self._steps_this_q: int = 0
+        self._episode_done: bool = False
+    # -----------------------------------------------------------------------
+    # Reset
+    # -----------------------------------------------------------------------
+    def reset(self, seed: Optional[int] = None) -> Tuple[OrchestratorObservation, OrchestratorState]:
+        import random
+        rng = random.Random(seed if seed is not None else self.config.seed)
+        # Sample questions according to domain_mix
+        questions = _sample_questions(self.dataset, self.config, rng)
+        self._questions = questions
+        self._current_q_idx = 0
+        self._episode_done = False
+        self._state = OrchestratorState(
+            episode_id=str(uuid.uuid4()),
+            total_budget=self.config.total_budget,
+            budget_spent=0,
+            questions_answered=0,
+            total_correct=0,
+            current_accuracy=0.0,
+            budget_remaining_ratio=1.0,
+            current_question_idx=0,
+            current_question_steps=0,
+        )
+        self._context_window = []
+        self._tools_used_this_q = []
+        self._steps_this_q = 0
+        obs = self._make_obs(reward=None, question_done=False, done=False)
+        return obs, self._state
+    # -----------------------------------------------------------------------
+    # Step
+    # -----------------------------------------------------------------------
+    def step(
+        self, action: OrchestratorAction
+    ) -> Tuple[OrchestratorObservation, float, bool, OrchestratorState]:
+        if self._episode_done:
+            raise RuntimeError("Episode is done. Call reset() first.")
+        tool_id = action.tool_id
+        if tool_id not in TOOL_IDS:
+            raise ValueError(f"Unknown tool_id: {tool_id!r}. Valid: {TOOL_IDS}")
+        state   = self._state
+        config  = self.config
+        # Guard against exhausted question list (can happen after last commit)
+        if self._current_q_idx >= len(self._questions):
+            self._episode_done = True
+            raise RuntimeError("Episode is done. Call reset() first.")
+        q_entry = self._questions[self._current_q_idx]
+        gold    = q_entry["answer"]
+        # ---- Commit ---------------------------------------------------
+        if tool_id == "commit":
+            raw_pred = action.answer or ""
+            em, f1, quality = grade(raw_pred, gold)
+            r = commit_reward(
+                quality=quality,
+                budget_remaining_ratio=state.budget_remaining_ratio,
+                config=config,
+            )
+            # Count correct
+            is_correct = (
+                em if config.grade_count_correct_mode == "em_only"
+                else (em or f1 >= config.f1_count_threshold)
+            )
+            state.questions_answered += 1
+            state.total_correct += int(is_correct)
+            state.current_accuracy = state.total_correct / state.questions_answered
+            # Advance to next question or end episode
+            self._current_q_idx += 1
+            self._context_window = []
+            self._tools_used_this_q = []
+            self._steps_this_q = 0
+            episode_done = (
+                self._current_q_idx >= len(self._questions)
+                or state.budget_spent >= state.total_budget
+            )
+            self._episode_done = episode_done
+            state.current_question_idx = self._current_q_idx
+            state.current_question_steps = 0
+            obs = self._make_obs(
+                reward=r,
+                question_done=True,
+                done=episode_done,
+                last_tool_result=ToolResult(
+                    tool_id="commit", cost=0,
+                    output=f"EM={em} F1={f1:.3f} quality={quality:.3f}"
+                ),
+            )
+            return obs, r, episode_done, state
+        # ---- Tool call ------------------------------------------------
+        cost = config.tool_costs.get(tool_id, 0)
+        budget_after = state.budget_spent + cost
+        # If over budget, force commit penalty
+        if budget_after > state.total_budget:
+            r = config.incorrect_reward
+            self._episode_done = True
+            obs = self._make_obs(reward=r, question_done=True, done=True)
+            return obs, r, True, state
+        # Dispatch tool
+        t0 = time.perf_counter()
+        tool_fn = self.tools.get(tool_id)
+        if tool_fn is None:
+            tool_result = ToolResult(
+                tool_id=tool_id, cost=cost,
+                output="[Tool not available in this environment]",
+                latency_s=0.0,
+                error="not_available",
+            )
+        else:
+            try:
+                tool_result = tool_fn(action)
+                tool_result.cost = cost
+            except Exception as exc:
+                tool_result = ToolResult(
+                    tool_id=tool_id, cost=cost,
+                    output=f"[Error: {exc}]",
+                    latency_s=time.perf_counter() - t0,
+                    error=str(exc),
+                )
+        tool_result.latency_s = time.perf_counter() - t0
+        # Charge cost and update state
+        state.budget_spent = budget_after
+        state.budget_remaining_ratio = max(
+            0.0, (state.total_budget - state.budget_spent) / state.total_budget
+        )
+        state.step_count += 1
+        state.current_question_steps += 1
+        self._steps_this_q += 1
+        self._tools_used_this_q.append(tool_id)
+        self._context_window.append(f"[{tool_id}] {tool_result.output}")
+        r = step_reward(tool_id, config)
+        # Auto-commit if max steps reached
+        question_done = self._steps_this_q >= config.max_steps_per_question
+        episode_done  = (
+            state.budget_spent >= state.total_budget
+            or (question_done and self._current_q_idx + 1 >= len(self._questions))
+        )
+        if question_done and not episode_done:
+            self._current_q_idx += 1
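+            state.questions_answered += 1
+            # Timed-out question counts as answered but incorrect; refresh running accuracy.
+            state.current_accuracy = state.total_correct / state.questions_answered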
+            self._context_window = []
+            self._tools_used_this_q = []
+            self._steps_this_q = 0
+            state.current_question_idx = self._current_q_idx
+            state.current_question_steps = 0
+        self._episode_done = episode_done
+        obs = self._make_obs(
+            reward=r,
+            question_done=question_done,
+            done=episode_done,
+            last_tool_result=tool_result,
+        )
+        return obs, r, episode_done, state
+    # -----------------------------------------------------------------------
+    # Internal helpers
+    # -----------------------------------------------------------------------
+    def _make_obs(
+        self,
+        reward: Optional[float],
+        question_done: bool,
+        done: bool,
+        last_tool_result: Optional[ToolResult] = None,
+    ) -> OrchestratorObservation:
+        state = self._state
+        cfg   = self.config
+        if 0 <= self._current_q_idx < len(self._questions):
+            q_entry = self._questions[self._current_q_idx]
+        elif self._questions:
+            # Episode finished — repeat last question info (obs is terminal anyway)
+            q_entry = self._questions[-1]
+        else:
+            q_entry = {"question": "", "answer": "", "domain": ""}
+        return OrchestratorObservation(
+            question=q_entry.get("question", ""),
+            question_idx=self._current_q_idx,
+            domain=q_entry.get("domain", ""),
+            question_embedding=[],          # populated by server if needed
+            total_budget=cfg.total_budget,
+            budget_spent=state.budget_spent,
+            budget_remaining=state.total_budget - state.budget_spent,
+            budget_remaining_ratio=state.budget_remaining_ratio,
+            tools_used_this_question=list(self._tools_used_this_q),
+            steps_used_this_question=self._steps_this_q,
+            max_steps_per_question=cfg.max_steps_per_question,
+            last_tool_result=last_tool_result,
+            context_window=list(self._context_window),
+            step_idx=state.step_count,
+            questions_remaining=max(0, len(self._questions) - self._current_q_idx - 1),
+            questions_answered=state.questions_answered,
+            accuracy_so_far=state.current_accuracy,
+            question_done=question_done,
+            done=done,
+            reward=reward,
+        )
+# ---------------------------------------------------------------------------
+# Dataset sampling helper
+# ---------------------------------------------------------------------------
+def _sample_questions(
+    dataset: List[Dict[str, Any]],
+    config: EnvConfig,
+    rng: Any,
+) -> List[Dict[str, Any]]:
+    """Sample `config.num_questions` questions according to domain_mix."""
+    by_domain: Dict[str, List[Dict]] = {}
+    for item in dataset:
+        d = item.get("domain", "hotpotqa")
+        by_domain.setdefault(d, []).append(item)
+    selected = []
+    for domain, frac in config.domain_mix.items():
+        n = round(config.num_questions * frac)
+        pool = by_domain.get(domain, [])
+        if pool and n > 0:
+            selected.extend(rng.sample(pool, min(n, len(pool))))
+    # Guarantee at least num_questions items by filling from the full dataset
+    if len(selected) < config.num_questions and dataset:
+        remaining = [d for d in dataset if d not in selected]
+        rng.shuffle(remaining)
+        selected.extend(remaining[: config.num_questions - len(selected)])
+    if config.shuffle_questions:
+        rng.shuffle(selected)
+    return selected[: config.num_questions]

openenv.yaml ADDED

	@@ -0,0 +1,6 @@

+spec: 1
+app: tool-orchestrator-env
+type: space
+runtime: fastapi
+entrypoint: app:app
+port: 8000

requirements.txt ADDED

	@@ -0,0 +1,8 @@

+fastapi>=0.110.0
+uvicorn[standard]>=0.29.0
+pydantic>=2.0
+numpy>=1.24
+datasets>=2.18.0
+httpx>=0.27.0
+requests>=2.31.0
+together>=1.2.0

tools/__init__.py ADDED

	@@ -0,0 +1,29 @@

+"""Tool registry for ToolOrchestratorEnv.
+Each tool is a callable: (action: OrchestratorAction) -> ToolResult
+"""
+from __future__ import annotations
+from typing import Callable, Dict
+from env.config import EnvConfig
+from env.models import OrchestratorAction, ToolResult
+from .calculator import calculator_tool
+from .ceramic_search import make_search_tool
+from .code_executor import code_executor_tool
+from .commit import commit_tool
+from .llm_reason import llm_reason_tool
+from .wiki_lookup import wiki_lookup_tool
+def build_tool_registry(config: EnvConfig | None = None) -> Dict[str, Callable]:
+    """Return a mapping of tool_id → tool function."""
+    return {
+        "ceramic_search": make_search_tool(),
+        "calculator":     calculator_tool,
+        "wiki_lookup":    wiki_lookup_tool,
+        "code_executor":  code_executor_tool,
+        "llm_reason":     llm_reason_tool,
+        "commit":         commit_tool,
+    }

tools/calculator.py ADDED

	@@ -0,0 +1,74 @@

+"""Safe AST-based calculator tool.
+Supports arithmetic operators plus basic math functions and constants.
+No exec/eval with arbitrary code — uses ast.literal_eval-style restricted eval.
+"""
+from __future__ import annotations
+import ast
+import math
+import operator
+from typing import Any
+from env.models import OrchestratorAction, ToolResult
+_SAFE_OPS = {
+    ast.Add:  operator.add,
+    ast.Sub:  operator.sub,
+    ast.Mult: operator.mul,
+    ast.Div:  operator.truediv,
+    ast.Pow:  operator.pow,
+    ast.Mod:  operator.mod,
+    ast.FloorDiv: operator.floordiv,
+    ast.USub: operator.neg,
+    ast.UAdd: operator.pos,
+}
+_SAFE_FUNCS: dict[str, Any] = {
+    "abs": abs, "round": round, "min": min, "max": max,
+    "sqrt": math.sqrt, "log": math.log, "log2": math.log2,
+    "log10": math.log10, "exp": math.exp,
+    "sin": math.sin, "cos": math.cos, "tan": math.tan,
+    "floor": math.floor, "ceil": math.ceil,
+    "pi": math.pi, "e": math.e,
+}
+def _safe_eval(node: ast.AST) -> Any:
+    if isinstance(node, ast.Expression):
+        return _safe_eval(node.body)
+    if isinstance(node, ast.Constant):
+        return node.value
+    if isinstance(node, ast.Name):
+        if node.id in _SAFE_FUNCS:
+            return _SAFE_FUNCS[node.id]
+        raise ValueError(f"Unknown name: {node.id!r}")
+    if isinstance(node, ast.BinOp):
+        op_type = type(node.op)
+        if op_type not in _SAFE_OPS:
+            raise ValueError(f"Unsupported operator: {op_type.__name__}")
+        return _SAFE_OPS[op_type](_safe_eval(node.left), _safe_eval(node.right))
+    if isinstance(node, ast.UnaryOp):
+        op_type = type(node.op)
+        if op_type not in _SAFE_OPS:
+            raise ValueError(f"Unsupported unary: {op_type.__name__}")
+        return _SAFE_OPS[op_type](_safe_eval(node.operand))
+    if isinstance(node, ast.Call):
+        func = _safe_eval(node.func)
+        if not callable(func):
+            raise ValueError("Not callable")
+        args = [_safe_eval(a) for a in node.args]
+        return func(*args)
+    raise ValueError(f"Unsupported AST node: {type(node).__name__}")
+def calculator_tool(action: OrchestratorAction) -> ToolResult:
+    expr = (action.expression or action.query or "").strip()
+    if not expr:
+        return ToolResult(tool_id="calculator", output="[No expression provided]", error="empty")
+    try:
+        tree = ast.parse(expr, mode="eval")
+        result = _safe_eval(tree)
+        return ToolResult(tool_id="calculator", output=str(result))
+    except Exception as exc:
+        return ToolResult(tool_id="calculator", output=f"[Calc error: {exc}]", error=str(exc))
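The whitelist approach can be tried in isolation. The sketch below is a trimmed, standalone variant of the evaluator with a smaller operator and name table; the helper names `safe_eval` and `calc` are illustrative, not the project's API. It shows that any node type without an explicit branch, such as attribute access or a comparison, is rejected rather than evaluated.

```python
import ast
import math
import operator

# Reduced tables for illustration; the tool above whitelists more entries.
OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg,
}
NAMES = {"sqrt": math.sqrt, "pi": math.pi}

def safe_eval(node):
    if isinstance(node, ast.Expression):
        return safe_eval(node.body)
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name) and node.id in NAMES:
        return NAMES[node.id]
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](safe_eval(node.left), safe_eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
        return OPS[type(node.op)](safe_eval(node.operand))
    if isinstance(node, ast.Call):
        func = safe_eval(node.func)
        return func(*[safe_eval(a) for a in node.args])
    # Anything not whitelisted above (Attribute, Compare, Lambda, ...) ends here.
    raise ValueError(f"Unsupported AST node: {type(node).__name__}")

def calc(expr):
    try:
        return str(safe_eval(ast.parse(expr, mode="eval")))
    except Exception as exc:
        return f"[Calc error: {exc}]"

ok = calc("sqrt(16) + 2 ** 3")
rejected_attr = calc("__import__('os').system('ls')")
rejected_cmp = calc("1 < 2")
print(ok, rejected_attr, rejected_cmp, sep="\n")
```

Rejecting by default keeps the attack surface limited to the listed operators and functions, although expressions such as very large powers can still consume a lot of CPU time.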

tools/ceramic_search.py ADDED

	@@ -0,0 +1,34 @@

+"""Ceramic search tool — wraps CeramicClient."""
+from __future__ import annotations
+from ceramic.client import get_ceramic_client
+from env.models import OrchestratorAction, ToolResult
+def make_search_tool(top_k: int = 3):
+    """Factory: creates a search tool with a shared Ceramic client."""
+    client = get_ceramic_client()
+    def _search(action: OrchestratorAction) -> ToolResult:
+        query = (action.query or "").strip()
+        if not query:
+            return ToolResult(
+                tool_id="ceramic_search",
+                output="[No query provided]",
+                error="empty_query",
+            )
+        try:
+            results = client.search(query, top_k=top_k)
+            snippets = []
+            for r in results:
+                snippets.append(f"**{r.title}** ({r.score:.2f})\n{r.description}")
+            output = "\n\n".join(snippets) if snippets else "[No results found]"
+            return ToolResult(tool_id="ceramic_search", output=output)
+        except Exception as exc:
+            return ToolResult(
+                tool_id="ceramic_search",
+                output=f"[Search error: {exc}]",
+                error=str(exc),
+            )
+    return _search

tools/code_executor.py ADDED

	@@ -0,0 +1,63 @@

+"""Restricted Python code executor.
+Runs code in a sandboxed namespace — blocks os/sys/subprocess imports
+and captures stdout. Intended for math / algorithmic tasks.
+"""
+from __future__ import annotations
+import io
+import sys
+import contextlib
+from env.models import OrchestratorAction, ToolResult
+_BLOCKED_MODULES = frozenset({
+    "os", "sys", "subprocess", "socket", "shutil", "pathlib",
+    "importlib", "builtins", "ctypes", "multiprocessing", "threading",
+    "signal", "pty", "fcntl", "resource", "gc", "inspect",
+})
+_MAX_OUTPUT_CHARS = 2000
+class _BlockedImport:
+    """Raise on any import of blocked modules."""
+    def __init__(self, original_import):
+        self._orig = original_import
+    def __call__(self, name, *args, **kwargs):
+        base = name.split(".")[0]
+        if base in _BLOCKED_MODULES:
+            raise ImportError(f"Module '{name}' is not allowed in code_executor")
+        return self._orig(name, *args, **kwargs)
+def code_executor_tool(action: OrchestratorAction) -> ToolResult:
+    code = (action.code_snippet or action.query or "").strip()
+    if not code:
+        return ToolResult(tool_id="code_executor", output="[No code provided]", error="empty")
+    stdout_buf = io.StringIO()
+    # The import statement resolves __import__ through __builtins__, not module
+    # globals, so the guarded importer must be installed inside the builtins dict.
+    builtins_ns = __builtins__ if isinstance(__builtins__, dict) else vars(__builtins__)
+    safe_builtins = {
+        k: v for k, v in builtins_ns.items()
+        if k not in ("open", "exec", "eval", "compile", "__import__")
+    }
+    safe_builtins["__import__"] = _BlockedImport(__import__)
+    safe_globals = {
+        "__builtins__": safe_builtins,
+        "print": lambda *a, **kw: print(*a, **kw, file=stdout_buf),
+    }
+    try:
+        with contextlib.redirect_stdout(stdout_buf):
+            exec(compile(code, "<code_executor>", "exec"), safe_globals)  # noqa: S102
+        output = stdout_buf.getvalue()[:_MAX_OUTPUT_CHARS] or "[Code ran, no output]"
+        return ToolResult(tool_id="code_executor", output=output)
+    except Exception as exc:
+        return ToolResult(
+            tool_id="code_executor",
+            output=f"[Execution error: {exc}]",
+            error=str(exc),
+        )
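The import guard described in the module docstring can be sketched in isolation. This sketch assumes the guard is installed inside the `__builtins__` mapping handed to `exec`, because that is where the `import` statement looks up `__import__`. The names `guarded_import` and `run` and the shortened block list are illustrative only.

```python
import builtins
import contextlib
import io

BLOCKED = frozenset({"os", "sys", "subprocess"})  # shortened list for the demo

def guarded_import(name, *args, **kwargs):
    # Reject blocked top-level packages, delegate everything else.
    if name.split(".")[0] in BLOCKED:
        raise ImportError(f"Module '{name}' is not allowed")
    return builtins.__import__(name, *args, **kwargs)

def run(code):
    safe_builtins = {
        k: v for k, v in vars(builtins).items()
        if k not in ("open", "exec", "eval", "compile", "__import__")
    }
    # `import x` resolves __import__ via __builtins__, so the guard goes here.
    safe_builtins["__import__"] = guarded_import
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(compile(code, "<demo>", "exec"), {"__builtins__": safe_builtins})
        return buf.getvalue()
    except Exception as exc:
        return f"[Execution error: {exc}]"

allowed = run("import math\nprint(math.factorial(5))")
blocked = run("import os\nprint(os.getcwd())")
print(allowed, blocked, sep="")
```

Filtering builtins this way limits accidental misuse but is not a security boundary on its own; untrusted code generally calls for process-level isolation as well.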

tools/commit.py ADDED

	@@ -0,0 +1,12 @@

+"""Commit tool — passes the answer through; grading happens in environment.py."""
+from __future__ import annotations
+from env.models import OrchestratorAction, ToolResult
+def commit_tool(action: OrchestratorAction) -> ToolResult:
+    answer = (action.answer or "").strip()
+    return ToolResult(
+        tool_id="commit",
+        output=f"Committed answer: {answer[:200]}",
+    )

tools/llm_reason.py ADDED

	@@ -0,0 +1,43 @@

+"""LLM reasoning tool — calls Together AI (or falls back gracefully)."""
+from __future__ import annotations
+import os
+from env.models import OrchestratorAction, ToolResult
+_DEFAULT_MODEL = "meta-llama/Llama-3-8b-chat-hf"
+_MAX_TOKENS = 512
+def llm_reason_tool(action: OrchestratorAction) -> ToolResult:
+    prompt = (action.query or "").strip()
+    if not prompt:
+        return ToolResult(tool_id="llm_reason", output="[No prompt provided]", error="empty")
+    api_key = os.environ.get("TOGETHER_API_KEY") or os.environ.get("TOGETHER_KEY")
+    if not api_key:
+        return ToolResult(
+            tool_id="llm_reason",
+            output="[LLM reasoning not configured — set TOGETHER_API_KEY]",
+            error="no_api_key",
+        )
+    try:
+        import together  # type: ignore
+        client = together.Together(api_key=api_key)
+        resp = client.chat.completions.create(
+            model=_DEFAULT_MODEL,
+            messages=[{"role": "user", "content": prompt}],
+            max_tokens=_MAX_TOKENS,
+            temperature=0.0,
+        )
+        text = resp.choices[0].message.content or ""
+        return ToolResult(tool_id="llm_reason", output=text.strip()[:2000])
+    except ImportError:
+        return ToolResult(
+            tool_id="llm_reason",
+            output="[together package not installed — pip install together]",
+            error="import_error",
+        )
+    except Exception as exc:
+        return ToolResult(tool_id="llm_reason", output=f"[LLM error: {exc}]", error=str(exc))
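
The order of the early returns in `llm_reason_tool` (empty prompt first, then missing key) can be checked offline with a small pure function. This is only a sketch: `precheck` is a hypothetical helper that is not part of the commit, and it takes the environment as a mapping so that no real API key or network call is involved.

```python
from typing import Mapping, Optional


def precheck(prompt: Optional[str], environ: Mapping[str, str]) -> Optional[str]:
    """Return the error code the tool would report, or None if the API call would proceed."""
    if not (prompt or "").strip():
        return "empty"
    # Same lookup order as the tool: TOGETHER_API_KEY first, then TOGETHER_KEY.
    if not (environ.get("TOGETHER_API_KEY") or environ.get("TOGETHER_KEY")):
        return "no_api_key"
    return None
```

Note that an empty `TOGETHER_API_KEY=` line, as shipped in `.env.example`, counts as unset because the empty string is falsy.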

tools/wiki_lookup.py ADDED

	@@ -0,0 +1,38 @@

+"""Wikipedia lookup tool — returns the intro paragraph of an article."""
+from __future__ import annotations
+import json
+import urllib.error
+import urllib.parse
+import urllib.request
+from env.models import OrchestratorAction, ToolResult
+_WIKI_API = "https://en.wikipedia.org/api/rest_v1/page/summary/{}"
+def wiki_lookup_tool(action: OrchestratorAction) -> ToolResult:
+    query = (action.query or "").strip()
+    if not query:
+        return ToolResult(tool_id="wiki_lookup", output="[No query provided]", error="empty_query")
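+    # safe="" also percent-encodes "/", so titles such as "AC/DC" stay a single path segment.
+    title = urllib.parse.quote(query.replace(" ", "_"), safe="")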
+    url = _WIKI_API.format(title)
+    try:
+        req = urllib.request.Request(url, headers={"User-Agent": "ToolOrchestratorEnv/0.1"})
+        with urllib.request.urlopen(req, timeout=8) as resp:
+            data = json.loads(resp.read().decode())
+        extract = data.get("extract", "").strip()
+        page_title = data.get("title", query)
+        if not extract:
+            return ToolResult(tool_id="wiki_lookup", output=f"[No summary found for '{query}']")
+        return ToolResult(tool_id="wiki_lookup", output=f"**{page_title}**\n{extract[:800]}")
+    except urllib.error.HTTPError as exc:
+        if exc.code == 404:
+            return ToolResult(
+                tool_id="wiki_lookup",
+                output=f"[Wikipedia: no article found for '{query}']",
+                error="not_found",
+            )
+        return ToolResult(tool_id="wiki_lookup", output=f"[Wiki HTTP error {exc.code}]", error=str(exc))
+    except Exception as exc:
+        return ToolResult(tool_id="wiki_lookup", output=f"[Wiki error: {exc}]", error=str(exc))
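
How the summary URL is built from a query can be shown without touching the network. The sketch below assumes a hypothetical helper, `summary_url`, that is not in the commit; it illustrates how the `safe` argument of `urllib.parse.quote` affects titles that contain a slash.

```python
import urllib.parse

_WIKI_API = "https://en.wikipedia.org/api/rest_v1/page/summary/{}"


def summary_url(query: str) -> str:
    """Build the REST summary URL for a free-text article title."""
    # Spaces become underscores; safe="" makes quote() escape "/" as %2F.
    title = urllib.parse.quote(query.strip().replace(" ", "_"), safe="")
    return _WIKI_API.format(title)
```

With the default `safe="/"`, `urllib.parse.quote("AC/DC")` returns the string unchanged, so the slash would be read as a path separator rather than as part of the title.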