landrew9/ToolOrchestratorEnv
Andrew Lara committed on Apr 16

Commit fb0cc18 (1 parent: 98c3ce1)

Tighten tool routing and executor safety


Files changed (16)

README.md +15 -1
RESEARCH.md +11 -8
app.py +159 -115
client.py +2 -121
env/environment.py +12 -20
environment.py +5 -303
requirements.txt +1 -0
tests/conftest.py +45 -0
tests/test_app.py +82 -0
tests/test_code_executor.py +45 -0
tests/test_tools.py +64 -0
tools/__init__.py +17 -26
tools/calculator.py +20 -0
tools/code_executor.py +164 -35
tools/runtime.py +159 -0
tools/wiki_lookup.py +1 -0

README.md CHANGED

@@ -108,6 +108,10 @@ Execute one tool call. Pass `session_id` (from `/reset`) as a query param to sup
 { "tool_id": "commit",         "answer": "1889" }
 ```
 ### `GET /health`
 Returns `{"status": "ok"}`.
@@ -123,6 +127,7 @@ claude_toolOrchestrator/
 ├── openenv.yaml            # OpenEnv deployment spec
 ├── requirements.txt        # Python dependencies
 ├── .env.example            # Key template (copy → .env, never commit .env)
 │
 ├── env/                    # ── Core RL environment ──────────────────────────
 │   ├── environment.py      # CostAwareToolEnvironment: reset() + step()
@@ -141,7 +146,7 @@ claude_toolOrchestrator/
 │   ├── ceramic_search.py   # Web search (Ceramic AI API)
 │   ├── wiki_lookup.py      # Wikipedia REST API, first paragraph
 │   ├── calculator.py       # Safe AST-based math evaluator (no exec)
-│   ├── code_executor.py    # Sandboxed Python exec (blocks os/sys/subprocess)
 │   ├── llm_reason.py       # Together AI chain-of-thought (graceful fallback)
 │   └── commit.py           # Answer pass-through; grading runs in environment
 │
@@ -164,6 +169,15 @@ claude_toolOrchestrator/
 If no Ceramic key is set, `ceramic_search` falls back to deterministic offline results; all other tools work without any key.
 ---
 ## Running baselines

 { "tool_id": "commit",         "answer": "1889" }
 ```
+### `GET /tools`
+Returns the canonical tool manifest with each tool's label, purpose, input field, cost, and safety notes.
 ### `GET /health`
 Returns `{"status": "ok"}`.
 ├── openenv.yaml            # OpenEnv deployment spec
 ├── requirements.txt        # Python dependencies
 ├── .env.example            # Key template (copy → .env, never commit .env)
+├── tools/runtime.py        # Tool catalog, validation, and explicit dispatch
 │
 ├── env/                    # ── Core RL environment ──────────────────────────
 │   ├── environment.py      # CostAwareToolEnvironment: reset() + step()
 │   ├── ceramic_search.py   # Web search (Ceramic AI API)
 │   ├── wiki_lookup.py      # Wikipedia REST API, first paragraph
 │   ├── calculator.py       # Safe AST-based math evaluator (no exec)
+│   ├── code_executor.py    # Sandboxed Python exec (blocked imports, dunder attrs)
 │   ├── llm_reason.py       # Together AI chain-of-thought (graceful fallback)
 │   └── commit.py           # Answer pass-through; grading runs in environment
 │
 If no Ceramic key is set, `ceramic_search` falls back to deterministic offline results; all other tools work without any key.
+The Python executor is intentionally narrow:
+- import statements are rejected
+- obvious sandbox-escape names such as `open`, `eval`, `globals`, and `__import__` are blocked
+- dunder attribute access such as `.__class__` and `.__subclasses__()` is blocked
+- only a curated builtin/module surface is exposed
+That keeps the tool usable for intended coding tasks without turning it into a hidden general-purpose shell.
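
A minimal sketch of how the first three rules above could be enforced with a static pass over the parsed source before anything is executed. This is an illustration only: `find_violations` and `BLOCKED_NAMES` are hypothetical names, the deny-list is just the four names mentioned in the README text, and the actual checks in `tools/code_executor.py` are not shown in this diff and may differ.

```python
import ast

# Hypothetical deny-list taken from the README wording; the real list may be longer.
BLOCKED_NAMES = {"open", "eval", "globals", "__import__"}


def find_violations(source: str) -> list[str]:
    """Return the reasons `source` should be rejected (empty list means it passed)."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            # Rule 1: no import statements of any kind.
            problems.append("import statements are not allowed")
        elif isinstance(node, ast.Name) and node.id in BLOCKED_NAMES:
            # Rule 2: obvious sandbox-escape names.
            problems.append(f"blocked name: {node.id}")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            # Rule 3: dunder attribute access such as .__class__.
            problems.append(f"blocked dunder attribute: {node.attr}")
    return problems


print(find_violations("print(sum(range(5)))"))
print(find_violations("import os"))
print(find_violations("().__class__.__subclasses__()"))
```

A static pass like this only covers the first three bullets; the fourth (a curated builtin/module surface) would have to be applied separately when the code is actually run.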
 ---
 ## Running baselines

RESEARCH.md CHANGED

@@ -147,7 +147,7 @@ action.query = "William Shakespeare"
 ### `calculator` — Cost: **0.1**
-**What it does:** Evaluates a math expression safely using Python's `ast` (Abstract Syntax Tree) module. The expression is parsed into a tree structure, and only pre-approved operations are allowed (addition, subtraction, multiplication, division, power, modulo, and common math functions like `sqrt`, `log`, `sin`, `cos`).
 **Why not just use `eval()`?** Because `eval("__import__('os').system('rm -rf /')")` would delete your hard drive. The AST approach means the code is never executed — it's parsed into a data structure and we only compute what we explicitly allow.
@@ -168,9 +168,9 @@ action.expression = "import os"  # BLOCKED — not a valid math expression
 ### `code_executor` — Cost: **0.3**
-**What it does:** Runs arbitrary Python code in a sandboxed `exec()` environment. Captures whatever is printed to stdout and returns it as the result.
-**Security model:** Blocks imports of dangerous modules (`os`, `sys`, `subprocess`, `socket`, `shutil`, `pathlib`, `importlib`, `ctypes`, `multiprocessing`, `threading`, and more). Uses a custom `__import__` wrapper that raises an `ImportError` before the module loads.
 **Best for:** HumanEval coding tasks where the agent needs to actually run code to verify correctness.
@@ -197,6 +197,8 @@ print(fibonacci(10))
 **Graceful fallback:** If `TOGETHER_API_KEY` is not set, returns a clear error message instead of crashing. The agent learns to avoid this tool when it's unavailable.
 ---
 ### `commit` — Cost: **0.0**
@@ -503,9 +505,9 @@ A well-trained agent should exhibit these behaviors:
 CostAwareToolEnv/
 │
 ├── app.py
-│   The FastAPI web server. Handles /reset, /step, /health, /web.
 │   Multi-session: each /reset returns a session_id used in /step.
-│   Loads dataset + tools once at startup for efficiency.
 │
 ├── openenv.yaml
 │   Deployment spec for the OpenEnv competition framework.
@@ -552,11 +554,12 @@ CostAwareToolEnv/
 │       Returns flat List[Dict] with 'domain' key on each item.
 │
 ├── tools/
-│   ├── __init__.py       build_tool_registry() — returns {tool_id: callable}
 │   ├── ceramic_search.py make_search_tool() factory wrapping CeramicClient
 │   ├── wiki_lookup.py    Wikipedia REST API, first paragraph
-│   ├── calculator.py     Safe AST-based math eval
-│   ├── code_executor.py  Sandboxed exec with blocked dangerous imports
 │   ├── llm_reason.py     Together AI API, graceful fallback
 │   └── commit.py         Pass-through; grading is in environment.py
 │

 ### `calculator` — Cost: **0.1**
+**What it does:** Evaluates a math expression safely using Python's `ast` (Abstract Syntax Tree) module. The expression is parsed into a tree structure, and only pre-approved operations are allowed (addition, subtraction, multiplication, division, power, modulo, comparisons, and common math functions like `sqrt`, `log`, `sin`, `cos`).
 **Why not just use `eval()`?** Because `eval("__import__('os').system('rm -rf /')")` would delete your hard drive. The AST approach means the code is never executed — it's parsed into a data structure and we only compute what we explicitly allow.
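
A rough sketch of the allow-list idea described above, including comparisons. The names `safe_eval`, `BIN_OPS`, `CMP_OPS`, and `FUNCS` are assumptions made for illustration; `tools/calculator.py` itself is not shown in this diff and may be structured differently.

```python
import ast
import math
import operator

# Hypothetical allow-lists: anything not listed here is rejected.
BIN_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.Div: operator.truediv, ast.Pow: operator.pow, ast.Mod: operator.mod,
}
UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}
CMP_OPS = {
    ast.Lt: operator.lt, ast.LtE: operator.le, ast.Gt: operator.gt,
    ast.GtE: operator.ge, ast.Eq: operator.eq, ast.NotEq: operator.ne,
}
FUNCS = {"sqrt": math.sqrt, "log": math.log, "sin": math.sin, "cos": math.cos}


def safe_eval(expression: str):
    """Parse `expression` and compute it by walking the tree; nothing is exec'd."""
    return _eval(ast.parse(expression, mode="eval").body)


def _eval(node):
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in BIN_OPS:
        return BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in UNARY_OPS:
        return UNARY_OPS[type(node.op)](_eval(node.operand))
    if isinstance(node, ast.Compare):
        # Chained comparisons (a < b < c) are evaluated pairwise, left to right.
        left = _eval(node.left)
        for op, comparator in zip(node.ops, node.comparators):
            if type(op) not in CMP_OPS:
                raise ValueError(f"Unsupported comparison: {type(op).__name__}")
            right = _eval(comparator)
            if not CMP_OPS[type(op)](left, right):
                return False
            left = right
        return True
    if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
            and node.func.id in FUNCS and not node.keywords):
        return FUNCS[node.func.id](*[_eval(arg) for arg in node.args])
    raise ValueError(f"Unsupported expression element: {type(node).__name__}")


print(safe_eval("2 + 3 * 4"))
print(safe_eval("sqrt(16) > 3"))
```

In this sketch, `__import__('os')` parses as a call to a name that is not in `FUNCS` and is rejected with `ValueError`, while `import os` fails earlier because it is not an expression at all. A production evaluator would likely also bound exponent sizes, which this sketch does not do.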
 ### `code_executor` — Cost: **0.3**
+**What it does:** Runs Python code in a sandboxed `exec()` environment for intended coding tasks. Captures whatever is printed to stdout and returns it as the result.
+**Security model:** Blocks import statements, dangerous builtin names such as `open`, `eval`, `exec`, and `globals`, and obvious object-graph escape paths such as dunder attribute traversal. Only a curated builtin/module surface is exposed.
 **Best for:** HumanEval coding tasks where the agent needs to actually run code to verify correctness.
 **Graceful fallback:** If `TOGETHER_API_KEY` is not set, returns a clear error message instead of crashing. The agent learns to avoid this tool when it's unavailable.
+**Tool routing note:** The environment exposes the canonical tool manifest at `GET /tools`, and tool dispatch normalizes missing-tool and tool-crash cases into explicit `ToolResult` errors. That keeps the OpenEnv-style contract stable even when a backing service is missing.
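
One way the normalization described in the routing note might look, sketched from the inline logic this commit removes from `env/environment.py`. The `ToolResult` dataclass below is a stand-in that mirrors only the fields visible in the removed code, and the body of `dispatch_tool` is an assumption; the real implementation lives in `tools/runtime.py`, which is not shown here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class ToolResult:
    # Stand-in for env.models.ToolResult; field names follow the removed code.
    tool_id: str
    cost: float
    output: str
    latency_s: float = 0.0
    error: Optional[str] = None


def dispatch_tool(
    tool_id: str,
    action: Any,
    tools: Dict[str, Callable[[Any], ToolResult]],
) -> ToolResult:
    """Always return a ToolResult, even if the tool is missing or raises."""
    tool_fn = tools.get(tool_id)
    if tool_fn is None:
        return ToolResult(
            tool_id=tool_id,
            cost=0.0,  # the caller overwrites cost and latency afterwards
            output="[Tool not available in this environment]",
            error="not_available",
        )
    try:
        return tool_fn(action)
    except Exception as exc:
        return ToolResult(
            tool_id=tool_id, cost=0.0, output=f"[Error: {exc}]", error=str(exc)
        )


def crashing_tool(action: Any) -> ToolResult:
    raise RuntimeError("boom")


print(dispatch_tool("wiki_lookup", None, {}).error)
print(dispatch_tool("calculator", None, {"calculator": crashing_tool}).error)
```

Because every path returns a `ToolResult`, the caller can set `cost` and `latency_s` unconditionally, which matches how the updated `step()` uses the return value.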
 ---
 ### `commit` — Cost: **0.0**
 CostAwareToolEnv/
 │
 ├── app.py
+│   The FastAPI web server. Handles /reset, /step, /health, /tools, /web.
 │   Multi-session: each /reset returns a session_id used in /step.
+│   Lazily loads the dataset and exposes the canonical tool manifest.
 │
 ├── openenv.yaml
 │   Deployment spec for the OpenEnv competition framework.
 │       Returns flat List[Dict] with 'domain' key on each item.
 │
 ├── tools/
+│   ├── runtime.py        Tool catalog, validation, and explicit dispatch
+│   ├── __init__.py       build_tool_registry() + tool manifest helpers
 │   ├── ceramic_search.py make_search_tool() factory wrapping CeramicClient
 │   ├── wiki_lookup.py    Wikipedia REST API, first paragraph
+│   ├── calculator.py     Safe AST-based math eval with comparisons
+│   ├── code_executor.py  Sandboxed exec with blocked imports and dunder escapes
 │   ├── llm_reason.py     Together AI API, graceful fallback
 │   └── commit.py         Pass-through; grading is in environment.py
 │

app.py CHANGED

@@ -1,18 +1,20 @@
 """FastAPI server for CostAwareToolEnv.
 Exposes the OpenEnv standard endpoints:
-  POST /reset          → OrchestratorObservation + OrchestratorState
-  POST /step           → OrchestratorObservation + reward + done + state
-  GET  /health         → {"status": "ok"}
-  GET  /web            → simple demo UI
-  GET  /docs           → OpenAPI (automatic)
 """
 from __future__ import annotations
 import os
 import uuid
 from contextlib import asynccontextmanager
-from typing import Any, Dict, Optional
 from fastapi import FastAPI, HTTPException
 from fastapi.responses import HTMLResponse
@@ -22,13 +24,9 @@ from data.loader import load_all
 from env.config import EnvConfig
 from env.environment import CostAwareToolEnvironment
 from env.models import OrchestratorAction
-from tools import build_tool_registry
-# ---------------------------------------------------------------------------
-# Request / response wrappers
-# ---------------------------------------------------------------------------
 class ResetRequest(BaseModel):
     seed: Optional[int] = None
     config_overrides: Optional[Dict[str, Any]] = None
@@ -36,27 +34,140 @@ class ResetRequest(BaseModel):
 class StepRequest(BaseModel):
     tool_id: str
-    query:        Optional[str] = None
-    expression:   Optional[str] = None
     code_snippet: Optional[str] = None
-    answer:       Optional[str] = None
-    metadata:     Optional[Dict[str, Any]] = None
-# ---------------------------------------------------------------------------
-# App factory
-# ---------------------------------------------------------------------------
-def create_app() -> FastAPI:
-    config  = EnvConfig()
-    tools   = build_tool_registry(config)
-    dataset = load_all(split=config.data_split, max_per_domain=200)
-    # Multi-session state: session_id → CostAwareToolEnvironment
     sessions: Dict[str, CostAwareToolEnvironment] = {}
-    # Default shared environment for single-session usage (no session_id)
-    default_env = CostAwareToolEnvironment(config=config, tools=tools, dataset=dataset)
     @asynccontextmanager
     async def lifespan(app: FastAPI):
@@ -74,28 +185,38 @@ def create_app() -> FastAPI:
     def health():
         return {"status": "ok"}
     @app.post("/reset")
     def reset(req: ResetRequest):
-        cfg = EnvConfig()
-        if req.config_overrides:
-            for k, v in req.config_overrides.items():
-                if hasattr(cfg, k):
-                    setattr(cfg, k, v)
-        env = CostAwareToolEnvironment(config=cfg, tools=tools, dataset=dataset)
         obs, state = env.reset(seed=req.seed)
         session_id = str(uuid.uuid4())
         sessions[session_id] = env
         return {
-            "session_id":  session_id,
             "observation": obs.model_dump(),
-            "state":       state.model_dump(),
         }
     @app.post("/step")
     def step(req: StepRequest, session_id: Optional[str] = None):
-        env = sessions.get(session_id or "", default_env)
         action = OrchestratorAction(
             tool_id=req.tool_id,
             query=req.query or "",
@@ -111,98 +232,21 @@ def create_app() -> FastAPI:
         except ValueError as exc:
             raise HTTPException(status_code=422, detail=str(exc))
-        # Clean up finished sessions
         if done and session_id and session_id in sessions:
             del sessions[session_id]
         return {
             "observation": obs.model_dump(),
-            "reward":      reward,
-            "done":        done,
-            "state":       state.model_dump(),
         }
     @app.get("/web", response_class=HTMLResponse)
     def web_ui():
-        return _DEMO_HTML
     return app
 app = create_app()
-# ---------------------------------------------------------------------------
-# Demo UI
-# ---------------------------------------------------------------------------
-_DEMO_HTML = """<!DOCTYPE html>
-<html lang="en">
-<head>
-<meta charset="UTF-8">
-<title>CostAwareToolEnv</title>
-<style>
-  body { font-family: monospace; max-width: 860px; margin: 40px auto; padding: 0 20px; }
-  h1   { color: #333; }
-  pre  { background: #f4f4f4; padding: 12px; border-radius: 6px; overflow-x: auto; }
-  button { padding: 8px 16px; margin: 4px; cursor: pointer; }
-  input, select, textarea { width: 100%; padding: 6px; margin: 4px 0; box-sizing: border-box; }
-  label { font-weight: bold; }
-  .tool-btn { background: #e8f0fe; border: 1px solid #4a90e2; border-radius: 4px; }
-  .tool-btn:hover { background: #cfe1ff; }
-  #log { max-height: 480px; overflow-y: auto; }
-</style>
-</head>
-<body>
-<h1>CostAwareToolEnv</h1>
-<p>Multi-tool cost-aware RL environment — AgentX / OpenEnv</p>
-<button onclick="doReset()">Reset Episode</button>
-<hr>
-<label>Tool:</label>
-<select id="tool">
-  <option value="ceramic_search">ceramic_search (cost 1.0) — Web retrieval</option>
-  <option value="wiki_lookup">wiki_lookup (cost 0.5) — Wikipedia</option>
-  <option value="calculator">calculator (cost 0.1) — Arithmetic / math</option>
-  <option value="code_executor">code_executor (cost 0.3) — Python execution</option>
-  <option value="llm_reason">llm_reason (cost 2.0) — LLM chain-of-thought</option>
-  <option value="commit">commit (cost 0.0) — Submit answer</option>
-</select>
-<label>Query / Expression / Code / Answer:</label>
-<textarea id="query" rows="3" placeholder="Enter query or answer..."></textarea>
-<button class="tool-btn" onclick="doStep()">Step</button>
-<hr>
-<pre id="log">Click "Reset Episode" to start.</pre>
-<script>
-const log = document.getElementById('log');
-let sessionId = null;
-function append(text) { log.textContent += text + '\\n---\\n'; log.scrollTop = log.scrollHeight; }
-async function doReset() {
-  log.textContent = '';
-  const res = await fetch('/reset', { method: 'POST', headers: {'Content-Type':'application/json'}, body: JSON.stringify({seed: 42}) });
-  const data = await res.json();
-  sessionId = data.session_id || null;
-  append('RESET session=' + sessionId + '\\n' + JSON.stringify(data, null, 2));
-}
-async function doStep() {
-  const tool_id = document.getElementById('tool').value;
-  const input   = document.getElementById('query').value;
-  const body    = { tool_id };
-  if (tool_id === 'commit')         body.answer = input;
-  else if (tool_id === 'calculator') body.expression = input;
-  else if (tool_id === 'code_executor') body.code_snippet = input;
-  else                              body.query = input;
-  const url = sessionId ? '/step?session_id=' + encodeURIComponent(sessionId) : '/step';
-  const res = await fetch(url, { method: 'POST', headers: {'Content-Type':'application/json'}, body: JSON.stringify(body) });
-  const data = await res.json();
-  append('STEP tool_id=' + tool_id + '\\n' + JSON.stringify(data, null, 2));
-}
-</script>
-</body>
-</html>
-"""

 """FastAPI server for CostAwareToolEnv.
 Exposes the OpenEnv standard endpoints:
+  POST /reset          -> OrchestratorObservation + OrchestratorState
+  POST /step           -> OrchestratorObservation + reward + done + state
+  GET  /health         -> {"status": "ok"}
+  GET  /tools          -> canonical tool manifest
+  GET  /web            -> simple demo UI
+  GET  /docs           -> OpenAPI (automatic)
 """
 from __future__ import annotations
+import copy
 import os
 import uuid
 from contextlib import asynccontextmanager
+from typing import Any, Callable, Dict, List, Optional
 from fastapi import FastAPI, HTTPException
 from fastapi.responses import HTMLResponse
 from env.config import EnvConfig
 from env.environment import CostAwareToolEnvironment
 from env.models import OrchestratorAction
+from tools import build_tool_catalog, build_tool_registry, catalog_as_dicts, validate_tool_costs
 class ResetRequest(BaseModel):
     seed: Optional[int] = None
     config_overrides: Optional[Dict[str, Any]] = None
 class StepRequest(BaseModel):
     tool_id: str
+    query: Optional[str] = None
+    expression: Optional[str] = None
     code_snippet: Optional[str] = None
+    answer: Optional[str] = None
+    metadata: Optional[Dict[str, Any]] = None
+def _merge_config(base: EnvConfig, overrides: Optional[Dict[str, Any]]) -> EnvConfig:
+    cfg = copy.deepcopy(base)
+    if not overrides:
+        return cfg
+    for key, value in overrides.items():
+        if not hasattr(cfg, key):
+            raise ValueError(f"Unknown config override: {key}")
+        current = getattr(cfg, key)
+        if isinstance(current, dict) and isinstance(value, dict):
+            merged = copy.deepcopy(current)
+            merged.update(value)
+            setattr(cfg, key, merged)
+        else:
+            setattr(cfg, key, value)
+    return cfg
+def _build_demo_html(tool_catalog: List[Any]) -> str:
+    tool_options = "\n".join(
+        f'  <option value="{spec.tool_id}">{spec.label} (cost {spec.cost}) — {spec.purpose}</option>'
+        for spec in tool_catalog
+    )
+    return f"""<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="UTF-8">
+<title>CostAwareToolEnv</title>
+<style>
+  body {{ font-family: monospace; max-width: 860px; margin: 40px auto; padding: 0 20px; }}
+  h1   {{ color: #333; }}
+  pre  {{ background: #f4f4f4; padding: 12px; border-radius: 6px; overflow-x: auto; }}
+  button {{ padding: 8px 16px; margin: 4px; cursor: pointer; }}
+  input, select, textarea {{ width: 100%; padding: 6px; margin: 4px 0; box-sizing: border-box; }}
+  label {{ font-weight: bold; }}
+  .tool-btn {{ background: #e8f0fe; border: 1px solid #4a90e2; border-radius: 4px; }}
+  .tool-btn:hover {{ background: #cfe1ff; }}
+  #log {{ max-height: 480px; overflow-y: auto; }}
+</style>
+</head>
+<body>
+<h1>CostAwareToolEnv</h1>
+<p>Multi-tool cost-aware RL environment with explicit tool routing and sandboxed execution.</p>
+<button onclick="doReset()">Reset Episode</button>
+<hr>
+<label>Tool:</label>
+<select id="tool">
+{tool_options}
+</select>
+<label>Query / Expression / Code / Answer:</label>
+<textarea id="query" rows="3" placeholder="Enter query or answer..."></textarea>
+<button class="tool-btn" onclick="doStep()">Step</button>
+<hr>
+<pre id="log">Click "Reset Episode" to start.</pre>
+<script>
+const log = document.getElementById('log');
+let sessionId = null;
+function append(text) {{ log.textContent += text + '\\n---\\n'; log.scrollTop = log.scrollHeight; }}
+async function doReset() {{
+  log.textContent = '';
+  const res = await fetch('/reset', {{ method: 'POST', headers: {{'Content-Type':'application/json'}}, body: JSON.stringify({{seed: 42}}) }});
+  const data = await res.json();
+  sessionId = data.session_id || null;
+  append('RESET session=' + sessionId + '\\n' + JSON.stringify(data, null, 2));
+}}
+async function doStep() {{
+  const tool_id = document.getElementById('tool').value;
+  const input   = document.getElementById('query').value;
+  const body    = {{ tool_id }};
+  if (tool_id === 'commit')         body.answer = input;
+  else if (tool_id === 'calculator') body.expression = input;
+  else if (tool_id === 'code_executor') body.code_snippet = input;
+  else                              body.query = input;
+  const url = sessionId ? '/step?session_id=' + encodeURIComponent(sessionId) : '/step';
+  const res = await fetch(url, {{ method: 'POST', headers: {{'Content-Type':'application/json'}}, body: JSON.stringify(body) }});
+  const data = await res.json();
+  append('STEP tool_id=' + tool_id + '\\n' + JSON.stringify(data, null, 2));
+}}
+</script>
+</body>
+</html>
+"""
+def create_app(
+    config: Optional[EnvConfig] = None,
+    tools: Optional[Dict[str, Any]] = None,
+    dataset: Optional[List[Dict[str, Any]]] = None,
+    load_dataset_fn: Callable[..., List[Dict[str, Any]]] = load_all,
+    build_registry_fn: Callable[[EnvConfig | None], Dict[str, Any]] = build_tool_registry,
+) -> FastAPI:
+    base_config = config or EnvConfig()
+    validate_tool_costs(base_config)
+    dataset_cache = dataset
+    default_env: Optional[CostAwareToolEnvironment] = None
     sessions: Dict[str, CostAwareToolEnvironment] = {}
+    tool_catalog = build_tool_catalog(base_config)
+    demo_html = _build_demo_html(tool_catalog)
+    def get_dataset() -> List[Dict[str, Any]]:
+        nonlocal dataset_cache
+        if dataset_cache is None:
+            dataset_cache = load_dataset_fn(split=base_config.data_split, max_per_domain=200)
+        return dataset_cache
+    def make_env(effective_config: EnvConfig) -> CostAwareToolEnvironment:
+        registry = tools if tools is not None else build_registry_fn(effective_config)
+        return CostAwareToolEnvironment(
+            config=effective_config,
+            tools=registry,
+            dataset=get_dataset(),
+        )
+    def get_default_env() -> CostAwareToolEnvironment:
+        nonlocal default_env
+        if default_env is None:
+            default_env = make_env(base_config)
+        return default_env
     @asynccontextmanager
     async def lifespan(app: FastAPI):
     def health():
         return {"status": "ok"}
+    @app.get("/tools")
+    def tools_manifest():
+        return catalog_as_dicts(base_config)
     @app.post("/reset")
     def reset(req: ResetRequest):
+        try:
+            cfg = _merge_config(base_config, req.config_overrides)
+        except ValueError as exc:
+            raise HTTPException(status_code=422, detail=str(exc))
+        env = make_env(cfg)
         obs, state = env.reset(seed=req.seed)
         session_id = str(uuid.uuid4())
         sessions[session_id] = env
         return {
+            "session_id": session_id,
             "observation": obs.model_dump(),
+            "state": state.model_dump(),
         }
     @app.post("/step")
     def step(req: StepRequest, session_id: Optional[str] = None):
+        if session_id is None:
+            env = get_default_env()
+        else:
+            env = sessions.get(session_id)
+            if env is None:
+                raise HTTPException(status_code=404, detail="Unknown session_id")
         action = OrchestratorAction(
             tool_id=req.tool_id,
             query=req.query or "",
         except ValueError as exc:
             raise HTTPException(status_code=422, detail=str(exc))
         if done and session_id and session_id in sessions:
             del sessions[session_id]
         return {
             "observation": obs.model_dump(),
+            "reward": reward,
+            "done": done,
+            "state": state.model_dump(),
         }
     @app.get("/web", response_class=HTMLResponse)
     def web_ui():
+        return demo_html
     return app
 app = create_app()
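
To illustrate what the new `_merge_config` helper changes for `/reset` callers, here is a small standalone sketch. `DemoConfig` is a stand-in for `EnvConfig`, and its `tool_costs` field is an assumed name (only `total_budget` is visible in this diff); the merge logic itself is copied from the added lines above.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class DemoConfig:
    # Stand-in for EnvConfig; `tool_costs` is an assumed field name.
    total_budget: float = 10.0
    tool_costs: Dict[str, float] = field(
        default_factory=lambda: {"calculator": 0.1, "code_executor": 0.3}
    )


def merge_config(base: DemoConfig, overrides: Optional[Dict[str, Any]]) -> DemoConfig:
    """Same logic as `_merge_config` in the diff, applied to DemoConfig."""
    cfg = copy.deepcopy(base)  # never mutate the shared base config
    if not overrides:
        return cfg
    for key, value in overrides.items():
        if not hasattr(cfg, key):
            raise ValueError(f"Unknown config override: {key}")
        current = getattr(cfg, key)
        if isinstance(current, dict) and isinstance(value, dict):
            # Dict-valued fields are merged one level deep, not replaced.
            merged = copy.deepcopy(current)
            merged.update(value)
            setattr(cfg, key, merged)
        else:
            setattr(cfg, key, value)
    return cfg


base = DemoConfig()
merged = merge_config(base, {"tool_costs": {"calculator": 0.2}})
print(merged.tool_costs)
print(base.tool_costs)
```

Compared with the removed loop, which silently skipped unknown keys and replaced dict fields wholesale, this version rejects unknown keys (surfaced as HTTP 422 by `/reset`) and keeps the entries of a dict field that the override does not mention.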

client.py CHANGED

@@ -1,123 +1,4 @@
-"""Compatibility shim — real code lives in ceramic/client.py.
-Mirrors the SearchEconomicsEnv CeramicClient interface so the two
-environments share the same retrieval backend.
-Priority for the API key:
-  1. CERAMIC_API_KEY env var
-  2. SEE_CERAMIC_API_KEY env var (HF Spaces compatibility)
-  3. Falls back to FallbackCeramicClient (offline, deterministic)
-"""
 from __future__ import annotations
-import hashlib
-import os
-import time
-from dataclasses import dataclass, field
-from typing import List, Optional
-import httpx
-# ---------------------------------------------------------------------------
-# Result model
-# ---------------------------------------------------------------------------
-@dataclass
-class SearchResult:
-    title: str
-    url: str
-    description: str
-    score: float = 0.0
-# ---------------------------------------------------------------------------
-# Live client
-# ---------------------------------------------------------------------------
-class CeramicClient:
-    """Thin wrapper around the Ceramic search API."""
-    BASE_URL = "https://api.ceramic.ai/v1"
-    def __init__(self, api_key: str):
-        self._key = api_key
-        self._client = httpx.Client(
-            headers={"Authorization": f"Bearer {api_key}"},
-            timeout=10.0,
-        )
-    def search(self, query: str, top_k: int = 5) -> List[SearchResult]:
-        resp = self._client.post(
-            f"{self.BASE_URL}/search",
-            json={"query": query, "top_k": top_k},
-        )
-        resp.raise_for_status()
-        data = resp.json()
-        results = []
-        for item in data.get("results", []):
-            results.append(SearchResult(
-                title=item.get("title", ""),
-                url=item.get("url", ""),
-                description=item.get("description", ""),
-                score=float(item.get("score", 0.0)),
-            ))
-        return results
-    def close(self):
-        self._client.close()
-    def __enter__(self):
-        return self
-    def __exit__(self, *args):
-        self.close()
-# ---------------------------------------------------------------------------
-# Offline fallback
-# ---------------------------------------------------------------------------
-class FallbackCeramicClient:
-    """Deterministic offline client — used when no API key is set."""
-    def search(self, query: str, top_k: int = 5) -> List[SearchResult]:
-        # Stable hash → reproducible fake results per query
-        h = int(hashlib.sha256(query.encode()).hexdigest(), 16)
-        results = []
-        for i in range(min(top_k, 3)):
-            seed = (h + i) % 10_000
-            results.append(SearchResult(
-                title=f"Result {seed}: {query[:40]}",
-                url=f"https://fallback.example.com/doc/{seed}",
-                description=f"Offline fallback result #{i+1} for query: {query}",
-                score=round(0.9 - i * 0.15, 3),
-            ))
-        return results
-    def close(self):
-        pass
-    def __enter__(self):
-        return self
-    def __exit__(self, *args):
-        pass
-# ---------------------------------------------------------------------------
-# Factory
-# ---------------------------------------------------------------------------
-_DEFAULT_KEY = "cer_sk_live_543fe74e79df_eyJvcmdfaWQiOiJvcmdfMDFLTlpINkU5RVNDTUowUUoyREpINFZWWEYiLCJrZXlfaWQiOiI1NDNmZTc0ZTc5ZGYifQ.k8I4Aljsk29y4Uki37Wxfd7QZHs40XSJVNBNnfksCtM"
-def get_ceramic_client() -> CeramicClient | FallbackCeramicClient:
-    key = (
-        os.environ.get("CERAMIC_API_KEY")
-        or os.environ.get("SEE_CERAMIC_API_KEY")
-        or _DEFAULT_KEY
-    )
-    if key:
-        return CeramicClient(api_key=key)
-    return FallbackCeramicClient()

+"""Compatibility shim for the legacy top-level Ceramic client import path."""
 from __future__ import annotations
+from ceramic.client import *  # noqa: F401,F403

env/environment.py CHANGED

@@ -24,6 +24,7 @@ from .models import (
     TOOL_IDS,
 )
 from .reward import commit_reward, step_reward
 class CostAwareToolEnvironment:
@@ -95,6 +96,8 @@ class CostAwareToolEnvironment:
     ) -> Tuple[OrchestratorObservation, float, bool, OrchestratorState]:
         if self._episode_done:
             raise RuntimeError("Episode is done. Call reset() first.")
         tool_id = action.tool_id
         if tool_id not in TOOL_IDS:
@@ -160,30 +163,17 @@ class CostAwareToolEnvironment:
         if budget_after > state.total_budget:
             r = config.incorrect_reward
             self._episode_done = True
-            obs = self._make_obs(reward=r, question_done=True, done=True)
             return obs, r, True, state
         t0 = time.perf_counter()
-        tool_fn = self.tools.get(tool_id)
-        if tool_fn is None:
-            tool_result = ToolResult(
-                tool_id=tool_id, cost=cost,
-                output="[Tool not available in this environment]",
-                latency_s=0.0,
-                error="not_available",
-            )
-        else:
-            try:
-                tool_result = tool_fn(action)
-                tool_result.cost = cost
-            except Exception as exc:
-                tool_result = ToolResult(
-                    tool_id=tool_id, cost=cost,
-                    output=f"[Error: {exc}]",
-                    latency_s=time.perf_counter() - t0,
-                    error=str(exc),
-                )
         tool_result.latency_s = time.perf_counter() - t0
         state.budget_spent = budget_after
         state.budget_remaining_ratio = max(
@@ -234,6 +224,8 @@ class CostAwareToolEnvironment:
     ) -> OrchestratorObservation:
         state = self._state
         cfg   = self.config
         if 0 <= self._current_q_idx < len(self._questions):
             q_entry = self._questions[self._current_q_idx]

     TOOL_IDS,
 )
 from .reward import commit_reward, step_reward
+from tools import dispatch_tool
 class CostAwareToolEnvironment:
     ) -> Tuple[OrchestratorObservation, float, bool, OrchestratorState]:
         if self._episode_done:
             raise RuntimeError("Episode is done. Call reset() first.")
+        if self._state is None:
+            raise RuntimeError("Call reset() first.")
         tool_id = action.tool_id
         if tool_id not in TOOL_IDS:
         if budget_after > state.total_budget:
             r = config.incorrect_reward
             self._episode_done = True
+            obs = self._make_obs(
+                reward=r,
+                question_done=True,
+                done=True,
+            )
             return obs, r, True, state
         t0 = time.perf_counter()
+        tool_result = dispatch_tool(tool_id, action, self.tools)
         tool_result.latency_s = time.perf_counter() - t0
+        tool_result.cost = cost
         state.budget_spent = budget_after
         state.budget_remaining_ratio = max(
     ) -> OrchestratorObservation:
         state = self._state
         cfg   = self.config
+        if state is None:
+            raise RuntimeError("Call reset() first.")
         if 0 <= self._current_q_idx < len(self._questions):
             q_entry = self._questions[self._current_q_idx]

environment.py CHANGED

@@ -1,307 +1,9 @@
-"""Compatibility shim — real code lives in env/environment.py.
-Step logic:
-  - Agent receives an OrchestratorObservation with the current question,
-    budget, context, and available tools.
-  - Agent picks a tool_id and optional query / code_snippet / answer.
-  - Environment dispatches to the appropriate tool, charges cost, appends
-    result to context_window, and returns the next observation + reward.
-  - Episode ends when budget is exhausted OR all questions are answered.
 """
 from __future__ import annotations
-import time
-import uuid
-from typing import Any, Dict, List, Optional, Tuple
-from env.answer_grading import grade
-from env.config import EnvConfig
-from env.models import (
-    OrchestratorAction,
-    OrchestratorObservation,
-    OrchestratorState,
-    ToolResult,
-    TOOL_IDS,
-)
-from env.reward import commit_reward, step_reward
-class CostAwareToolEnvironment:
-    """
-    OpenEnv-compatible RL environment for multi-tool cost-aware QA.
-    Supports external tool injection so the server can wire in live
-    Ceramic, code executor, etc.  Tools are callables with signature:
-        tool_fn(action: OrchestratorAction) -> ToolResult
-    """
-    def __init__(
-        self,
-        config: Optional[EnvConfig] = None,
-        tools: Optional[Dict[str, Any]] = None,
-        dataset: Optional[List[Dict[str, Any]]] = None,
-    ):
-        self.config  = config or EnvConfig()
-        self.tools   = tools or {}      # tool_id -> callable
-        self.dataset = dataset or []    # List of {question, answer, domain}
-        self._state: Optional[OrchestratorState] = None
-        self._questions: List[Dict[str, Any]] = []
-        self._current_q_idx: int = 0
-        self._context_window: List[str] = []
-        self._tools_used_this_q: List[str] = []
-        self._steps_this_q: int = 0
-        self._episode_done: bool = False
-    # -----------------------------------------------------------------------
-    # Reset
-    # -----------------------------------------------------------------------
-    def reset(self, seed: Optional[int] = None) -> Tuple[OrchestratorObservation, OrchestratorState]:
-        import random
-        rng = random.Random(seed if seed is not None else self.config.seed)
-        # Sample questions according to domain_mix
-        questions = _sample_questions(self.dataset, self.config, rng)
-        self._questions = questions
-        self._current_q_idx = 0
-        self._episode_done = False
-        self._state = OrchestratorState(
-            episode_id=str(uuid.uuid4()),
-            total_budget=self.config.total_budget,
-            budget_spent=0,
-            questions_answered=0,
-            total_correct=0,
-            current_accuracy=0.0,
-            budget_remaining_ratio=1.0,
-            current_question_idx=0,
-            current_question_steps=0,
-        )
-        self._context_window = []
-        self._tools_used_this_q = []
-        self._steps_this_q = 0
-        obs = self._make_obs(reward=None, question_done=False, done=False)
-        return obs, self._state
-    # -----------------------------------------------------------------------
-    # Step
-    # -----------------------------------------------------------------------
-    def step(
-        self, action: OrchestratorAction
-    ) -> Tuple[OrchestratorObservation, float, bool, OrchestratorState]:
-        if self._episode_done:
-            raise RuntimeError("Episode is done. Call reset() first.")
-        tool_id = action.tool_id
-        if tool_id not in TOOL_IDS:
-            raise ValueError(f"Unknown tool_id: {tool_id!r}. Valid: {TOOL_IDS}")
-        state   = self._state
-        config  = self.config
-        # Guard against exhausted question list (can happen after last commit)
-        if self._current_q_idx >= len(self._questions):
-            self._episode_done = True
-            raise RuntimeError("Episode is done. Call reset() first.")
-        q_entry = self._questions[self._current_q_idx]
-        gold    = q_entry["answer"]
-        # ---- Commit ---------------------------------------------------
-        if tool_id == "commit":
-            raw_pred = action.answer or ""
-            em, f1, quality = grade(raw_pred, gold)
-            r = commit_reward(
-                quality=quality,
-                budget_remaining_ratio=state.budget_remaining_ratio,
-                config=config,
-            )
-            # Count correct
-            is_correct = (
-                em if config.grade_count_correct_mode == "em_only"
-                else (em or f1 >= config.f1_count_threshold)
-            )
-            state.questions_answered += 1
-            state.total_correct += int(is_correct)
-            state.current_accuracy = state.total_correct / state.questions_answered
-            # Advance to next question or end episode
-            self._current_q_idx += 1
-            self._context_window = []
-            self._tools_used_this_q = []
-            self._steps_this_q = 0
-            episode_done = (
-                self._current_q_idx >= len(self._questions)
-                or state.budget_spent >= state.total_budget
-            )
-            self._episode_done = episode_done
-            state.current_question_idx = self._current_q_idx
-            state.current_question_steps = 0
-            obs = self._make_obs(
-                reward=r,
-                question_done=True,
-                done=episode_done,
-                last_tool_result=ToolResult(
-                    tool_id="commit", cost=0,
-                    output=f"EM={em} F1={f1:.3f} quality={quality:.3f}"
-                ),
-            )
-            return obs, r, episode_done, state
-        # ---- Tool call ------------------------------------------------
-        cost = config.tool_costs.get(tool_id, 0)
-        budget_after = state.budget_spent + cost
-        # If over budget, force commit penalty
-        if budget_after > state.total_budget:
-            r = config.incorrect_reward
-            self._episode_done = True
-            obs = self._make_obs(reward=r, question_done=True, done=True)
-            return obs, r, True, state
-        # Dispatch tool
-        t0 = time.perf_counter()
-        tool_fn = self.tools.get(tool_id)
-        if tool_fn is None:
-            tool_result = ToolResult(
-                tool_id=tool_id, cost=cost,
-                output="[Tool not available in this environment]",
-                latency_s=0.0,
-                error="not_available",
-            )
-        else:
-            try:
-                tool_result = tool_fn(action)
-                tool_result.cost = cost
-            except Exception as exc:
-                tool_result = ToolResult(
-                    tool_id=tool_id, cost=cost,
-                    output=f"[Error: {exc}]",
-                    latency_s=time.perf_counter() - t0,
-                    error=str(exc),
-                )
-        tool_result.latency_s = time.perf_counter() - t0
-        # Charge cost and update state
-        state.budget_spent = budget_after
-        state.budget_remaining_ratio = max(
-            0.0, (state.total_budget - state.budget_spent) / state.total_budget
-        )
-        state.step_count += 1
-        state.current_question_steps += 1
-        self._steps_this_q += 1
-        self._tools_used_this_q.append(tool_id)
-        self._context_window.append(f"[{tool_id}] {tool_result.output}")
-        r = step_reward(tool_id, config)
-        # Auto-commit if max steps reached
-        question_done = self._steps_this_q >= config.max_steps_per_question
-        episode_done  = (
-            state.budget_spent >= state.total_budget
-            or (question_done and self._current_q_idx + 1 >= len(self._questions))
-        )
-        if question_done and not episode_done:
-            self._current_q_idx += 1
-            state.questions_answered += 1
-            self._context_window = []
-            self._tools_used_this_q = []
-            self._steps_this_q = 0
-            state.current_question_idx = self._current_q_idx
-            state.current_question_steps = 0
-        self._episode_done = episode_done
-        obs = self._make_obs(
-            reward=r,
-            question_done=question_done,
-            done=episode_done,
-            last_tool_result=tool_result,
-        )
-        return obs, r, episode_done, state
-    # -----------------------------------------------------------------------
-    # Internal helpers
-    # -----------------------------------------------------------------------
-    def _make_obs(
-        self,
-        reward: Optional[float],
-        question_done: bool,
-        done: bool,
-        last_tool_result: Optional[ToolResult] = None,
-    ) -> OrchestratorObservation:
-        state = self._state
-        cfg   = self.config
-        if 0 <= self._current_q_idx < len(self._questions):
-            q_entry = self._questions[self._current_q_idx]
-        elif self._questions:
-            # Episode finished — repeat last question info (obs is terminal anyway)
-            q_entry = self._questions[-1]
-        else:
-            q_entry = {"question": "", "answer": "", "domain": ""}
-        return OrchestratorObservation(
-            question=q_entry.get("question", ""),
-            question_idx=self._current_q_idx,
-            domain=q_entry.get("domain", ""),
-            question_embedding=[],          # populated by server if needed
-            total_budget=cfg.total_budget,
-            budget_spent=state.budget_spent,
-            budget_remaining=state.total_budget - state.budget_spent,
-            budget_remaining_ratio=state.budget_remaining_ratio,
-            tools_used_this_question=list(self._tools_used_this_q),
-            steps_used_this_question=self._steps_this_q,
-            max_steps_per_question=cfg.max_steps_per_question,
-            last_tool_result=last_tool_result,
-            context_window=list(self._context_window),
-            step_idx=state.step_count,
-            questions_remaining=max(0, len(self._questions) - self._current_q_idx - 1),
-            questions_answered=state.questions_answered,
-            accuracy_so_far=state.current_accuracy,
-            question_done=question_done,
-            done=done,
-            reward=reward,
-        )
-# ---------------------------------------------------------------------------
-# Dataset sampling helper
-# ---------------------------------------------------------------------------
-def _sample_questions(
-    dataset: List[Dict[str, Any]],
-    config: EnvConfig,
-    rng: Any,
-) -> List[Dict[str, Any]]:
-    """Sample `config.num_questions` questions according to domain_mix."""
-    by_domain: Dict[str, List[Dict]] = {}
-    for item in dataset:
-        d = item.get("domain", "hotpotqa")
-        by_domain.setdefault(d, []).append(item)
-    selected = []
-    for domain, frac in config.domain_mix.items():
-        n = round(config.num_questions * frac)
-        pool = by_domain.get(domain, [])
-        if pool and n > 0:
-            selected.extend(rng.sample(pool, min(n, len(pool))))
-    # Guarantee at least num_questions items by filling from the full dataset
-    if len(selected) < config.num_questions and dataset:
-        remaining = [d for d in dataset if d not in selected]
-        rng.shuffle(remaining)
-        selected.extend(remaining[: config.num_questions - len(selected)])
-    if config.shuffle_questions:
-        rng.shuffle(selected)
-    return selected[: config.num_questions]

+"""Compatibility shim for the legacy top-level import path.
+The real environment implementation lives in :mod:`env.environment`.
+This module stays intentionally thin so the two orchestrator entrypoints
+cannot drift apart again.
 """
 from __future__ import annotations
+from env.environment import *  # noqa: F401,F403
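The sketch below illustrates why a star-import shim cannot drift from the real module: both paths resolve to the same objects. The module names `real_env` and `legacy_env` are hypothetical stand-ins built at runtime, not files in this repository. It also shows one caveat worth checking: unless the real module defines `__all__`, underscore-prefixed helpers (such as the old `_sample_questions`) are not re-exported.

```python
import sys
import types

# Hypothetical stand-in for the real implementation module.
real = types.ModuleType("real_env")
exec(
    "class CostAwareToolEnvironment:\n"
    "    pass\n"
    "def _private_helper():\n"
    "    pass\n",
    real.__dict__,
)
sys.modules["real_env"] = real

# Hypothetical stand-in for the thin top-level shim.
shim = types.ModuleType("legacy_env")
exec("from real_env import *", shim.__dict__)
sys.modules["legacy_env"] = shim

# Both import paths expose the very same class object.
same_class = shim.CostAwareToolEnvironment is real.CostAwareToolEnvironment
# Names starting with an underscore are skipped when __all__ is absent.
private_exported = hasattr(shim, "_private_helper")
print(same_class, private_exported)
```

If legacy callers import private helpers from the top-level path, the real module would need to list them in `__all__`, or the shim would need explicit imports for them.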

requirements.txt CHANGED

@@ -6,3 +6,4 @@ datasets>=2.18.0
 httpx>=0.27.0
 requests>=2.31.0
 together>=1.2.0
+pytest>=8.0.0

tests/conftest.py ADDED

@@ -0,0 +1,45 @@

+from __future__ import annotations
+import sys
+from pathlib import Path
+import pytest
+from fastapi.testclient import TestClient
+sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
+from app import create_app
+from env.config import EnvConfig
+from env.models import TOOL_IDS, OrchestratorAction, ToolResult
+def make_stub_registry():
+    def make_tool(tool_id: str):
+        def _tool(action: OrchestratorAction) -> ToolResult:
+            payload = action.query or action.expression or action.code_snippet or action.answer or ""
+            return ToolResult(tool_id=tool_id, output=f"{tool_id}:{payload}".rstrip(":"))
+        return _tool
+    return {tool_id: make_tool(tool_id) for tool_id in TOOL_IDS}
+@pytest.fixture()
+def sample_dataset():
+    return [
+        {"question": "What is 2 + 2?", "answer": "4", "domain": "math"},
+        {"question": "What is 3 + 1?", "answer": "4", "domain": "math"},
+    ]
+@pytest.fixture()
+def app_client(sample_dataset):
+    cfg = EnvConfig(
+        num_questions=2,
+        total_budget=5.0,
+        max_steps_per_question=4,
+        shuffle_questions=False,
+        domain_mix={"math": 1.0},
+    )
+    app = create_app(config=cfg, tools=make_stub_registry(), dataset=sample_dataset)
+    return TestClient(app)

tests/test_app.py ADDED

@@ -0,0 +1,82 @@

+from __future__ import annotations
+def test_reset_returns_clean_session(app_client):
+    response = app_client.post("/reset", json={"seed": 42})
+    assert response.status_code == 200
+    payload = response.json()
+    assert payload["session_id"]
+    observation = payload["observation"]
+    state = payload["state"]
+    assert observation["budget_spent"] == 0
+    assert observation["question_idx"] == 0
+    assert observation["done"] is False
+    assert observation["tools_used_this_question"] == []
+    assert state["budget_spent"] == 0
+    assert state["current_question_idx"] == 0
+    assert state["step_count"] == 0
+def test_step_charges_the_right_cost(app_client):
+    reset = app_client.post("/reset", json={"seed": 42})
+    session_id = reset.json()["session_id"]
+    response = app_client.post(
+        "/step",
+        params={"session_id": session_id},
+        json={"tool_id": "calculator", "expression": "2 + 2"},
+    )
+    assert response.status_code == 200
+    payload = response.json()
+    assert payload["reward"] == -0.1
+    assert payload["state"]["budget_spent"] == 0.1
+    assert payload["observation"]["last_tool_result"]["tool_id"] == "calculator"
+    assert payload["observation"]["last_tool_result"]["cost"] == 0.1
+def test_commit_advances_question_state_correctly(app_client):
+    reset = app_client.post("/reset", json={"seed": 42})
+    session_id = reset.json()["session_id"]
+    response = app_client.post(
+        "/step",
+        params={"session_id": session_id},
+        json={"tool_id": "commit", "answer": "4"},
+    )
+    assert response.status_code == 200
+    payload = response.json()
+    assert payload["done"] is False
+    assert payload["state"]["questions_answered"] == 1
+    assert payload["state"]["current_question_idx"] == 1
+    assert payload["observation"]["question_done"] is True
+def test_episode_termination_and_cleanup_behave_as_expected(app_client):
+    reset = app_client.post("/reset", json={"seed": 42})
+    session_id = reset.json()["session_id"]
+    first = app_client.post(
+        "/step",
+        params={"session_id": session_id},
+        json={"tool_id": "commit", "answer": "4"},
+    )
+    assert first.status_code == 200
+    second = app_client.post(
+        "/step",
+        params={"session_id": session_id},
+        json={"tool_id": "commit", "answer": "4"},
+    )
+    assert second.status_code == 200
+    assert second.json()["done"] is True
+    follow_up = app_client.post(
+        "/step",
+        params={"session_id": session_id},
+        json={"tool_id": "commit", "answer": "4"},
+    )
+    assert follow_up.status_code == 404
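The last test expects a 404 once the episode has finished, which suggests the server discards a session when `done` becomes true. `app.py` is not shown in this part of the diff, so the following is only a sketch of one way that behavior could arise. `SessionStore` and `TwoQuestionEnv` are hypothetical names, and plain status integers stand in for HTTP responses.

```python
import uuid
from typing import Any, Callable, Dict, Optional, Tuple


class SessionStore:
    """Hypothetical in-memory session store (illustrative only)."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Any] = {}

    def create(self, env: Any) -> str:
        session_id = str(uuid.uuid4())
        self._sessions[session_id] = env
        return session_id

    def step(
        self, session_id: str, do_step: Callable[[Any], bool]
    ) -> Tuple[int, Optional[bool]]:
        env = self._sessions.get(session_id)
        if env is None:
            return 404, None  # unknown or already cleaned-up session
        done = do_step(env)
        if done:
            # Drop the finished episode so later calls see a missing session.
            del self._sessions[session_id]
        return 200, done


class TwoQuestionEnv:
    """Stand-in environment that finishes after two commits."""

    def __init__(self) -> None:
        self.commits = 0

    def commit(self) -> bool:
        self.commits += 1
        return self.commits >= 2


store = SessionStore()
sid = store.create(TwoQuestionEnv())
statuses = [store.step(sid, lambda env: env.commit()) for _ in range(3)]
print(statuses)
```

The three calls mirror the test: the first commit continues, the second ends the episode, and the third finds no session.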

tests/test_code_executor.py ADDED

@@ -0,0 +1,45 @@

+from __future__ import annotations
+from env.models import OrchestratorAction
+from tools.code_executor import code_executor_tool
+def test_code_executor_empty_input():
+    result = code_executor_tool(OrchestratorAction(tool_id="code_executor"))
+    assert result.error == "empty"
+    assert "No code provided" in result.output
+def test_code_executor_runtime_error():
+    result = code_executor_tool(
+        OrchestratorAction(tool_id="code_executor", code_snippet="print(1 / 0)")
+    )
+    assert result.error is not None
+    assert "division by zero" in result.output.lower()
+def test_code_executor_blocks_imports():
+    result = code_executor_tool(
+        OrchestratorAction(tool_id="code_executor", code_snippet="import os\nprint('hi')")
+    )
+    assert result.error == "sandbox_violation"
+    assert "import statements are blocked" in result.output
+def test_code_executor_blocks_unsafe_builtins():
+    result = code_executor_tool(
+        OrchestratorAction(tool_id="code_executor", code_snippet="open('tmp.txt', 'w')")
+    )
+    assert result.error == "sandbox_violation"
+    assert "name 'open' is blocked" in result.output
+def test_code_executor_blocks_escape_attempts():
+    result = code_executor_tool(
+        OrchestratorAction(
+            tool_id="code_executor",
+            code_snippet="().__class__.__mro__[1].__subclasses__()",
+        )
+    )
+    assert result.error == "sandbox_violation"
+    assert "__class__" in result.output or "__subclasses__" in result.output
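The escape-attempt test accepts either `__class__` or `__subclasses__` in the error message. A short sketch, using only the standard `ast` module, shows why: the snippet contains several dunder attribute nodes, and the `ast.walk` documentation does not specify a traversal order, so a validator that raises on the first match may report any of them.

```python
import ast

snippet = "().__class__.__mro__[1].__subclasses__()"
tree = ast.parse(snippet, mode="exec")

# Collect every attribute access whose name starts with a double underscore.
dunder_attrs = sorted(
    node.attr
    for node in ast.walk(tree)
    if isinstance(node, ast.Attribute) and node.attr.startswith("__")
)
print(dunder_attrs)
```

Sorting makes the result independent of traversal order; the validator in the diff raises on whichever node it meets first.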

tests/test_tools.py ADDED

@@ -0,0 +1,64 @@

+from __future__ import annotations
+from urllib.error import HTTPError
+from env.models import OrchestratorAction
+from tools.calculator import calculator_tool
+from tools.code_executor import code_executor_tool
+from tools.commit import commit_tool
+from tools.llm_reason import llm_reason_tool
+from tools.ceramic_search import make_search_tool
+from tools.wiki_lookup import wiki_lookup_tool
+def test_calculator_happy_path():
+    result = calculator_tool(OrchestratorAction(tool_id="calculator", expression="2 + 2 * 3"))
+    assert result.output == "8"
+    assert result.error is None
+def test_calculator_invalid_input():
+    result = calculator_tool(OrchestratorAction(tool_id="calculator", expression="open('x')"))
+    assert result.error is not None
+    assert result.output.startswith("[Calc error:")
+def test_search_fallback_is_deterministic(monkeypatch):
+    monkeypatch.delenv("CERAMIC_API_KEY", raising=False)
+    monkeypatch.delenv("SEE_CERAMIC_API_KEY", raising=False)
+    tool = make_search_tool()
+    action = OrchestratorAction(tool_id="ceramic_search", query="Eiffel Tower")
+    first = tool(action)
+    second = tool(action)
+    assert first.error is None
+    assert first.output == second.output
+    assert "Eiffel Tower" in first.output
+def test_wiki_lookup_not_found(monkeypatch):
+    def fake_urlopen(*args, **kwargs):
+        raise HTTPError(url="https://example.com", code=404, msg="not found", hdrs=None, fp=None)
+    monkeypatch.setattr("urllib.request.urlopen", fake_urlopen)
+    result = wiki_lookup_tool(OrchestratorAction(tool_id="wiki_lookup", query="Definitely Not A Real Page"))
+    assert result.error == "not_found"
+    assert "no article found" in result.output.lower()
+def test_llm_reason_no_api_key(monkeypatch):
+    monkeypatch.delenv("TOGETHER_API_KEY", raising=False)
+    monkeypatch.delenv("TOGETHER_KEY", raising=False)
+    result = llm_reason_tool(OrchestratorAction(tool_id="llm_reason", query="Explain gravity"))
+    assert result.error == "no_api_key"
+    assert "not configured" in result.output.lower()
+def test_commit_passthrough_behavior():
+    result = commit_tool(OrchestratorAction(tool_id="commit", answer="  1889  "))
+    assert result.tool_id == "commit"
+    assert result.output == "Committed answer: 1889"

tools/__init__.py CHANGED

@@ -1,29 +1,20 @@
-"""Tool registry for CostAwareToolEnv.
-Each tool is a callable: (action: OrchestratorAction) -> ToolResult
-"""
 from __future__ import annotations
-from typing import Callable, Dict
-from env.config import EnvConfig
-from env.models import OrchestratorAction, ToolResult
-from .calculator import calculator_tool
-from .ceramic_search import make_search_tool
-from .code_executor import code_executor_tool
-from .commit import commit_tool
-from .llm_reason import llm_reason_tool
-from .wiki_lookup import wiki_lookup_tool
-def build_tool_registry(config: EnvConfig | None = None) -> Dict[str, Callable]:
-    """Return a mapping of tool_id → tool function."""
-    return {
-        "ceramic_search": make_search_tool(),
-        "calculator":     calculator_tool,
-        "wiki_lookup":    wiki_lookup_tool,
-        "code_executor":  code_executor_tool,
-        "llm_reason":     llm_reason_tool,
-        "commit":         commit_tool,
-    }

+"""Public tool layer for CostAwareToolEnv."""
 from __future__ import annotations
+from .runtime import (
+    ToolSpec,
+    build_tool_catalog,
+    build_tool_registry,
+    catalog_as_dicts,
+    dispatch_tool,
+    validate_tool_costs,
+)
+__all__ = [
+    "ToolSpec",
+    "build_tool_catalog",
+    "build_tool_registry",
+    "catalog_as_dicts",
+    "dispatch_tool",
+    "validate_tool_costs",
+]

tools/calculator.py CHANGED

@@ -24,6 +24,15 @@ _SAFE_OPS = {
     ast.UAdd: operator.pos,
 }
+_SAFE_COMPARISONS = {
+    ast.Eq: operator.eq,
+    ast.NotEq: operator.ne,
+    ast.Lt: operator.lt,
+    ast.LtE: operator.le,
+    ast.Gt: operator.gt,
+    ast.GtE: operator.ge,
+}
 _SAFE_FUNCS: dict[str, Any] = {
     "abs": abs, "round": round, "min": min, "max": max,
     "sqrt": math.sqrt, "log": math.log, "log2": math.log2,
@@ -53,6 +62,17 @@ def _safe_eval(node: ast.AST) -> Any:
         if op_type not in _SAFE_OPS:
             raise ValueError(f"Unsupported unary: {op_type.__name__}")
         return _SAFE_OPS[op_type](_safe_eval(node.operand))
+    if isinstance(node, ast.Compare):
+        left = _safe_eval(node.left)
+        for op, comparator in zip(node.ops, node.comparators):
+            op_type = type(op)
+            if op_type not in _SAFE_COMPARISONS:
+                raise ValueError(f"Unsupported comparison: {op_type.__name__}")
+            right = _safe_eval(comparator)
+            if not _SAFE_COMPARISONS[op_type](left, right):
+                return False
+            left = right
+        return True
     if isinstance(node, ast.Call):
         func = _safe_eval(node.func)
         if not callable(func):
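The new `ast.Compare` branch evaluates chained comparisons pairwise, the way Python does: `a < b < c` means `a < b and b < c`, with short-circuiting. The following standalone sketch mirrors that loop. It is not the project's `_safe_eval`; `eval_compare` is a hypothetical helper that uses `ast.literal_eval` for the operands so it runs on its own.

```python
import ast
import operator

COMPARISONS = {
    ast.Eq: operator.eq,
    ast.NotEq: operator.ne,
    ast.Lt: operator.lt,
    ast.LtE: operator.le,
    ast.Gt: operator.gt,
    ast.GtE: operator.ge,
}


def eval_compare(expression: str) -> bool:
    """Evaluate a (possibly chained) comparison over literal operands."""
    node = ast.parse(expression, mode="eval").body
    if not isinstance(node, ast.Compare):
        raise ValueError("expected a comparison expression")
    left = ast.literal_eval(node.left)
    for op, comparator in zip(node.ops, node.comparators):
        fn = COMPARISONS.get(type(op))
        if fn is None:
            raise ValueError(f"Unsupported comparison: {type(op).__name__}")
        right = ast.literal_eval(comparator)
        if not fn(left, right):
            return False  # short-circuit, like Python's chained comparison
        left = right  # the right operand becomes the next left operand
    return True


print(eval_compare("1 < 2 < 3"), eval_compare("1 < 3 < 2"))
```

Operators outside the allow-list, such as `in` or `is`, are rejected instead of evaluated, which keeps the calculator's surface small.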

tools/code_executor.py CHANGED

@@ -1,35 +1,169 @@
 """Restricted Python code executor.
-Runs code in a sandboxed namespace — blocks os/sys/subprocess imports
-and captures stdout. Intended for math / algorithmic tasks.
 """
 from __future__ import annotations
-import io
-import sys
 import contextlib
 from env.models import OrchestratorAction, ToolResult
-_BLOCKED_MODULES = frozenset({
-    "os", "sys", "subprocess", "socket", "shutil", "pathlib",
-    "importlib", "builtins", "ctypes", "multiprocessing", "threading",
-    "signal", "pty", "fcntl", "resource", "gc", "inspect",
-})
 _MAX_OUTPUT_CHARS = 2000
-class _BlockedImport:
-    """Raise on any import of blocked modules."""
-    def __init__(self, original_import):
-        self._orig = original_import
-    def __call__(self, name, *args, **kwargs):
-        base = name.split(".")[0]
-        if base in _BLOCKED_MODULES:
-            raise ImportError(f"Module '{name}' is not allowed in code_executor")
-        return self._orig(name, *args, **kwargs)
 def code_executor_tool(action: OrchestratorAction) -> ToolResult:
@@ -38,26 +172,21 @@ def code_executor_tool(action: OrchestratorAction) -> ToolResult:
         return ToolResult(tool_id="code_executor", output="[No code provided]", error="empty")
     stdout_buf = io.StringIO()
-    safe_globals = {
-        "__builtins__": {
-            k: v for k, v in __builtins__.items()  # type: ignore[union-attr]
-            if k not in ("open", "exec", "eval", "compile", "__import__")
-        } if isinstance(__builtins__, dict) else {
-            k: getattr(__builtins__, k) for k in dir(__builtins__)
-            if k not in ("open", "exec", "eval", "compile", "__import__")
-        },
-        "__import__": _BlockedImport(__import__),
-        "print": lambda *a, **kw: print(*a, **kw, file=stdout_buf),
     }
     try:
         with contextlib.redirect_stdout(stdout_buf):
-            exec(compile(code, "<code_executor>", "exec"), safe_globals)  # noqa: S102
         output = stdout_buf.getvalue()[:_MAX_OUTPUT_CHARS] or "[Code ran, no output]"
         return ToolResult(tool_id="code_executor", output=output)
     except Exception as exc:
-        return ToolResult(
-            tool_id="code_executor",
-            output=f"[Execution error: {exc}]",
-            error=str(exc),
-        )

 """Restricted Python code executor.
+The executor is intentionally narrow:
+  - import statements are rejected before execution
+  - dangerous builtin names are blocked
+  - dunder attribute access is blocked to prevent object graph escapes
+  - only a curated builtin/module surface is exposed
+This keeps the tool useful for HumanEval-style tasks while making the
+security boundaries explicit and testable.
 """
 from __future__ import annotations
+import ast
+import builtins
 import contextlib
+import io
+import math
+import operator
+import collections
+import functools
+import itertools
+import statistics
+import heapq
+import bisect
+import fractions
+import decimal
+import re
+from typing import Any, Dict
 from env.models import OrchestratorAction, ToolResult
 _MAX_OUTPUT_CHARS = 2000
+_BLOCKED_NAMES = {
+    "__builtins__",
+    "__import__",
+    "open",
+    "exec",
+    "eval",
+    "compile",
+    "globals",
+    "locals",
+    "vars",
+    "dir",
+    "getattr",
+    "setattr",
+    "delattr",
+    "input",
+    "help",
+    "type",
+    "object",
+    "super",
+    "memoryview",
+    "breakpoint",
+    "exit",
+    "quit",
+}
+_BLOCKED_ATTRS = {
+    "__class__",
+    "__base__",
+    "__bases__",
+    "__subclasses__",
+    "__mro__",
+    "__globals__",
+    "__code__",
+    "__closure__",
+    "__dict__",
+    "__getattribute__",
+    "__getattr__",
+    "__setattr__",
+    "__delattr__",
+    "__reduce__",
+    "__reduce_ex__",
+    "__func__",
+    "__self__",
+    "__module__",
+}
+_SAFE_BUILTIN_NAMES = {
+    "abs",
+    "all",
+    "any",
+    "bool",
+    "chr",
+    "dict",
+    "enumerate",
+    "float",
+    "int",
+    "isinstance",
+    "issubclass",
+    "len",
+    "list",
+    "map",
+    "max",
+    "min",
+    "pow",
+    "range",
+    "repr",
+    "reversed",
+    "round",
+    "set",
+    "slice",
+    "sorted",
+    "str",
+    "sum",
+    "tuple",
+    "zip",
+    "divmod",
+    "ord",
+    "Exception",
+    "ValueError",
+    "RuntimeError",
+    "TypeError",
+    "KeyError",
+    "IndexError",
+    "AssertionError",
+    "ZeroDivisionError",
+    "object",
+}
+_SAFE_MODULES: Dict[str, Any] = {
+    "math": math,
+    "collections": collections,
+    "functools": functools,
+    "itertools": itertools,
+    "statistics": statistics,
+    "heapq": heapq,
+    "bisect": bisect,
+    "fractions": fractions,
+    "decimal": decimal,
+    "re": re,
+    "operator": operator,
+}
+class SandboxViolation(ValueError):
+    """Raised when code tries to cross the sandbox boundary."""
+def _validate_tree(tree: ast.AST) -> None:
+    for node in ast.walk(tree):
+        if isinstance(node, (ast.Import, ast.ImportFrom)):
+            raise SandboxViolation("import statements are blocked")
+        if isinstance(node, (ast.Global, ast.Nonlocal)):
+            raise SandboxViolation("global and nonlocal are blocked")
+        if isinstance(node, ast.Attribute):
+            if node.attr.startswith("__") or node.attr in _BLOCKED_ATTRS:
+                raise SandboxViolation(f"attribute access '{node.attr}' is blocked")
+        if isinstance(node, ast.Name):
+            if node.id.startswith("__") or node.id in _BLOCKED_NAMES:
+                raise SandboxViolation(f"name '{node.id}' is blocked")
+def _safe_builtins(stdout_buf: io.StringIO) -> Dict[str, Any]:
+    safe: Dict[str, Any] = {name: getattr(builtins, name) for name in _SAFE_BUILTIN_NAMES}
+    def safe_print(*args: Any, **kwargs: Any) -> None:
+        kwargs = dict(kwargs)
+        kwargs.pop("file", None)
+        builtins.print(*args, **kwargs, file=stdout_buf)
+    safe["print"] = safe_print
+    safe["__build_class__"] = builtins.__build_class__
+    return safe
 def code_executor_tool(action: OrchestratorAction) -> ToolResult:
         return ToolResult(tool_id="code_executor", output="[No code provided]", error="empty")
     stdout_buf = io.StringIO()
+    safe_globals: Dict[str, Any] = {
+        "__builtins__": _safe_builtins(stdout_buf),
+        "__name__": "__code_executor__",
+        "__package__": None,
+        **_SAFE_MODULES,
     }
     try:
+        tree = ast.parse(code, mode="exec")
+        _validate_tree(tree)
         with contextlib.redirect_stdout(stdout_buf):
+            exec(compile(tree, "<code_executor>", "exec"), safe_globals)  # noqa: S102
         output = stdout_buf.getvalue()[:_MAX_OUTPUT_CHARS] or "[Code ran, no output]"
         return ToolResult(tool_id="code_executor", output=output)
+    except SandboxViolation as exc:
+        return ToolResult(tool_id="code_executor", output=f"[Sandbox blocked: {exc}]", error="sandbox_violation")
     except Exception as exc:
+        return ToolResult(tool_id="code_executor", output=f"[Execution error: {exc}]", error=str(exc))
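The executor combines two layers: AST validation before execution and a curated `__builtins__` mapping during execution. The sketch below isolates the second layer under simplified assumptions. `run_restricted` and its three-name allow-list are hypothetical, not the project's `code_executor_tool`. It shows that a name missing from the mapping fails with `NameError` at runtime, and that the print wrapper always writes to the capture buffer.

```python
import builtins
import io


def run_restricted(code: str) -> str:
    """Run code with a tiny builtin allow-list and return captured output."""
    buf = io.StringIO()

    def safe_print(*args, **kwargs):
        # Ignore any caller-supplied file= so output always lands in buf.
        kwargs.pop("file", None)
        builtins.print(*args, **kwargs, file=buf)

    allowed = {"len": len, "range": range, "sum": sum, "print": safe_print}
    try:
        exec(compile(code, "<sketch>", "exec"), {"__builtins__": allowed})
    except Exception as exc:
        return f"[Execution error: {type(exc).__name__}: {exc}]"
    return buf.getvalue()


ok_output = run_restricted("print(sum(range(4)))")
blocked_output = run_restricted("open('tmp.txt', 'w')")
print(repr(ok_output), blocked_output)
```

A restricted `__builtins__` mapping on its own is not a security boundary, because attribute chains on ordinary objects can still reach other types. That is why the diff pairs it with the dunder-attribute and name checks in `_validate_tree`.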

tools/runtime.py ADDED

@@ -0,0 +1,159 @@

+"""Shared tool catalog and dispatch helpers for CostAwareToolEnv.
+This module keeps the tool contract explicit:
+  - the catalog describes every tool, its purpose, and its input field
+  - registry validation catches config drift early
+  - dispatch normalizes failures into ToolResult objects instead of
+    letting exceptions leak through the environment loop
+"""
+from __future__ import annotations
+from dataclasses import asdict, dataclass
+from typing import Any, Callable, Dict, List, Mapping
+from env.config import EnvConfig
+from env.models import OrchestratorAction, TOOL_IDS, ToolResult
+@dataclass(frozen=True)
+class ToolSpec:
+    tool_id: str
+    label: str
+    purpose: str
+    input_field: str
+    cost: float
+    notes: str
+_TOOL_SPEC_TEMPLATES: Dict[str, Dict[str, str]] = {
+    "ceramic_search": {
+        "label": "Ceramic web search",
+        "purpose": "Web retrieval for multi-hop factual QA",
+        "input_field": "query",
+        "notes": "Falls back to deterministic offline search when Ceramic credentials are unavailable.",
+    },
+    "wiki_lookup": {
+        "label": "Wikipedia lookup",
+        "purpose": "Entity facts, definitions, and short summaries",
+        "input_field": "query",
+        "notes": "Returns an explicit not-found or HTTP error result instead of crashing.",
+    },
+    "calculator": {
+        "label": "Calculator",
+        "purpose": "Arithmetic and symbolic math",
+        "input_field": "expression",
+        "notes": "Uses a restricted AST evaluator with comparisons and common math functions.",
+    },
+    "code_executor": {
+        "label": "Python executor",
+        "purpose": "HumanEval-style coding tasks",
+        "input_field": "code_snippet",
+        "notes": "Sandboxed exec with blocked imports, dunder attribute access, and unsafe builtins.",
+    },
+    "llm_reason": {
+        "label": "LLM reasoning",
+        "purpose": "Costly model-backed reasoning on hard problems",
+        "input_field": "query",
+        "notes": "Returns a clear no_api_key error when Together is unavailable.",
+    },
+    "commit": {
+        "label": "Commit answer",
+        "purpose": "Submit the final answer and advance the episode",
+        "input_field": "answer",
+        "notes": "Pass-through only; grading happens inside the environment.",
+    },
+}
+def validate_tool_costs(config: EnvConfig) -> None:
+    """Fail fast if the configured cost map drifts from the canonical tool set."""
+    missing = [tool_id for tool_id in TOOL_IDS if tool_id not in config.tool_costs]
+    if missing:
+        raise ValueError(f"EnvConfig.tool_costs is missing required tools: {missing}")
+    negative = {tool_id: cost for tool_id, cost in config.tool_costs.items() if cost < 0}
+    if negative:
+        raise ValueError(f"Tool costs must be non-negative: {negative}")
+
+
+def build_tool_catalog(config: EnvConfig | None = None) -> List[ToolSpec]:
+    """Return the canonical ordered catalog used by the UI and docs."""
+    cfg = config or EnvConfig()
+    validate_tool_costs(cfg)
+    catalog: List[ToolSpec] = []
+    for tool_id in TOOL_IDS:
+        template = _TOOL_SPEC_TEMPLATES[tool_id]
+        catalog.append(
+            ToolSpec(
+                tool_id=tool_id,
+                label=template["label"],
+                purpose=template["purpose"],
+                input_field=template["input_field"],
+                cost=cfg.tool_costs[tool_id],
+                notes=template["notes"],
+            )
+        )
+    return catalog
+
+
+def build_tool_registry(config: EnvConfig | None = None) -> Dict[str, Callable[[OrchestratorAction], ToolResult]]:
+    """Return the canonical mapping of tool_id -> tool callable."""
+    from .calculator import calculator_tool
+    from .ceramic_search import make_search_tool
+    from .code_executor import code_executor_tool
+    from .commit import commit_tool
+    from .llm_reason import llm_reason_tool
+    from .wiki_lookup import wiki_lookup_tool
+    cfg = config or EnvConfig()
+    validate_tool_costs(cfg)
+    return {
+        "ceramic_search": make_search_tool(),
+        "calculator": calculator_tool,
+        "wiki_lookup": wiki_lookup_tool,
+        "code_executor": code_executor_tool,
+        "llm_reason": llm_reason_tool,
+        "commit": commit_tool,
+    }
+
+
+def dispatch_tool(
+    tool_id: str,
+    action: OrchestratorAction,
+    registry: Mapping[str, Callable[[OrchestratorAction], ToolResult]],
+) -> ToolResult:
+    """Call a tool and normalize missing-tool and crash cases into ToolResult."""
+    tool_fn = registry.get(tool_id)
+    if tool_fn is None:
+        return ToolResult(
+            tool_id=tool_id,
+            output="[Tool not available in this environment]",
+            error="not_available",
+        )
+    try:
+        result = tool_fn(action)
+    except Exception as exc:  # pragma: no cover - defensive wrapper
+        return ToolResult(
+            tool_id=tool_id,
+            output=f"[Error: {exc}]",
+            error=str(exc),
+        )
+    if not isinstance(result, ToolResult):
+        return ToolResult(
+            tool_id=tool_id,
+            output=f"[Tool error: unexpected return type {type(result).__name__}]",
+            error="invalid_return_type",
+        )
+    if result.tool_id != tool_id:
+        result = result.model_copy(update={"tool_id": tool_id})
+    return result
+
+
+def catalog_as_dicts(config: EnvConfig | None = None) -> List[dict[str, Any]]:
+    """Convenience helper for JSON serialization."""
+    return [asdict(spec) for spec in build_tool_catalog(config)]
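
Reviewer sketch: the normalization that `dispatch_tool` performs can be tried in isolation. The block below is a minimal sketch, not the project's code. `Action`, `Result`, `dispatch`, and the three toy tools are hypothetical stand-ins built on plain dataclasses, so `dataclasses.replace` takes the place of the `model_copy` call in the diff. It exercises the four paths the docstring describes: missing tool, crashing tool, wrong return type, and a mismatched `tool_id`.

```python
from dataclasses import dataclass, replace
from typing import Callable, Mapping, Optional


@dataclass(frozen=True)
class Action:
    """Hypothetical stand-in for OrchestratorAction."""
    expression: str = ""


@dataclass(frozen=True)
class Result:
    """Hypothetical stand-in for ToolResult."""
    tool_id: str
    output: str
    error: Optional[str] = None


ToolFn = Callable[[Action], Result]


def dispatch(tool_id: str, action: Action, registry: Mapping[str, ToolFn]) -> Result:
    """Call a tool and fold every failure mode into a Result."""
    tool_fn = registry.get(tool_id)
    if tool_fn is None:
        # Unknown tool: report it instead of raising KeyError.
        return Result(tool_id, "[Tool not available in this environment]", "not_available")
    try:
        result = tool_fn(action)
    except Exception as exc:
        # Tool crashed: surface the message as data, not as a traceback.
        return Result(tool_id, f"[Error: {exc}]", str(exc))
    if not isinstance(result, Result):
        kind = type(result).__name__
        return Result(tool_id, f"[Tool error: unexpected return type {kind}]", "invalid_return_type")
    if result.tool_id != tool_id:
        # Tool mislabeled its own result: overwrite with the requested id.
        result = replace(result, tool_id=tool_id)
    return result


def crashing_tool(action: Action) -> Result:
    raise ValueError("boom")


def mislabeled_tool(action: Action) -> Result:
    return Result("some_other_tool", "42")


def bad_type_tool(action: Action):
    return "not a Result"


registry = {
    "crash": crashing_tool,
    "mislabeled": mislabeled_tool,
    "bad_type": bad_type_tool,
}

outcomes = {
    name: dispatch(name, Action(), registry)
    for name in ("crash", "mislabeled", "bad_type", "missing")
}
for name, res in outcomes.items():
    print(name, res.tool_id, res.error)
```

Under these assumptions every call returns a `Result`, so a caller such as the environment's step loop would need no `try`/`except` of its own around tool invocation.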

tools/wiki_lookup.py CHANGED

@@ -3,6 +3,7 @@ from __future__ import annotations
 import urllib.parse
 import urllib.request
+import urllib.error
 import json
 from env.models import OrchestratorAction, ToolResult
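
Reviewer sketch: the new `urllib.error` import fits the catalog note that `wiki_lookup` "Returns an explicit not-found or HTTP error result instead of crashing." The rest of the file is not shown in this hunk, so the block below is only an assumption about how such handling could look. `lookup_summary`, the `fetch` parameter, the error strings, and the fake fetchers are all hypothetical. The one standard-library detail it relies on is that `HTTPError` is a subclass of `URLError`, so it has to be caught first.

```python
import urllib.error
from email.message import Message
from typing import Callable, Dict


def lookup_summary(title: str, fetch: Callable[[str], str]) -> Dict[str, str]:
    """Hypothetical wrapper: turn urllib failures into explicit error results."""
    try:
        return {"output": fetch(title), "error": ""}
    except urllib.error.HTTPError as exc:
        # HTTPError subclasses URLError, so this clause must come first.
        if exc.code == 404:
            return {"output": f"[No Wikipedia page found for '{title}']", "error": "not_found"}
        return {"output": f"[HTTP error {exc.code}]", "error": f"http_{exc.code}"}
    except urllib.error.URLError as exc:
        # DNS failures, refused connections, and similar transport problems.
        return {"output": f"[Network error: {exc.reason}]", "error": "network_error"}


def fake_fetch_ok(title: str) -> str:
    return f"Summary of {title}"


def fake_fetch_404(title: str) -> str:
    raise urllib.error.HTTPError("https://example.invalid/" + title, 404, "Not Found", Message(), None)


def fake_fetch_500(title: str) -> str:
    raise urllib.error.HTTPError("https://example.invalid/" + title, 500, "Server Error", Message(), None)


def fake_fetch_offline(title: str) -> str:
    raise urllib.error.URLError("connection refused")


for fetch in (fake_fetch_ok, fake_fetch_404, fake_fetch_500, fake_fetch_offline):
    print(fetch.__name__, lookup_summary("Eiffel Tower", fetch))
```

Injecting `fetch` keeps the sketch runnable offline; the real tool presumably calls `urllib.request` directly.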