Spaces: Itachi1824/compliance-auditor-env
Itachi-1824 committed on Apr 10

Commit 107f92d

1 Parent(s): 9109173

feat: investigation-grade overhaul + procedural generation

- investigation-grade documents: 92KB of regulatory text requiring genuine analysis
- 9 hand-crafted scenarios + infinite procedural generation (91K+ unique combos)
- adaptive document depth: repeat tool calls reveal forensic deep-dive content
- dynamic audit state: environment responds to findings and remediations
- evidence chain validation: warns when findings lack supporting investigation
- post-remediation overlays: environment reacts to proposed fixes
- seed-based noise injection: every episode has unique numbers/percentages
- 6 unique state graph topologies across scenarios
- 7-tab gradio dashboard with compliance map and anti-gaming showcase
- 74 tests across 8 files (anti-gaming, evidence chain, stress, procedural)
- 10-model benchmark runner for leaderboard generation

Files changed (15)

README.md +115 -101
benchmark_leaderboard.py +204 -0
inference.py +25 -19
openenv.yaml +15 -3
scenarios/procedural.py +693 -0
scenarios/registry.py +0 -0
server/engine.py +18 -7
server/environment.py +314 -56
server/gradio_landing.py +181 -26
tests/test_difficulty_calibration.py +106 -0
tests/test_evidence_chain.py +156 -0
tests/test_investigation_depth.py +236 -0
tests/test_procedural.py +125 -0
tests/test_reward_hacking.py +339 -0
tests/test_stress.py +94 -0

README.md CHANGED

@@ -11,72 +11,127 @@ tags:
 # EU AI Act Compliance Auditor
-An MCP-based environment where LLM agents audit AI systems for EU AI Act compliance — from risk classification to violation identification to remediation planning. Scenarios based on real regulatory articles. Parameter randomization on every reset prevents memorization; agents must learn the **audit process**, not specific answers.
-## Why This Environment
-The EU AI Act's major enforcement deadline is **August 2, 2026** — less than 4 months away. Every company deploying AI in Europe faces fines up to **EUR 35 million or 7% of global revenue**. Yet no automated compliance auditing benchmark exists. This environment fills that gap with 8 realistic scenarios across the full spectrum of EU AI Act risk categories.
 ## Stats
 | Metric | Value |
 |--------|-------|
-| Scenarios | 8 |
-| MCP Tools | 11 |
-| Reward Components | 6 |
-| Difficulty Tiers | 3 (easy / medium / hard) |
-| State Graph Nodes | 12 per scenario |
 | Parameter Randomization | Company, region, version, dates per reset |
-## Tools (MCP Interface)
-### Investigation
-| Tool | Description |
-|------|-------------|
-| `get_system_overview` | Gather system description, deployer info, deployment context |
-| `classify_system` | Classify risk level (prohibited / high_risk / limited_risk / minimal_risk) |
-| `check_documentation` | Review Annex IV technical documentation completeness |
-| `audit_training_data` | Check bias, representativeness, data governance (Article 10) |
-| `verify_human_oversight` | Verify Article 14 human-in-the-loop mechanisms |
-| `check_transparency` | Check Article 50 transparency obligations |
-| `assess_risk_management` | Review risk management system (Article 9) |
-| `check_logging` | Verify automatic logging and traceability (Article 12) |
-### Resolution
-| Tool | Description |
-|------|-------------|
-| `submit_finding` | Report a compliance violation (call per finding) |
-| `recommend_fix` | Propose remediation with priority |
-| `verify_compliance` | Final determination — triggers terminal reward |
-## Scenarios
-### Easy
-- **Customer Service Chatbot** — Limited-risk system missing AI disclosure (Article 50)
-- **Music Recommendation Engine** — Minimal-risk system needing voluntary code of conduct
-### Medium
-- **AI Resume Screener** — High-risk hiring AI (Annex III) with gender bias, missing oversight, incomplete documentation
-- **Credit Scoring Model** — High-risk fintech system with opaque features and no right to human review
-- **Emergency Triage AI** — Medical device with age bias and no prospective clinical validation
-### Hard
-- **Citizen Wellness App** — **PROHIBITED** social scoring system disguised as a voluntary wellness tool. Must identify it as prohibited under Article 5(1)(c)
-- **AI Content Studio** — Deepfake generation platform missing all Article 50 transparency obligations
-- **Corporate AI Portfolio** — Multi-system audit with 4 interconnected AI systems sharing a data lake. Must identify compound risks and cross-system data flow issues
 ## 6-Component Reward
-| Component | Weight | Description |
 |-----------|--------|-------------|
-| Classification | 20% | Correct risk category identification |
-| Finding Completeness | 25% | Recall of ground-truth violations |
-| Finding Precision | 15% | Penalty for false positives / red herring findings |
-| Remediation Quality | 15% | Correct fixes in priority order |
-| Methodology | 15% | Followed correct audit sequence (overview → classify → investigate → find → fix → verify) |
-| Efficiency | 10% | Queries used vs optimal path |
-All rewards clamped to (0.01, 0.99) for OpenEnv validator compliance.
 ## Quick Start
@@ -87,67 +142,26 @@ pip install "openenv-core[core]" fastmcp gradio httpx openai
 # Run locally
 uvicorn server.app:app --host 0.0.0.0 --port 7860
-# Run inference
 export API_BASE_URL="https://integrate.api.nvidia.com/v1"
-export MODEL_NAME="google/gemma-4-31b-it"
-export HF_TOKEN="your-key"
 python inference.py --space https://Itachi1824-compliance-auditor-env.hf.space
 # Docker
 docker build -t compliance-env . && docker run -p 7860:7860 compliance-env
-```
-## API
-### Standard OpenEnv
-- `POST /reset` — Start new episode
-- `POST /step` — Execute action
-- `GET /state` — Get episode state
-- `GET /health` — Health check
-### Custom HTTP Session API
-- `POST /api/reset` — Create session, returns tools + observation
-- `POST /api/call_tool` — Call an audit tool in a session
-- `POST /api/close` — End session
-## Architecture
-```
-compliance_env/
-├── server/
-│   ├── app.py              # FastAPI + sessions + Gradio UI
-│   ├── environment.py      # MCP environment with 11 tools
-│   └── engine.py           # State graph + 6-component reward
-├── scenarios/
-│   └── registry.py         # 8 scenarios with state graphs
-├── client.py               # HTTP client for inference
-├── inference.py             # OpenAI function-calling agent
-├── models.py               # Pydantic observation/state models
-├── Dockerfile              # Port 7860, python:3.11-slim
-└── openenv.yaml            # OpenEnv manifest with tasks
 ```
-## Baseline Scores
-Tested against live HF Space with NVIDIA NIM models:
-| Rank | Model | Easy | Medium | Hard | Overall |
-|------|-------|------|--------|------|---------|
-| 1 | stepfun-ai/step-3.5-flash | 0.473 | 0.425 | 0.404 | **0.434** |
-| 2 | mistralai/mistral-small-4-119b | 0.457 | 0.425 | 0.348 | **0.410** |
-| 3 | deepseek-ai/deepseek-v3.1 | 0.442 | 0.425 | 0.348 | **0.405** |
-Hard scenarios genuinely challenge frontier models — the prohibited social scoring detection requires the agent to see through deliberate misdirection ("wellness app" that's actually social scoring affecting public service access).
-## Sample Output
-```
-[START] task=easy_chatbot_transparency_001 env=compliance_auditor_env model=google/gemma-4-31b-it
-[STEP] step=1 action=get_system_overview reward=0.00 done=false error=null
-[STEP] step=2 action=classify_system reward=0.00 done=false error=null
-[STEP] step=3 action=check_documentation reward=0.00 done=false error=null
-[STEP] step=4 action=check_transparency reward=0.00 done=false error=null
-[STEP] step=5 action=submit_finding reward=0.00 done=false error=null
-[STEP] step=6 action=verify_compliance reward=0.46 done=true error=null
-[END] success=true steps=6 score=0.457 rewards=0.00,0.00,0.00,0.00,0.00,0.46
-```

 # EU AI Act Compliance Auditor
+An MCP environment where LLM agents audit AI systems for EU AI Act compliance. Tools return **investigation-grade regulatory documents** — statistical tables, documentation inventories, operational procedures — that require genuine analysis to identify violations. No pre-digested verdicts. The agent must reason about evidence across documents to find compliance gaps.
+## What Makes This Different
+Most compliance environments hand the agent pre-labeled answers: `"bias_assessment": "FAILED"`. This environment returns the **raw evidence**:
+```
+CALLBACK RATES BY DEMOGRAPHIC (Technical Roles Only):
+  Group               Rate     vs Baseline
+  Male applicants     34.2%    (baseline)
+  Female applicants   26.3%    -23.1%
+  Eastern EU          27.4%    -19.9%
+```
+The agent must identify the 23% callback disparity from the table, recognize it as gender bias, cross-reference with the oversight document showing only 5% of rejections are reviewed, and connect these into actionable findings.
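The "vs Baseline" column in the table above is a relative difference against the male callback rate. A minimal sketch of that arithmetic, assuming the column is computed as `(rate - baseline) / baseline`; the helper name `relative_gap` is hypothetical and not part of the repository:

```python
def relative_gap(rate: float, baseline: float) -> float:
    """Relative difference of `rate` versus `baseline`, in percent."""
    return (rate - baseline) / baseline * 100.0


# Rates copied from the README table (technical roles only).
baseline = 34.2  # male applicants
rates = {"Female applicants": 26.3, "Eastern EU": 27.4}

gaps = {group: round(relative_gap(rate, baseline), 1) for group, rate in rates.items()}
print(gaps)  # {'Female applicants': -23.1, 'Eastern EU': -19.9}
```

Both values agree with the table, so the 23% figure is a relative gap and not a difference of 23 percentage points.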
 ## Stats
 | Metric | Value |
 |--------|-------|
+| Fixed Scenarios | 9 across 3 difficulty tiers |
+| Procedural Scenarios | Infinite (seed-based generation) |
+| MCP Tools | 11 (8 investigation + 3 resolution) |
+| Reward Components | 6 (weighted, anti-gaming) |
+| Graph Topologies | 6 unique per-scenario |
+| Document Depth | 500-3,275 chars per tool response |
+| Total Document Content | 77K+ chars across all scenarios |
+| Anti-Gaming Tests | 12 adversarial exploits proven ineffective |
+| Test Suite | 74 tests across 8 files |
+| Adaptive Depth | Repeat tool calls reveal forensic deep-dive |
+| Dynamic State | Environment reacts to findings and remediations |
 | Parameter Randomization | Company, region, version, dates per reset |
+## Scenarios
+### Easy (2) — Clear-cut systems, focused investigation
+- **Customer Service Chatbot** — Limited-risk. Missing AI disclosure under Article 50. Agent checks transparency and oversight.
+- **Music Recommendation Engine** — Minimal-risk. Voluntary code of conduct recommended. Short investigation path.
+### Medium (4) — Statistical evidence, red herrings, multi-article violations
+- **AI Resume Screener** — High-risk hiring AI (Annex III). 5 findings: gender bias (23% callback gap), insufficient oversight (5% review rate), missing FRIA, incomplete Annex IV docs, data governance gaps.
+- **Credit Scoring Model** — High-risk fintech. Opaque alternative data features (social media, device metadata), no right to human review, missing conformity assessment.
+- **Emergency Triage AI** — Medical device dual-regulation (MDR + AI Act). Age bias in 75+ cohort (76.3% sensitivity), retrospective-only validation, no real-time monitoring.
+- **Workplace Emotion Recognition** — **PROHIBITED** under Article 5(1)(f). Webcam-based "engagement analytics" that's actually emotion recognition. Deployer frames it as productivity tool — agent must recognize it processes biometric data (facial action units, micro-expressions) without medical/safety exception.
+### Hard (3) — Disguised systems, compound risks, multi-system dependencies
+- **Citizen Wellness App** — **PROHIBITED** social scoring disguised as voluntary wellness tool. Deployer frames it as gamification, but investigation reveals it controls access to public services based on social behavior scores. Agent must see through the framing.
+- **AI Content Studio** — Deepfake generation platform. Missing all Article 50 content labeling, no C2PA watermarking, no content provenance. Political content generated without disclosure.
+- **Corporate AI Portfolio** — 4 interconnected AI systems sharing a data lake. Agent must identify cross-system data flows amplifying risks, recognize employee sentiment analysis as high-risk, and spot biometric categorization in safety monitoring.
+## Procedural Scenario Generator
+Beyond the 9 hand-crafted scenarios, a seed-based procedural generator produces **infinite unique scenarios** by combining:
+- **5 system types**: Drone delivery (critical infrastructure), exam proctoring (education), insurance adjudication (essential services), legal research (limited risk), predictive policing (prohibited)
+- **16 violation templates**: Gender bias, age discrimination, data governance gaps, missing conformity, logging inadequacies, and more
+- **5 red herring templates**: GDPR confusion, compliant sibling systems, ISO certifications, voluntary ethics boards
+```python
+# Any seed produces a unique, coherent scenario
+env.reset(scenario_id="procedural_medium_42")   # Seed 42, medium difficulty
+env.reset(scenario_id="procedural_hard_12345")  # Seed 12345, hard difficulty
+```
+Each generated scenario has proper ground truth findings, matching state graph, violation-specific documents, and is fully compatible with the 6-component reward function.
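The scenario IDs above appear to encode a difficulty and a seed. The sketch below shows one way such an ID could be parsed and how a seeded `random.Random` makes template selection reproducible. The function names, the ID layout `procedural_<difficulty>_<seed>`, and the sample pool are assumptions for illustration, not the actual code in `scenarios/procedural.py`:

```python
import random
from typing import List, Tuple

DIFFICULTIES = ("easy", "medium", "hard")


def parse_procedural_id(scenario_id: str) -> Tuple[str, int]:
    """Split an ID such as 'procedural_medium_42' into (difficulty, seed)."""
    prefix, difficulty, seed_text = scenario_id.split("_", 2)
    if prefix != "procedural" or difficulty not in DIFFICULTIES:
        raise ValueError(f"not a procedural scenario id: {scenario_id!r}")
    return difficulty, int(seed_text)


def pick_templates(seed: int, pool: List[str], count: int) -> List[str]:
    """Draw `count` templates; the same seed always yields the same draw."""
    rng = random.Random(seed)  # private RNG, leaves global random state untouched
    return rng.sample(pool, count)


# Hypothetical pool, named after violation types listed in the README.
VIOLATIONS = [
    "gender_bias",
    "age_discrimination",
    "data_governance_gaps",
    "missing_conformity",
    "logging_inadequacies",
]

difficulty, seed = parse_procedural_id("procedural_medium_42")
first = pick_templates(seed, VIOLATIONS, 3)
second = pick_templates(seed, VIOLATIONS, 3)
print(difficulty, seed, first == second)  # medium 42 True
```

Using a dedicated `random.Random(seed)` instance instead of `random.seed()` keeps concurrent episodes from interfering with each other's draws.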
+## Tools
+### Investigation
+| Tool | Returns |
+|------|---------|
+| `get_system_overview` | Formal audit assignment brief with system description and deployment context |
+| `classify_system` | Records risk classification (prohibited / high_risk / limited_risk / minimal_risk) |
+| `check_documentation` | Annex IV cross-reference table with per-section compliance status |
+| `audit_training_data` | Demographic statistics tables, data governance assessment, bias indicators |
+| `verify_human_oversight` | Operational procedures extract with review statistics and override capabilities |
+| `check_transparency` | User-facing UI/ToS text analysis with Article 50 compliance indicators |
+| `assess_risk_management` | Risk register, conformity assessment tracker, Annex III classification analysis |
+| `check_logging` | Audit log schema, Article 12 requirements gap analysis |
+### Resolution
+| Tool | Purpose |
+|------|---------|
+| `submit_finding` | Report a compliance violation (call once per finding) |
+| `recommend_fix` | Propose remediation with priority |
+| `verify_compliance` | Final determination — triggers terminal 6-component reward |
 ## 6-Component Reward
+| Component | Weight | Anti-Gaming |
 |-----------|--------|-------------|
+| Classification | 20% | Adjacent-category partial credit (40%). Wrong by 2+ categories = 0. |
+| Finding Completeness | 25% | Token-based fuzzy matching (Jaccard 40%, min 2 tokens). Prevents keyword stuffing. |
+| Finding Precision | 15% | Red herring submissions penalized 15% each. False positives reduce score. |
+| Remediation Quality | 15% | Presence (70%) + priority ordering (30%). Missing remediation = 0. |
+| Methodology | 15% | Order violations penalized. Skipping investigation tools = 0. |
+| Efficiency | 10% | Fewer steps than optimal = penalty (skipping investigation). More steps = diminishing returns. |
+All rewards clamped to (0.001, 0.999). 12 adversarial tests prove robustness.
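As a rough illustration of the table above, the sketch below combines six component scores with the listed weights and clamps the result to (0.001, 0.999). It also shows one reading of "Jaccard 40%, min 2 tokens": a submitted finding matches a ground-truth finding when they share at least two tokens and their Jaccard similarity is at least 0.4. The function names, the component keys, and that interpretation of the threshold are assumptions; the real logic lives in `server/engine.py` and may differ:

```python
import re
from typing import Dict, Set

# Weights copied from the README table; they sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "classification": 0.20,
    "finding_completeness": 0.25,
    "finding_precision": 0.15,
    "remediation_quality": 0.15,
    "methodology": 0.15,
    "efficiency": 0.10,
}


def tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens; underscores and spaces both split."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def fuzzy_match(submitted: str, truth: str,
                threshold: float = 0.4, min_shared: int = 2) -> bool:
    """Jaccard match with a floor on shared tokens (assumed semantics)."""
    a, b = tokens(submitted), tokens(truth)
    union = a | b
    if not union:
        return False
    shared = a & b
    return len(shared) >= min_shared and len(shared) / len(union) >= threshold


def total_reward(components: Dict[str, float]) -> float:
    """Weighted sum of component scores in [0, 1], clamped to (0.001, 0.999)."""
    raw = sum(weight * components.get(name, 0.0) for name, weight in WEIGHTS.items())
    return max(0.001, min(0.999, raw))


print(fuzzy_match("gender bias in callback rates", "gender_bias_callback"))  # True
print(fuzzy_match("bias", "gender_bias_callback"))                           # False
```

The shared-token floor is what blunts keyword stuffing in this sketch: a single matching word such as "bias" never counts as a hit, whatever the similarity ratio.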
+## Architecture
+```
+compliance_env/
+  server/
+    environment.py      # MCP environment, 11 tools, dynamic audit state
+    engine.py           # State graph + 6-component reward computation
+    app.py              # FastAPI + HTTP session API + Gradio UI
+    gradio_landing.py   # 7-tab dashboard with investigation depth showcase
+  scenarios/
+    registry.py         # 9 scenarios with 77K+ chars of investigation documents
+  tests/
+    test_environment.py       # 14 environment + API tests
+    test_reward_hacking.py    # 12 adversarial anti-gaming tests
+    test_investigation_depth.py # 10 investigation quality tests
+  inference.py          # OpenAI function-calling baseline agent
+  client.py             # Zero-dependency HTTP client
+  models.py             # Pydantic observation/state models
+  Dockerfile            # python:3.11-slim, port 7860
+  openenv.yaml          # OpenEnv manifest with tasks
+```
 ## Quick Start
 # Run locally
 uvicorn server.app:app --host 0.0.0.0 --port 7860
+# Run inference (NVIDIA NIM)
 export API_BASE_URL="https://integrate.api.nvidia.com/v1"
+export MODEL_NAME="stepfun-ai/step-3.5-flash"
+export HF_TOKEN="nvapi-..."
 python inference.py --space https://Itachi1824-compliance-auditor-env.hf.space
 # Docker
 docker build -t compliance-env . && docker run -p 7860:7860 compliance-env
+# Tests
+pytest tests/ -v
 ```
+## API Endpoints
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/api/reset` | POST | Create session, returns tools + initial observation |
+| `/api/call_tool` | POST | Call an audit tool in an active session |
+| `/api/close` | POST | End session and cleanup |
+| `/tasks` | GET | List available scenarios |
+| `/grader` | POST | Grade a completed episode |
+| `/health` | GET | Health check |

benchmark_leaderboard.py ADDED

	@@ -0,0 +1,204 @@

+"""
+Leaderboard benchmark runner — 10 models across 3 NIM API keys.
+Distributes models across keys to maximize throughput (40 RPM per key).
+Runs all 9 fixed scenarios per model. Saves results to outputs/leaderboard/scores.json.
+Usage:
+  set NVIDIA_API_KEY_1=nvapi-...
+  set NVIDIA_API_KEY_2=nvapi-...
+  set NVIDIA_API_KEY_3=nvapi-...
+  python benchmark_leaderboard.py --space https://Itachi1824-compliance-auditor-env.hf.space
+"""
+import argparse
+import asyncio
+import json
+import os
+import sys
+import time
+from pathlib import Path
+from typing import Dict, List
+from openai import OpenAI
+# Import from our inference module
+from inference import run_episode, mcp_tools_to_openai
+from client import ComplianceAuditorHTTP
+from scenarios.registry import SCENARIO_LIST
+# ---------------------------------------------------------------------------
+# Configuration
+# ---------------------------------------------------------------------------
+API_BASE = "https://integrate.api.nvidia.com/v1"
+# 10 models distributed across 3 API keys for parallel execution
+MODEL_GROUPS = [
+    # Key 1: Tier S + A models (4 models)
+    {
+        "key_env": "NVIDIA_API_KEY_1",
+        "models": [
+            "deepseek-ai/deepseek-v3.1",
+            "stepfun-ai/step-3.5-flash",
+            "qwen/qwen3.5-122b-a10b",
+            "meta/llama-4-scout-17b-16e-instruct",
+        ],
+    },
+    # Key 2: Tier A models (3 models)
+    {
+        "key_env": "NVIDIA_API_KEY_2",
+        "models": [
+            "mistralai/mistral-large-3-675b-instruct-2512",
+            "google/gemma-4-31b-it",
+            "meta/llama-4-maverick-17b-128e-instruct",
+        ],
+    },
+    # Key 3: Tier A/B models (3 models)
+    {
+        "key_env": "NVIDIA_API_KEY_3",
+        "models": [
+            "nvidia/llama-3.1-nemotron-ultra-253b-v1",
+            "nvidia/nemotron-3-super-120b-a12b",
+            "meta/llama-3.3-70b-instruct",
+        ],
+    },
+]
+SCENARIOS = [s["id"] for s in SCENARIO_LIST if not s["id"].startswith("procedural")]
+async def benchmark_model(
+    model: str,
+    api_key: str,
+    base_url: str,
+    tools: List[Dict],
+) -> Dict:
+    """Run all scenarios for a single model."""
+    llm = OpenAI(base_url=API_BASE, api_key=api_key, timeout=120.0)
+    results = {}
+    for sid in SCENARIOS:
+        difficulty = next(s["difficulty"] for s in SCENARIO_LIST if s["id"] == sid)
+        try:
+            async with ComplianceAuditorHTTP(base_url=base_url) as env:
+                result = await run_episode(env, llm, model, tools, difficulty, sid)
+                score = max(0.001, min(0.999, result.get("reward", 0.01)))
+                results[sid] = {"score": round(score, 4), "steps": result.get("steps", 0)}
+                print(f"  {model:50s} | {sid:50s} | score={score:.4f} | steps={result.get('steps', 0)}", flush=True)
+        except Exception as e:
+            err_msg = str(e)[:80]
+            print(f"  {model:50s} | {sid:50s} | ERROR: {err_msg}", flush=True)
+            results[sid] = {"score": 0.01, "steps": 0, "error": err_msg}
+        # Rate limit: ~2s between episodes to stay under 40 RPM
+        await asyncio.sleep(2)
+    return results
+async def benchmark_group(
+    group: Dict,
+    base_url: str,
+    tools: List[Dict],
+) -> List[Dict]:
+    """Run all models in a key group sequentially (same API key)."""
+    key = os.environ.get(group["key_env"], "")
+    if not key:
+        print(f"WARNING: {group['key_env']} not set — skipping {len(group['models'])} models", flush=True)
+        return []
+    entries = []
+    for model in group["models"]:
+        print(f"\n{'='*60}", flush=True)
+        print(f"BENCHMARKING: {model}", flush=True)
+        print(f"  Key: {group['key_env']} | Scenarios: {len(SCENARIOS)}", flush=True)
+        print(f"{'='*60}", flush=True)
+        start = time.time()
+        scores = await benchmark_model(model, key, base_url, tools)
+        elapsed = time.time() - start
+        # Compute averages
+        all_scores = [v["score"] for v in scores.values() if "error" not in v]
+        avg = sum(all_scores) / len(all_scores) if all_scores else 0.0
+        tier_avgs = {}
+        for tier in ["easy", "medium", "hard"]:
+            tier_scores = [
+                v["score"] for sid, v in scores.items()
+                if next((s["difficulty"] for s in SCENARIO_LIST if s["id"] == sid), "") == tier
+                and "error" not in v
+            ]
+            tier_avgs[tier] = sum(tier_scores) / len(tier_scores) if tier_scores else 0.0
+        entry = {
+            "model": model,
+            "scores": scores,
+            "overall": round(avg, 4),
+            "tier_averages": {k: round(v, 4) for k, v in tier_avgs.items()},
+            "elapsed_seconds": round(elapsed, 1),
+        }
+        entries.append(entry)
+        print(f"\n  RESULT: {model}", flush=True)
+        print(f"    Overall: {avg:.4f}", flush=True)
+        for tier, tavg in tier_avgs.items():
+            print(f"    {tier}: {tavg:.4f}", flush=True)
+        print(f"    Time: {elapsed:.0f}s", flush=True)
+    return entries
+async def main():
+    parser = argparse.ArgumentParser(description="Leaderboard benchmark — 10 models")
+    parser.add_argument("--space", required=True, help="HF Space URL")
+    parser.add_argument("--output", default="outputs/leaderboard/scores.json")
+    args = parser.parse_args()
+    base_url = args.space.rstrip("/")
+    print(f"Benchmarking against: {base_url}", flush=True)
+    print(f"Scenarios: {len(SCENARIOS)}", flush=True)
+    print(f"Model groups: {len(MODEL_GROUPS)} ({sum(len(g['models']) for g in MODEL_GROUPS)} total models)", flush=True)
+    # Discover tools from the environment
+    async with ComplianceAuditorHTTP(base_url=base_url) as env:
+        await env.reset(difficulty="easy")
+        tools_raw = await env.list_tools()
+        tools = mcp_tools_to_openai(tools_raw)
+    print(f"Tools discovered: {len(tools)}", flush=True)
+    # Run all groups in parallel (one per API key)
+    tasks = [benchmark_group(g, base_url, tools) for g in MODEL_GROUPS]
+    group_results = await asyncio.gather(*tasks)
+    # Flatten and save
+    all_entries = []
+    for group_entries in group_results:
+        all_entries.extend(group_entries)
+    # Sort by overall score descending
+    all_entries.sort(key=lambda e: e["overall"], reverse=True)
+    # Save
+    output_path = Path(args.output)
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    with open(output_path, "w") as f:
+        json.dump(all_entries, f, indent=2)
+    print(f"\n{'='*60}", flush=True)
+    print("LEADERBOARD RESULTS", flush=True)
+    print(f"{'='*60}", flush=True)
+    for i, entry in enumerate(all_entries, 1):
+        m = entry["model"].split("/")[-1][:30]
+        print(f"  {i:2d}. {m:30s} | overall={entry['overall']:.4f} | "
+              f"easy={entry['tier_averages'].get('easy', 0):.4f} | "
+              f"medium={entry['tier_averages'].get('medium', 0):.4f} | "
+              f"hard={entry['tier_averages'].get('hard', 0):.4f}",
+              flush=True)
+    print(f"\nSaved to {output_path}", flush=True)
+if __name__ == "__main__":
+    asyncio.run(main())
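
The per-tier averaging in `benchmark_group` re-scans `SCENARIO_LIST` for every score. The sketch below shows the same aggregation with a single lookup table, skipping errored episodes as the runner does. The function name `tier_averages` and the sample scores are hypothetical; only the scenario IDs come from the commit:

```python
from collections import defaultdict
from typing import Dict, List

TIERS = ("easy", "medium", "hard")


def tier_averages(scores: Dict[str, dict],
                  difficulty_by_id: Dict[str, str]) -> Dict[str, float]:
    """Average score per tier, ignoring entries that carry an 'error' key."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for sid, entry in scores.items():
        if "error" in entry:
            continue  # failed episodes are excluded from the averages
        buckets[difficulty_by_id[sid]].append(entry["score"])
    result = {}
    for tier in TIERS:
        values = buckets.get(tier, [])
        result[tier] = round(sum(values) / len(values), 4) if values else 0.0
    return result


# Illustrative data only: these scores are made up, not benchmark results.
difficulty_by_id = {
    "easy_chatbot_transparency_001": "easy",
    "easy_recommendation_minimal_001": "easy",
    "medium_hiring_bias_001": "medium",
    "hard_social_scoring_prohibited_001": "hard",
}
scores = {
    "easy_chatbot_transparency_001": {"score": 0.5, "steps": 6},
    "easy_recommendation_minimal_001": {"score": 0.7, "steps": 5},
    "medium_hiring_bias_001": {"score": 0.4, "steps": 14},
    "hard_social_scoring_prohibited_001": {"score": 0.01, "steps": 0, "error": "timeout"},
}

print(tier_averages(scores, difficulty_by_id))
# {'easy': 0.6, 'medium': 0.4, 'hard': 0.0}
```

A tier whose episodes all errored reports 0.0 here, which matches the runner's fallback but cannot be told apart from a genuinely zero score in the saved JSON.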

inference.py CHANGED

@@ -36,24 +36,30 @@ ENV_BASE_URL = os.getenv("OPENENV_BASE_URL") or os.getenv("ENV_URL") or os.geten
 MAX_STEPS = 100
 CONTEXT_CHAR_LIMIT = 100000
-SYSTEM_PROMPT = """You are an EU AI Act compliance auditor. Complete your audit in UNDER 25 tool calls.
-WORKFLOW (follow this EXACT sequence):
-1. get_system_overview — understand the system
-2. classify_system — set risk_category to: prohibited, high_risk, limited_risk, or minimal_risk
-3. check_documentation — review technical docs
-4. audit_training_data — check for bias
-5. verify_human_oversight — check Article 14
-6. check_transparency — check Article 50
-7. assess_risk_management — check Article 9
-8. check_logging — check Article 12
-9. submit_finding — report EACH violation you found (one per call)
-10. recommend_fix — propose fix for each finding
-11. verify_compliance — MUST BE YOUR FINAL CALL with risk_classification, overall_assessment, key_findings_summary
-CRITICAL: You MUST call verify_compliance as your last tool call. Without it, you get zero score.
-CRITICAL: Always use tool calls, never output plain text.
-CRITICAL: For PROHIBITED systems — classify as prohibited, submit finding, recommend shutdown, then verify_compliance immediately.
 """
@@ -346,7 +352,7 @@ async def run_episode(
 BASELINE_SCENARIOS = {
     "easy": ["easy_chatbot_transparency_001", "easy_recommendation_minimal_001"],
-    "medium": ["medium_hiring_bias_001", "medium_credit_scoring_001", "medium_medical_triage_001"],
     "hard": ["hard_social_scoring_prohibited_001", "hard_deepfake_generation_001", "hard_multi_system_corporate_001"],
 }

 MAX_STEPS = 100
 CONTEXT_CHAR_LIMIT = 100000
+SYSTEM_PROMPT = """You are an EU AI Act compliance auditor conducting a formal audit.
+Your task: Investigate the AI system, identify all compliance violations, and submit a final compliance determination.
+APPROACH:
+- Start by understanding what the system does and who deploys it
+- Classify the system's risk level under the EU AI Act (prohibited / high_risk / limited_risk / minimal_risk)
+- Investigate relevant areas: documentation, training data, human oversight, transparency, risk management, logging
+- Read the investigation documents CAREFULLY — violations are embedded in the evidence, not labeled for you
+- Submit each violation you discover as a separate finding
+- Recommend remediation for each finding
+- End with verify_compliance to submit your final determination
+KEY PRINCIPLES:
+- PROHIBITED systems (Art. 5): social scoring, real-time biometric ID, manipulation — if detected, classify immediately and recommend shutdown
+- HIGH-RISK systems (Annex III): employment, credit, healthcare, law enforcement — require full investigation of all compliance areas
+- LIMITED-RISK (Art. 50): transparency obligations for chatbots, deepfakes — focus on disclosure and labeling
+- MINIMAL-RISK: voluntary code of conduct only
+IMPORTANT:
+- You MUST call verify_compliance as your final action. Without it, you receive no score.
+- Always use tool calls. Never output plain text responses.
+- Red herrings exist in the evidence — not every concern is a real violation.
+- Budget: aim to complete within 25 tool calls.
 """
 BASELINE_SCENARIOS = {
     "easy": ["easy_chatbot_transparency_001", "easy_recommendation_minimal_001"],
+    "medium": ["medium_hiring_bias_001", "medium_credit_scoring_001", "medium_medical_triage_001", "medium_emotion_recognition_workplace_001"],
     "hard": ["hard_social_scoring_prohibited_001", "hard_deepfake_generation_001", "hard_multi_system_corporate_001"],
 }

openenv.yaml CHANGED

@@ -6,11 +6,23 @@ app: server.app:app
 port: 7860
 tasks:
   - id: easy
-    name: "Easy — Chatbot & Recommendation compliance"
     grader: server.engine.compute_reward
   - id: medium
-    name: "Medium — Hiring AI, Credit Scoring, Medical Triage"
     grader: server.engine.compute_reward
   - id: hard
-    name: "Hard — Prohibited Systems, Deepfake, Multi-System Audit"
     grader: server.engine.compute_reward

 port: 7860
 tasks:
   - id: easy
+    name: "Easy — Chatbot Transparency & Recommendation Classification"
     grader: server.engine.compute_reward
+    scenarios:
+      - easy_chatbot_transparency_001
+      - easy_recommendation_minimal_001
   - id: medium
+    name: "Medium — Hiring Bias, Credit Scoring, Medical Triage, Emotion Recognition"
     grader: server.engine.compute_reward
+    scenarios:
+      - medium_hiring_bias_001
+      - medium_credit_scoring_001
+      - medium_medical_triage_001
+      - medium_emotion_recognition_workplace_001
   - id: hard
+    name: "Hard — Prohibited Social Scoring, Deepfake Compliance, Multi-System Audit"
     grader: server.engine.compute_reward
+    scenarios:
+      - hard_social_scoring_prohibited_001
+      - hard_deepfake_generation_001
+      - hard_multi_system_corporate_001

scenarios/procedural.py ADDED

	@@ -0,0 +1,693 @@

+"""
+Procedural scenario generator — infinite unique compliance audit scenarios.
+Combines system type templates, violation templates, and red herring templates
+using seed-based randomization to produce coherent, graded scenarios that are
+unique for every seed. Impossible to memorize.
+Architecture:
+  1. SystemTemplate — defines a category of AI system (drone delivery, exam proctoring, etc.)
+  2. ViolationTemplate — a specific compliance violation with document injection text
+  3. RedHerringTemplate — misleading information that isn't a real violation
+  4. ProceduralGenerator.generate(seed, difficulty) → AuditScenario
+"""
+from __future__ import annotations
+import random
+from dataclasses import dataclass, field
+from typing import Dict, List, Optional, Tuple
+from server.engine import AuditScenario, StateGraph, StateNode, Transition
+# ---------------------------------------------------------------------------
+# Templates
+# ---------------------------------------------------------------------------
+@dataclass(frozen=True)
+class SystemTemplate:
+    id: str
+    name_template: str  # e.g. "{company} DroneGuard"
+    category: str  # prohibited, high_risk, limited_risk, minimal_risk
+    annex_ref: str  # which Annex III category or article
+    description_template: str
+    deployer_template: str
+    domain_keywords: Tuple[str, ...] = ()
+@dataclass(frozen=True)
+class ViolationTemplate:
+    id: str
+    tool_area: str  # documentation, training_data, oversight, transparency, risk_management, logging
+    finding_id: str  # ground truth finding string
+    remediation_id: str  # required remediation string
+    doc_injection: str  # text injected into the relevant document section
+    severity: str = "high"
+@dataclass(frozen=True)
+class RedHerringTemplate:
+    id: str
+    tool_area: str  # which document section contains it
+    doc_injection: str  # misleading text
+# ---------------------------------------------------------------------------
+# System type pool (5 types covering different AI Act categories)
+# ---------------------------------------------------------------------------
+SYSTEM_TEMPLATES: List[SystemTemplate] = [
+    SystemTemplate(
+        id="drone_delivery",
+        name_template="{company} SkyRoute Delivery AI",
+        category="high_risk",
+        annex_ref="Annex III Category 2 — Critical Infrastructure",
+        description_template=(
+            "Autonomous drone delivery system operating in urban areas across {region}. "
+            "AI controls flight path planning, obstacle avoidance, and delivery routing "
+            "for {user_count} packages per month. System makes real-time autonomous "
+            "decisions affecting public safety in shared airspace."
+        ),
+        deployer_template="{company} — logistics-tech startup, drone operator license in {region}.",
+        domain_keywords=("drone", "airspace", "safety", "autonomous", "delivery"),
+    ),
+    SystemTemplate(
+        id="exam_proctoring",
+        name_template="{company} ExamGuard AI",
+        category="high_risk",
+        annex_ref="Annex III Category 3 — Education and Vocational Training",
+        description_template=(
+            "AI-powered online exam proctoring system used by {user_count} students "
+            "across {region}. Monitors webcam feeds, screen activity, and audio to "
+            "detect cheating behavior. Automated flagging can result in exam "
+            "invalidation and academic disciplinary proceedings."
+        ),
+        deployer_template="{company} — EdTech company, serving 200+ universities in {region}.",
+        domain_keywords=("exam", "proctoring", "student", "cheating", "academic"),
+    ),
+    SystemTemplate(
+        id="insurance_claims",
+        name_template="{company} ClaimIQ Adjudicator",
+        category="high_risk",
+        annex_ref="Annex III Category 5(c) — Access to Essential Services (Insurance)",
+        description_template=(
+            "AI system that evaluates and adjudicates insurance claims for health, "
+            "property, and vehicle policies. Processes {user_count} claims annually "
+            "in {region}. Automated decisions include claim approval, denial, and "
+            "payout amount determination up to EUR 100,000."
+        ),
+        deployer_template="{company} — InsurTech firm, licensed in {region}, {user_count} policyholders.",
+        domain_keywords=("insurance", "claim", "adjudication", "payout", "policy"),
+    ),
+    SystemTemplate(
+        id="legal_research",
+        name_template="{company} LexAssist AI",
+        category="limited_risk",
+        annex_ref="Article 50 — Transparency obligations for AI interacting with persons",
+        description_template=(
+            "AI-powered legal research assistant used by law firms across {region}. "
+            "Analyzes case law, statutes, and regulatory documents to provide "
+            "research summaries and case strategy suggestions. Used by {user_count} "
+            "attorneys. Does not make legal decisions — advisory role only."
+        ),
+        deployer_template="{company} — LegalTech startup, SaaS platform for law firms in {region}.",
+        domain_keywords=("legal", "research", "case law", "attorney", "advisory"),
+    ),
+    SystemTemplate(
+        id="predictive_policing",
+        name_template="{company} SafeCity Predictor",
+        category="prohibited",
+        annex_ref="Article 5(1)(d) — Prohibited: individual predictive policing",
+        description_template=(
+            "AI system deployed by municipal police in {region} to predict individual "
+            "criminal behavior based on personal characteristics, social network "
+            "analysis, and neighborhood data. Generates 'risk scores' for {user_count} "
+            "residents used to allocate patrol resources and inform stop-and-search decisions."
+        ),
+        deployer_template="{company} — public safety technology vendor, contracted by 5 municipalities in {region}.",
+        domain_keywords=("policing", "prediction", "crime", "risk score", "patrol"),
+    ),
+]
+# ---------------------------------------------------------------------------
+# Violation pool (16 violations across all tool areas)
+# ---------------------------------------------------------------------------
+VIOLATION_TEMPLATES: List[ViolationTemplate] = [
+    # --- Documentation violations ---
+    ViolationTemplate(
+        id="missing_annex_iv_sections",
+        tool_area="documentation",
+        finding_id="incomplete_annex_iv_documentation",
+        remediation_id="complete_missing_annex_iv_sections",
+        doc_injection=(
+            "ANNEX IV COMPLIANCE:\n"
+            "  Section 2(b) Design specifications:    Not documented\n"
+            "  Section 2(d) Performance per group:    Not documented\n"
+            "  Section 2(g) Quality management:       Not documented\n"
+            "  3 of 10 required sections are missing."
+        ),
+    ),
+    ViolationTemplate(
+        id="stale_documentation",
+        tool_area="documentation",
+        finding_id="outdated_technical_documentation",
+        remediation_id="update_documentation_to_current_version",
+        doc_injection=(
+            "NOTE: Core technical documentation was last updated 22 months ago\n"
+            "(prior to EU AI Act enforcement). It does not reference the AI Act,\n"
+            "harmonised standards, or current deployment configuration."
+        ),
+    ),
+    ViolationTemplate(
+        id="no_fria",
+        tool_area="documentation",
+        finding_id="missing_fundamental_rights_impact_assessment",
+        remediation_id="conduct_fundamental_rights_impact_assessment",
+        doc_injection="Fundamental Rights Impact Assessment:  ABSENT — not conducted",
+    ),
+    # --- Training data violations ---
+    ViolationTemplate(
+        id="gender_bias",
+        tool_area="training_data",
+        finding_id="gender_bias_in_automated_decisions",
+        remediation_id="conduct_bias_audit_and_mitigation",
+        doc_injection=(
+            "OUTCOME RATES BY GENDER:\n"
+            "  Group              Rate     Delta\n"
+            "  Male               41.3%    (baseline)\n"
+            "  Female             29.7%    -28.1%\n"
+            "  Non-binary         31.2%    -24.5%\n"
+            "\n"
+            "  Statistically significant disparity detected (p < 0.001)."
+        ),
+    ),
+    ViolationTemplate(
+        id="age_bias",
+        tool_area="training_data",
+        finding_id="age_discrimination_in_model_outputs",
+        remediation_id="recalibrate_model_for_age_fairness",
+        doc_injection=(
+            "PERFORMANCE BY AGE GROUP:\n"
+            "  Age 18-30:   accuracy 94.2%\n"
+            "  Age 31-50:   accuracy 91.8%\n"
+            "  Age 51-65:   accuracy 83.4%\n"
+            "  Age 66+:     accuracy 71.9%\n"
+            "\n"
+            "  Performance degrades significantly for older demographics."
+        ),
+    ),
+    ViolationTemplate(
+        id="no_data_governance",
+        tool_area="training_data",
+        finding_id="inadequate_data_governance_framework",
+        remediation_id="establish_article_10_data_governance",
+        doc_injection=(
+            "DATA GOVERNANCE (Article 10):\n"
+            "  Data quality assessment:         Not conducted\n"
+            "  Bias testing protocol:           Not established\n"
+            "  Data provenance documentation:   Incomplete (23 of 47 sources undocumented)\n"
+            "  Personal data handling:          No Article 10-specific provisions"
+        ),
+    ),
+    ViolationTemplate(
+        id="consent_issue",
+        tool_area="training_data",
+        finding_id="invalid_consent_for_training_data",
+        remediation_id="obtain_valid_consent_or_remove_data",
+        doc_injection=(
+            "CONSENT STATUS:\n"
+            "  Data collected under employer/institutional agreement.\n"
+            "  Individual subjects did not provide specific consent for AI\n"
+            "  training. Under EU labor/education law, consent given as a\n"
+            "  condition of employment/enrollment may not be freely given."
+        ),
+    ),
+    # --- Oversight violations ---
+    ViolationTemplate(
+        id="low_review_rate",
+        tool_area="oversight",
+        finding_id="insufficient_human_oversight_of_decisions",
+        remediation_id="implement_human_review_for_all_adverse_decisions",
+        doc_injection=(
+            "REVIEW STATISTICS:\n"
+            "  Automated decisions:       482,917\n"
+            "  Adverse outcomes:          144,875  (30.0%)\n"
+            "  Human-reviewed:              7,244  (5.0% of adverse)\n"
+            "  Review overrides:              362  (5.0% of reviews)\n"
+            "\n"
+            "  95% of adverse decisions receive no human review."
+        ),
+    ),
+    ViolationTemplate(
+        id="no_override",
+        tool_area="oversight",
+        finding_id="no_meaningful_override_capability",
+        remediation_id="implement_accessible_override_mechanism",
+        doc_injection=(
+            "OVERRIDE CAPABILITY:\n"
+            "  Technical override exists in admin panel but is not accessible\n"
+            "  to frontline operators. Override requires supervisor approval\n"
+            "  and written justification. Average override processing time:\n"
+            "  3.2 business days. Affected individuals cannot request override."
+        ),
+    ),
+    ViolationTemplate(
+        id="no_bias_monitoring",
+        tool_area="oversight",
+        finding_id="no_ongoing_bias_monitoring",
+        remediation_id="implement_continuous_fairness_monitoring",
+        doc_injection=(
+            "BIAS MONITORING:\n"
+            "  No automated fairness monitoring system in place.\n"
+            "  No alerts configured for demographic drift.\n"
+            "  Last manual fairness review: 14 months ago."
+        ),
+    ),
+    # --- Transparency violations ---
+    ViolationTemplate(
+        id="missing_ai_disclosure",
+        tool_area="transparency",
+        finding_id="missing_ai_system_disclosure",
+        remediation_id="implement_clear_ai_disclosure",
+        doc_injection=(
+            "USER-FACING DISCLOSURE AUDIT:\n"
+            "  Application interface:     No AI mention\n"
+            "  Terms of Service:          Generic 'automated tools' reference (Section 7)\n"
+            "  Privacy Policy:            No specific AI disclosure\n"
+            "  Decision notifications:    No mention of AI involvement\n"
+            "\n"
+            "  Article 50(1) requires informing persons they interact with AI."
+        ),
+    ),
+    ViolationTemplate(
+        id="no_explanation",
+        tool_area="transparency",
+        finding_id="no_right_to_explanation_mechanism",
+        remediation_id="implement_individualized_explanations",
+        doc_injection=(
+            "RIGHT TO EXPLANATION:\n"
+            "  No mechanism for affected individuals to request explanation\n"
+            "  of AI-assisted decisions. Support team provides templated\n"
+            "  responses listing generic factors, not individual-specific\n"
+            "  reasoning."
+        ),
+    ),
+    # --- Risk management violations ---
+    ViolationTemplate(
+        id="no_conformity",
+        tool_area="risk_management",
+        finding_id="missing_conformity_assessment",
+        remediation_id="complete_conformity_assessment_procedure",
+        doc_injection=(
+            "CONFORMITY ASSESSMENT:\n"
+            "  Internal assessment (Article 43):  Not initiated\n"
+            "  EU Declaration of Conformity:      Not filed\n"
+            "  CE marking:                        Not applied\n"
+            "  Quality management system:         Does not meet Article 17"
+        ),
+    ),
+    ViolationTemplate(
+        id="no_post_market",
+        tool_area="risk_management",
+        finding_id="missing_post_market_monitoring",
+        remediation_id="establish_post_market_monitoring_plan",
+        doc_injection=(
+            "POST-MARKET MONITORING (Article 72):\n"
+            "  Post-market monitoring plan:      ABSENT\n"
+            "  Incident reporting procedures:    Not established\n"
+            "  Corrective action criteria:       Not defined"
+        ),
+    ),
+    # --- Logging violations ---
+    ViolationTemplate(
+        id="incomplete_logging",
+        tool_area="logging",
+        finding_id="inadequate_automatic_logging",
+        remediation_id="implement_article_12_compliant_logging",
+        doc_injection=(
+            "ARTICLE 12 COMPLIANCE:\n"
+            "  Input features logged:         No\n"
+            "  Confidence scores logged:      No\n"
+            "  Decision rationale logged:     No\n"
+            "  Demographic tracking:          No\n"
+            "\n"
+            "  Only final decisions and timestamps are recorded.\n"
+            "  Cannot reconstruct why specific decisions were made."
+        ),
+    ),
+    ViolationTemplate(
+        id="short_retention",
+        tool_area="logging",
+        finding_id="insufficient_log_retention_period",
+        remediation_id="extend_retention_to_regulatory_minimum",
+        doc_injection=(
+            "LOG RETENTION:\n"
+            "  Decision logs:    30 days (regulatory minimum varies by sector)\n"
+            "  Audit trails:     30 days\n"
+            "  Model versions:   90 days\n"
+            "\n"
+            "  For high-risk systems, retention periods should support\n"
+            "  retrospective audits and regulatory investigations."
+        ),
+    ),
+]
+# ---------------------------------------------------------------------------
+# Red herring pool
+# ---------------------------------------------------------------------------
+RED_HERRING_TEMPLATES: List[RedHerringTemplate] = [
+    RedHerringTemplate(
+        id="gdpr_separate",
+        tool_area="training_data",
+        doc_injection=(
+            "NOTE: A separate GDPR compliance audit was conducted in 2025-Q2\n"
+            "and identified no critical findings. GDPR compliance is assessed\n"
+            "independently from EU AI Act obligations."
+        ),
+    ),
+    RedHerringTemplate(
+        id="separate_compliant_system",
+        tool_area="training_data",
+        doc_injection=(
+            "ADDITIONAL CONTEXT: The organization operates a separate AI system\n"
+            "for internal reporting that shares the same data infrastructure.\n"
+            "That system passed its own compliance audit (2025-08) with no findings.\n"
+            "Its compliance has no bearing on this system's obligations."
+        ),
+    ),
+    RedHerringTemplate(
+        id="iso_certification",
+        tool_area="documentation",
+        doc_injection=(
+            "The deployer holds ISO 27001 and ISO 42001 certifications.\n"
+            "These certifications cover information security and AI management\n"
+            "respectively, but do not constitute EU AI Act conformity assessment."
+        ),
+    ),
+    RedHerringTemplate(
+        id="voluntary_ethics_board",
+        tool_area="oversight",
+        doc_injection=(
+            "The deployer has established a voluntary AI Ethics Advisory Board\n"
+            "that meets quarterly to review ethical considerations.\n"
+            "The board's recommendations are non-binding and do not substitute\n"
+            "for the mandatory human oversight requirements of Article 14."
+        ),
+    ),
+    RedHerringTemplate(
+        id="high_accuracy_claim",
+        tool_area="risk_management",
+        doc_injection=(
+            "The deployer emphasizes that the system achieves 96.3% overall\n"
+            "accuracy on the test benchmark, exceeding industry standards.\n"
+            "However, aggregate accuracy does not address per-group performance\n"
+            "or the specific risk management requirements of Article 9."
+        ),
+    ),
+]
+# ---------------------------------------------------------------------------
+# Document templates per tool area
+# ---------------------------------------------------------------------------
+def _base_doc_template(area: str) -> str:
+    """Base document structure for each investigation tool area."""
+    templates = {
+        "documentation": (
+            "TECHNICAL DOCUMENTATION INVENTORY\n"
+            "System: {system_name} {version}\n"
+            "Deployer: {deployer}\n"
+            "Audit Date: {date}\n\n"
+            "DOCUMENT STATUS:\n"
+            "  System Architecture             Present    (updated 2025-10-15)\n"
+            "  Algorithm Description            Present    (updated 2025-09-22)\n"
+            "  Input/Output Specification       Present    (updated 2025-10-15)\n"
+            "  Performance Metrics (Aggregate)  Present    (updated 2025-11-01)\n"
+            "  Data Source Inventory            Present    (updated 2025-07-18)\n"
+            "  Change Management Log            Present    (updated 2025-11-28)\n\n"
+            "{violations}\n\n"
+            "{red_herrings}"
+        ),
+        "training_data": (
+            "TRAINING DATA AUDIT REPORT\n"
+            "System: {system_name} {version}\n"
+            "Records: {record_count} entries ({data_period})\n\n"
+            "DEMOGRAPHIC DISTRIBUTION:\n"
+            "  Category           Count        Pct\n"
+            "  Male               {male_count}    {male_pct}%\n"
+            "  Female             {female_count}    {female_pct}%\n"
+            "  Age 18-35          {young_count}    {young_pct}%\n"
+            "  Age 36-55          {mid_count}    {mid_pct}%\n"
+            "  Age 56+            {old_count}    {old_pct}%\n\n"
+            "{violations}\n\n"
+            "DATA SOURCES:\n"
+            "  {data_source_1}\n"
+            "  {data_source_2}\n\n"
+            "{red_herrings}"
+        ),
+        "oversight": (
+            "HUMAN OVERSIGHT PROCEDURES\n"
+            "System: {system_name} {version}\n"
+            "Department: Operations\n\n"
+            "DECISION WORKFLOW:\n"
+            "  1. Input data received and preprocessed\n"
+            "  2. AI model generates recommendation/decision\n"
+            "  3. Output delivered to end-user or downstream system\n\n"
+            "{violations}\n\n"
+            "{red_herrings}"
+        ),
+        "transparency": (
+            "TRANSPARENCY & DISCLOSURE REVIEW\n"
+            "System: {system_name} {version}\n\n"
+            "USER-FACING COMMUNICATIONS:\n"
+            "  The system's user interface and documentation were reviewed\n"
+            "  for compliance with EU AI Act transparency obligations.\n\n"
+            "{violations}\n\n"
+            "{red_herrings}"
+        ),
+        "risk_management": (
+            "RISK MANAGEMENT & CONFORMITY ASSESSMENT\n"
+            "System: {system_name} {version}\n\n"
+            "REGULATORY CLASSIFICATION:\n"
+            "  {annex_ref}\n\n"
+            "RISK LEVEL DETERMINATION: {risk_level}\n\n"
+            "{violations}\n\n"
+            "{red_herrings}"
+        ),
+        "logging": (
+            "LOGGING & TRACEABILITY REVIEW\n"
+            "System: {system_name} {version}\n\n"
+            "CURRENT LOGGING IMPLEMENTATION:\n"
+            "  Event Type              Logged   Retention\n"
+            "  Application received    Yes      {retention}\n"
+            "  Decision generated      Yes      {retention}\n"
+            "  Model version           Yes      Indefinite\n\n"
+            "{violations}\n\n"
+            "{red_herrings}"
+        ),
+    }
+    return templates.get(area, "")
+# ---------------------------------------------------------------------------
+# Procedural generator
+# ---------------------------------------------------------------------------
+# Difficulty → violation count range
+DIFFICULTY_VIOLATION_RANGE = {
+    "easy": (1, 2),
+    "medium": (3, 5),
+    "hard": (4, 6),
+}
+DIFFICULTY_RED_HERRING_RANGE = {
+    "easy": (0, 1),
+    "medium": (1, 2),
+    "hard": (2, 3),
+}
+def _build_procedural_graph(
+    investigation_tools: List[str],
+    is_prohibited: bool = False,
+) -> StateGraph:
+    """Build state graph for a procedural scenario (same logic as registry)."""
+    # Import the shared graph builder
+    from scenarios.registry import _build_scenario_graph
+    return _build_scenario_graph(investigation_tools, is_prohibited)
+def generate_procedural_scenario(
+    seed: int,
+    difficulty: str = "medium",
+) -> AuditScenario:
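+    """Generate a compliance audit scenario deterministically from a seed.
+    The seed drives the choice of system type, violations, red herrings, and
+    document parameters, so the same seed and difficulty reproduce the same
+    selections. The template pools are finite, so distinct seeds can select
+    the same system and violations. Ground truth findings and required
+    remediations are derived from the selected violations.
+    Args:
+        seed: Random seed for reproducible generation.
+        difficulty: "easy", "medium", or "hard".
+    Returns:
+        A fully populated AuditScenario ready for use.
+    Raises:
+        KeyError: If difficulty is not one of the three supported values.
+    """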
+    rng = random.Random(seed)
+    # 1. Pick system type
+    if difficulty == "easy":
+        candidates = [s for s in SYSTEM_TEMPLATES if s.category in ("limited_risk", "minimal_risk")]
+        if not candidates:
+            # No low-risk templates in the pool: fall back to every template.
+            candidates = list(SYSTEM_TEMPLATES)
+    elif difficulty == "hard":
+        candidates = [s for s in SYSTEM_TEMPLATES if s.category in ("prohibited", "high_risk")]
+    else:
+        candidates = list(SYSTEM_TEMPLATES)
+    system = rng.choice(candidates)
+    # 2. Pick violations
+    min_v, max_v = DIFFICULTY_VIOLATION_RANGE[difficulty]
+    n_violations = rng.randint(min_v, max_v)
+    available_violations = list(VIOLATION_TEMPLATES)
+    rng.shuffle(available_violations)
+    violations = available_violations[:n_violations]
+    # 3. Pick red herrings
+    min_r, max_r = DIFFICULTY_RED_HERRING_RANGE[difficulty]
+    n_red_herrings = rng.randint(min_r, max_r)
+    available_red_herrings = list(RED_HERRING_TEMPLATES)
+    rng.shuffle(available_red_herrings)
+    red_herrings = available_red_herrings[:n_red_herrings]
+    # 4. Generate randomized parameters
+    company_names = [
+        "TechNova Solutions", "QuantumLeap AI", "NeuralPath Inc",
+        "DataForge Systems", "CogniTech Labs", "AlphaWave AI",
+        "SynthMind Corp", "PrismAI Technologies", "Vertex Analytics",
+        "OmniSense AI", "DeepCurrent Inc", "StrataLogic Systems",
+        "AeroMind Labs", "CyberPulse Inc", "InnoVista AI",
+    ]
+    regions = ["EU-West (DE/FR/NL)", "EU-Central (DE/AT/CH)", "EU-North (SE/FI/DK)",
+               "EU-South (IT/ES/PT)", "EU-East (PL/CZ/RO)"]
+    company = rng.choice(company_names)
+    region = rng.choice(regions)
+    version = f"v{rng.randint(1,6)}.{rng.randint(0,9)}"
+    date = f"2026-{rng.randint(1,3):02d}-{rng.randint(1,28):02d}"
+    user_count = f"{rng.randint(10000, 5000000):,}"
+    system_name = system.name_template.format(company=company)
+    deployer = system.deployer_template.format(
+        company=company, region=region, user_count=user_count
+    )
+    description = system.description_template.format(
+        company=company, region=region, user_count=user_count
+    )
+    # 5. Group violations and red herrings by tool area
+    area_violations: Dict[str, List[str]] = {}
+    area_red_herrings: Dict[str, List[str]] = {}
+    for v in violations:
+        area_violations.setdefault(v.tool_area, []).append(v.doc_injection)
+    for r in red_herrings:
+        area_red_herrings.setdefault(r.tool_area, []).append(r.doc_injection)
+    # 6. Generate documents
+    # Derive demographic counts from one record total so that each row's
+    # count matches its percentage and each breakdown sums to 100%.
+    record_total = rng.randint(100000, 5000000)
+    male_pct = round(rng.uniform(55, 68), 1)
+    female_pct = round(100 - male_pct, 1)
+    young_pct = round(rng.uniform(28, 40), 1)
+    mid_pct = round(rng.uniform(35, 48), 1)
+    old_pct = round(100 - young_pct - mid_pct, 1)
+    def _count(pct: float) -> str:
+        return f"{round(record_total * pct / 100):,}"
+    fill_params = {
+        "system_name": system_name,
+        "version": version,
+        "deployer": deployer,
+        "date": date,
+        "annex_ref": system.annex_ref,
+        "risk_level": system.category.replace("_", " ").title(),
+        "record_count": f"{record_total:,}",
+        "data_period": f"20{rng.randint(19,23)}-2025",
+        "male_count": _count(male_pct),
+        "male_pct": f"{male_pct:.1f}",
+        "female_count": _count(female_pct),
+        "female_pct": f"{female_pct:.1f}",
+        "young_count": _count(young_pct),
+        "young_pct": f"{young_pct:.1f}",
+        "mid_count": _count(mid_pct),
+        "mid_pct": f"{mid_pct:.1f}",
+        "old_count": _count(old_pct),
+        "old_pct": f"{old_pct:.1f}",
+        "data_source_1": f"Primary: {rng.choice(['Enterprise API exports', 'Partner platform data', 'Direct user submissions'])}",
+        "data_source_2": f"Secondary: {rng.choice(['Public datasets (filtered)', 'Licensed commercial data', 'Internal test data'])}",
+        "retention": rng.choice(["5 years", "7 years", "3 years", "10 years"]),
+    }
+    def _build_doc(area: str) -> str:
+        template = _base_doc_template(area)
+        v_text = "\n\n".join(area_violations.get(area, ["(No issues identified in this area.)"]))
+        r_text = "\n\n".join(area_red_herrings.get(area, [""]))
+        filled = template.format(violations=v_text, red_herrings=r_text, **fill_params)
+        return filled
+    docs = {
+        "documentation_data": _build_doc("documentation"),
+        "training_data_info": _build_doc("training_data"),
+        "oversight_info": _build_doc("oversight"),
+        "transparency_info": _build_doc("transparency"),
+        "risk_assessment_info": _build_doc("risk_management"),
+        "logging_info": _build_doc("logging"),
+    }
+    # 7. Determine investigation tools (areas that have violations)
+    affected_areas = {v.tool_area for v in violations}
+    tool_map = {
+        "documentation": "check_documentation",
+        "training_data": "audit_training_data",
+        "oversight": "verify_human_oversight",
+        "transparency": "check_transparency",
+        "risk_management": "assess_risk_management",
+        "logging": "check_logging",
+    }
+    # tool_map keeps insertion order, so tools follow the canonical audit order.
+    investigation_tools = [
+        tool for area, tool in tool_map.items() if area in affected_areas
+    ]
+    # Ensure at least 2 investigation tools for meaningful audit
+    if len(investigation_tools) < 2:
+        extras = ["check_documentation", "check_transparency"]
+        for e in extras:
+            if e not in investigation_tools:
+                investigation_tools.append(e)
+            if len(investigation_tools) >= 2:
+                break
+    # 8. Build the scenario
+    scenario = AuditScenario(
+        scenario_id=f"procedural_{difficulty}_{seed:06d}",
+        title=f"Procedural: {system_name} ({difficulty.title()})",
+        difficulty=difficulty,
+        description=description,
+        system_name=system_name,
+        system_description=description,
+        system_category=system.category,
+        deployer_info=deployer,
+        correct_classification=system.category,
+        ground_truth_findings=[v.finding_id for v in violations],
+        required_remediation=[v.remediation_id for v in violations],
+        red_herrings=[r.id for r in red_herrings],
+        **docs,
+    )
+    # 9. Build state graph
+    scenario.graph = _build_procedural_graph(
+        investigation_tools=investigation_tools,
+        is_prohibited=(system.category == "prohibited"),
+    )
+    # 10. Randomize (adds company/region/version params)
+    scenario.randomize(seed)
+    return scenario
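
The selection steps in `generate_procedural_scenario` (seed a private `random.Random`, draw a count from the difficulty range, shuffle a copy of the pool, slice) can be tried in isolation. The sketch below is only an illustration of that pattern, not the module's code: `POOL`, `pick_violations`, and `group_by_area` are hypothetical stand-ins, the pool is a five-item subset of the violation ids in the diff, and capping the count at the pool size is an added assumption.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical miniature pool of (violation_id, tool_area) pairs.
POOL: List[Tuple[str, str]] = [
    ("no_fria", "documentation"),
    ("gender_bias", "training_data"),
    ("no_override", "oversight"),
    ("missing_ai_disclosure", "transparency"),
    ("short_retention", "logging"),
]

# Same (min, max) ranges as DIFFICULTY_VIOLATION_RANGE in the diff.
VIOLATION_RANGE: Dict[str, Tuple[int, int]] = {
    "easy": (1, 2),
    "medium": (3, 5),
    "hard": (4, 6),
}


def pick_violations(seed: int, difficulty: str) -> List[Tuple[str, str]]:
    """Pick violations reproducibly: same seed and difficulty, same result."""
    rng = random.Random(seed)  # private RNG, global random state untouched
    low, high = VIOLATION_RANGE[difficulty]
    # Assumption: never ask for more items than the pool holds.
    count = min(rng.randint(low, high), len(POOL))
    shuffled = list(POOL)  # copy so the module-level pool is not reordered
    rng.shuffle(shuffled)
    return shuffled[:count]


def group_by_area(picked: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group picked violation ids by the tool area whose document shows them."""
    grouped: Dict[str, List[str]] = {}
    for violation_id, area in picked:
        grouped.setdefault(area, []).append(violation_id)
    return grouped


if __name__ == "__main__":
    picked = pick_violations(seed=7, difficulty="medium")
    print(picked)
    print(group_by_area(picked))
```

Shuffling a copy and slicing samples without replacement, so a scenario never lists the same violation twice, and the per-call `random.Random(seed)` keeps episodes reproducible regardless of other code that uses the global generator.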

scenarios/registry.py CHANGED Viewed

The diff for this file is too large to render. See raw diff

server/engine.py CHANGED Viewed

@@ -149,13 +149,21 @@ class AuditScenario:
     required_remediation: List[str] = field(default_factory=list)
     red_herrings: List[str] = field(default_factory=list)
-    # Tool-specific data (returned when agent calls tools)
-    documentation_data: Dict[str, Any] = field(default_factory=dict)
-    training_data_info: Dict[str, Any] = field(default_factory=dict)
-    oversight_info: Dict[str, Any] = field(default_factory=dict)
-    transparency_info: Dict[str, Any] = field(default_factory=dict)
-    risk_assessment_info: Dict[str, Any] = field(default_factory=dict)
-    logging_info: Dict[str, Any] = field(default_factory=dict)
+    # Investigation documents (rich text requiring analysis — no pre-digested verdicts)
+    documentation_data: str = ""
+    training_data_info: str = ""
+    oversight_info: str = ""
+    transparency_info: str = ""
+    risk_assessment_info: str = ""
+    logging_info: str = ""
+    # Deep-dive documents (revealed on repeat tool calls — adaptive depth)
+    deep_documentation: str = ""
+    deep_training_data: str = ""
+    deep_oversight: str = ""
+    deep_transparency: str = ""
+    deep_risk_assessment: str = ""
+    deep_logging: str = ""
     # Randomization parameters (re-rolled on each reset)
     _rand_params: Dict[str, str] = field(default_factory=dict)
@@ -176,6 +184,9 @@ class AuditScenario:
             "company": rng.choice(company_names),
             "region": rng.choice(regions),
             "version": rng.choice(versions),
+            "date": f"2026-{rng.randint(1,3):02d}-{rng.randint(1,28):02d}",
+            "usercount": f"{rng.randint(10000, 5000000):,}",
+            # Keep old keys for backwards compat with get_param()
             "deployment_date": f"2026-{rng.randint(1,3):02d}-{rng.randint(1,28):02d}",
             "user_count": str(rng.randint(10000, 5000000)),
         }
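
The paired base and `deep_*` fields suggest a lookup in which the first call to a tool returns the base document and repeat calls return the deep-dive text. The environment code that performs this is not visible in this part of the diff, so the following is only a sketch under assumptions: `ScenarioDocs`, `DepthTracker`, and `read` are hypothetical names, and falling back to the base text when no deep-dive document exists is an assumed policy.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ScenarioDocs:
    """Hypothetical two-field subset of the AuditScenario document fields."""
    documentation_data: str = ""
    deep_documentation: str = ""


@dataclass
class DepthTracker:
    """Counts calls per document field and picks base or deep-dive text."""
    docs: ScenarioDocs
    calls: Dict[str, int] = field(default_factory=dict)

    def read(self, base_field: str, deep_field: str) -> str:
        previous_calls = self.calls.get(base_field, 0)
        self.calls[base_field] = previous_calls + 1
        base = getattr(self.docs, base_field)
        deep = getattr(self.docs, deep_field)
        # First call, or no deep-dive text available: return the base document.
        if previous_calls == 0 or not deep:
            return base
        return deep


if __name__ == "__main__":
    tracker = DepthTracker(
        ScenarioDocs(
            documentation_data="TECHNICAL DOCUMENTATION INVENTORY",
            deep_documentation="FORENSIC DEEP DIVE",
        )
    )
    print(tracker.read("documentation_data", "deep_documentation"))
    print(tracker.read("documentation_data", "deep_documentation"))
```

Keeping the call counter outside the scenario object means a reset only needs to clear the tracker, while the scenario's document fields stay read-only.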

server/environment.py CHANGED Viewed

@@ -1,11 +1,18 @@
 """
 EU AI Act Compliance Auditor — MCP Environment.
-Registers 10 MCP tools that the agent uses to audit AI systems for EU AI Act
-compliance. State-graph tracks audit progress. Terminal reward computed on
-verify_compliance with 6-component scoring.
-Tools:
   Investigation: get_system_overview, classify_system, check_documentation,
                  audit_training_data, verify_human_oversight, check_transparency,
                  assess_risk_management, check_logging
@@ -73,6 +80,7 @@ class ComplianceAuditorEnvironment(Environment):
         self._findings_submitted: List[str] = []
         self._remediation_submitted: List[str] = []
         self._discovered_info: Dict[str, bool] = {}
         # Progress tracking for state graph
         self._max_progress_depth: int = 0
@@ -251,6 +259,87 @@ class ComplianceAuditorEnvironment(Environment):
             step_count=self._step_count,
         )
     # ------------------------------------------------------------------
     # Tool implementations
     # ------------------------------------------------------------------
@@ -313,19 +402,32 @@ class ComplianceAuditorEnvironment(Environment):
         outcome = self._advance_state("get_system_overview")
         s = self._scenario
-        result = {
-            "system_name": s.system_name,
-            "description": s.system_description,
-            "deployer": s.deployer_info,
-            "category_claim": s.system_category if s.difficulty == "easy" else "To be determined by auditor",
-            "deployment_date": s.get_param("deployment_date"),
-            "region": s.get_param("region"),
-            "user_count": s.get_param("user_count"),
-            "company": s.get_param("company"),
-            "version": s.get_param("version"),
             "queries_remaining": QUERY_BUDGET - self._queries_used,
-        }
-        return json.dumps(result, indent=2)
     def _tool_classify_system(self, risk_category: str) -> str:
         budget_err = self._use_query()
@@ -350,17 +452,105 @@ class ComplianceAuditorEnvironment(Environment):
             "queries_remaining": QUERY_BUDGET - self._queries_used,
         })
     def _tool_check_documentation(self) -> str:
         budget_err = self._use_query()
         if budget_err:
             return budget_err
         self._discovered_info["documentation"] = True
         self._observation_after_investigation += 1
-        outcome = self._advance_state("check_documentation")
-        return json.dumps({
-            "documentation_review": self._scenario.documentation_data,
-            "queries_remaining": QUERY_BUDGET - self._queries_used,
-        }, indent=2)
     def _tool_audit_training_data(self) -> str:
         budget_err = self._use_query()
@@ -368,11 +558,8 @@ class ComplianceAuditorEnvironment(Environment):
             return budget_err
         self._discovered_info["training_data"] = True
         self._observation_after_investigation += 1
-        outcome = self._advance_state("audit_training_data")
-        return json.dumps({
-            "training_data_audit": self._scenario.training_data_info,
-            "queries_remaining": QUERY_BUDGET - self._queries_used,
-        }, indent=2)
     def _tool_verify_human_oversight(self) -> str:
         budget_err = self._use_query()
@@ -380,11 +567,8 @@ class ComplianceAuditorEnvironment(Environment):
             return budget_err
         self._discovered_info["oversight"] = True
         self._observation_after_investigation += 1
-        outcome = self._advance_state("verify_human_oversight")
-        return json.dumps({
-            "oversight_assessment": self._scenario.oversight_info,
-            "queries_remaining": QUERY_BUDGET - self._queries_used,
-        }, indent=2)
     def _tool_check_transparency(self) -> str:
         budget_err = self._use_query()
@@ -392,11 +576,8 @@ class ComplianceAuditorEnvironment(Environment):
             return budget_err
         self._discovered_info["transparency"] = True
         self._observation_after_investigation += 1
-        outcome = self._advance_state("check_transparency")
-        return json.dumps({
-            "transparency_assessment": self._scenario.transparency_info,
-            "queries_remaining": QUERY_BUDGET - self._queries_used,
-        }, indent=2)
     def _tool_assess_risk_management(self) -> str:
         budget_err = self._use_query()
@@ -404,11 +585,8 @@ class ComplianceAuditorEnvironment(Environment):
             return budget_err
         self._discovered_info["risk_management"] = True
         self._observation_after_investigation += 1
-        outcome = self._advance_state("assess_risk_management")
-        return json.dumps({
-            "risk_assessment": self._scenario.risk_assessment_info,
-            "queries_remaining": QUERY_BUDGET - self._queries_used,
-        }, indent=2)
     def _tool_check_logging(self) -> str:
         budget_err = self._use_query()
@@ -416,11 +594,8 @@ class ComplianceAuditorEnvironment(Environment):
             return budget_err
         self._discovered_info["logging"] = True
         self._observation_after_investigation += 1
-        outcome = self._advance_state("check_logging")
-        return json.dumps({
-            "logging_assessment": self._scenario.logging_info,
-            "queries_remaining": QUERY_BUDGET - self._queries_used,
-        }, indent=2)
     def _tool_submit_finding(self, finding: str, severity: str = "high") -> str:
         budget_err = self._use_query()
@@ -428,12 +603,58 @@ class ComplianceAuditorEnvironment(Environment):
             return budget_err
         self._findings_submitted.append(finding.lower().strip())
         outcome = self._advance_state("submit_finding")
-        return json.dumps({
             "finding_recorded": finding,
             "severity": severity,
             "total_findings": len(self._findings_submitted),
             "queries_remaining": QUERY_BUDGET - self._queries_used,
-        })
     def _tool_recommend_fix(self, finding: str, remediation: str, priority: int = 1) -> str:
         budget_err = self._use_query()
@@ -473,16 +694,53 @@ class ComplianceAuditorEnvironment(Environment):
         self._reward = breakdown.total()
-        return json.dumps({
             "done": True,
             "reward": self._reward,
-            "assessment_recorded": overall_assessment[:200],
             "reward_breakdown": breakdown.to_dict(),
-            "findings_submitted": len(self._findings_submitted),
-            "remediations_submitted": len(self._remediation_submitted),
-            "queries_used": self._queries_used,
-            "episode_duration_seconds": round(time.time() - self._start_time, 1),
-        }, indent=2)
     def close(self) -> None:
         pass

 """
 EU AI Act Compliance Auditor — MCP Environment.
+Investigation-grade environment where LLM agents audit AI systems for EU AI Act
+compliance. Tools return realistic regulatory documents (30-70 lines each) requiring
+genuine analysis — no pre-digested verdicts.
+Key features:
+  - Adaptive depth: repeat tool calls reveal forensic deep-dive content
+  - Dynamic state: environment responds to findings and remediation proposals
+  - Evidence chain validation: warns when findings lack supporting investigation
+  - 6-component terminal reward with anti-gaming checks (validated by 12 adversarial tests)
+  - 6 unique state graph topologies across 9 scenarios
+Tools (11):
   Investigation: get_system_overview, classify_system, check_documentation,
                  audit_training_data, verify_human_oversight, check_transparency,
                  assess_risk_management, check_logging
         self._findings_submitted: List[str] = []
         self._remediation_submitted: List[str] = []
         self._discovered_info: Dict[str, bool] = {}
+        self._tool_call_counts: Dict[str, int] = {}  # track repeat calls per tool
         # Progress tracking for state graph
         self._max_progress_depth: int = 0
             step_count=self._step_count,
         )
+    # ------------------------------------------------------------------
+    # Document rendering
+    # ------------------------------------------------------------------
+    def _render_doc(self, template: str) -> str:
+        """Replace __PLACEHOLDER__ tokens with randomized scenario params
+        and inject seed-based noise for truly unique documents per episode."""
+        result = template
+        if self._scenario and self._scenario._rand_params:
+            for key, val in self._scenario._rand_params.items():
+                result = result.replace(f"__{key.upper()}__", str(val))
+        # Seed-based noise injection: vary specific numbers slightly
+        # so no two episodes produce identical documents
+        if self._scenario:
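+            # Seed with the string itself: built-in hash() of a str varies per process,
+            # which would make the "seed-based" noise non-reproducible across runs.
+            rng = random.Random(str(self._episode_id))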
+            result = self._inject_noise(result, rng)
+        return result
+    def _inject_noise(self, text: str, rng: random.Random) -> str:
+        """Inject seed-based perturbations into document text.
+        Varies percentages and counts slightly (within realistic ranges)
+        to ensure every episode is genuinely unique, not just parameter swaps.
+        The violations remain detectable but exact numbers change.
+        """
+        import re
+        def _perturb_pct(match: re.Match) -> str:
+            """Perturb a percentage value by +-2 percentage points."""
+            val = float(match.group(1))
+            delta = rng.uniform(-2.0, 2.0)
+            new_val = max(0.1, min(99.9, val + delta))
+            return f"{new_val:.1f}%"
+        def _perturb_count(match: re.Match) -> str:
+            """Perturb a large count by +-5%."""
+            val = int(match.group(1).replace(",", ""))
+            if val < 100:
+                return match.group(0)  # don't perturb small numbers
+            delta = rng.uniform(-0.05, 0.05)
+            new_val = int(val * (1 + delta))
+            if val >= 1000:
+                return f"{new_val:,}"
+            return str(new_val)
+        # Perturb percentages (e.g., "34.2%" -> "35.1%")
+        text = re.sub(r'(\d{1,2}\.\d)%', _perturb_pct, text)
+        # Perturb large counts (e.g., "1,342,104" -> "1,378,921")
+        text = re.sub(r'(\d{1,3}(?:,\d{3})+)', _perturb_count, text)
+        return text
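The noise injection above can be tried in isolation. The following is a minimal sketch, not the environment's code: `stable_rng` and the sample document are hypothetical, and deriving the seed from a SHA-256 digest is an assumption made here so that the same episode ID yields the same perturbations in every process (Python randomizes `hash()` of strings per process unless `PYTHONHASHSEED` is set).

```python
import hashlib
import random
import re


def stable_rng(episode_id: str) -> random.Random:
    """Hypothetical helper: build an RNG whose seed is stable across processes."""
    digest = hashlib.sha256(episode_id.encode("utf-8")).hexdigest()
    return random.Random(int(digest, 16))


def inject_noise(text: str, rng: random.Random) -> str:
    """Perturb percentages by +/-2 points and comma-grouped counts by +/-5%."""

    def perturb_pct(match: re.Match) -> str:
        val = float(match.group(1))
        new_val = max(0.1, min(99.9, val + rng.uniform(-2.0, 2.0)))
        return f"{new_val:.1f}%"

    def perturb_count(match: re.Match) -> str:
        val = int(match.group(1).replace(",", ""))
        return f"{int(val * (1 + rng.uniform(-0.05, 0.05))):,}"

    # Same two patterns as the diff: one-decimal percentages, then comma-grouped counts.
    text = re.sub(r"(\d{1,2}\.\d)%", perturb_pct, text)
    return re.sub(r"(\d{1,3}(?:,\d{3})+)", perturb_count, text)


# Illustrative input only; the figures are borrowed from the dashboard example.
doc = "Female applicants 26.3% of 208,375 auto-rejected"
first = inject_noise(doc, stable_rng("episode-42"))
second = inject_noise(doc, stable_rng("episode-42"))
```

Because both calls start from an identically seeded RNG, `first` and `second` are the same string, while a different episode ID would generally produce different numbers.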
+    def _audit_progress_section(self) -> str:
+        """Dynamic audit progress appended to tool responses.
+        After the agent starts submitting findings, this section appears
+        in subsequent tool responses, showing what's been found so far.
+        Makes the environment feel responsive and alive.
+        """
+        parts = []
+        if self._classification_submitted:
+            parts.append(f"  Classification submitted: {self._classification_submitted.replace('_', ' ').title()}")
+        if self._findings_submitted:
+            parts.append(f"  Findings submitted: {len(self._findings_submitted)}")
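+            # Show the three most recent findings, numbered by submission order
+            first_shown = max(1, len(self._findings_submitted) - 2)
+            for i, f in enumerate(self._findings_submitted[-3:], first_shown):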
+                parts.append(f"    {i}. {f[:80]}")
+        if self._remediation_submitted:
+            parts.append(f"  Remediations proposed: {len(self._remediation_submitted)}")
+        areas = []
+        for area, checked in self._discovered_info.items():
+            if checked:
+                areas.append(area)
+        if areas:
+            parts.append(f"  Areas investigated: {', '.join(areas)}")
+        if not parts:
+            return ""
+        return "\n\nAUDIT PROGRESS:\n" + "\n".join(parts)
     # ------------------------------------------------------------------
     # Tool implementations
     # ------------------------------------------------------------------
         outcome = self._advance_state("get_system_overview")
         s = self._scenario
+        # Build investigation brief
+        brief = (
+            f"COMPLIANCE AUDIT ASSIGNMENT\n"
+            f"{'=' * 60}\n"
+            f"System:     {s.system_name} {s.get_param('version')}\n"
+            f"Deployer:   {self._render_doc(s.deployer_info)}\n"
+            f"Region:     {s.get_param('region')}\n"
+            f"Users:      {s.get_param('user_count')}\n"
+            f"Deployed:   {s.get_param('deployment_date')}\n"
+            f"{'=' * 60}\n\n"
+            f"SYSTEM DESCRIPTION:\n"
+            f"  {self._render_doc(s.system_description)}\n\n"
+            f"DEPLOYER'S CLAIMED RISK CATEGORY: "
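+            # chr(34) is a double quote; a backslash inside an f-string expression is a SyntaxError before Python 3.12
+            f"{(chr(34) + s.system_category.replace('_', ' ').title() + chr(34)) if s.difficulty == 'easy' else 'Not provided — to be determined by auditor'}\n\n"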
+            f"AUDIT SCOPE:\n"
+            f"  Conduct a thorough compliance audit under the EU AI Act.\n"
+            f"  Investigate documentation, training data, human oversight,\n"
+            f"  transparency, risk management, and logging as applicable.\n"
+            f"  Submit findings and remediation recommendations.\n\n"
+            f"  Tools available: {QUERY_BUDGET - self._queries_used} queries remaining."
+        )
+        return json.dumps({
+            "document_type": "System Overview & Audit Assignment",
+            "content": brief,
             "queries_remaining": QUERY_BUDGET - self._queries_used,
+        }, indent=2)
     def _tool_classify_system(self, risk_category: str) -> str:
         budget_err = self._use_query()
             "queries_remaining": QUERY_BUDGET - self._queries_used,
         })
+    def _remediation_overlay(self, area: str) -> str:
+        """Generate post-remediation overlay content for a re-investigated area.
+        When the agent recommends fixes and then re-checks a tool, the
+        environment shows how the proposed remediation would affect the area.
+        This makes the environment feel like a living system that responds
+        to the agent's actions.
+        """
+        if not self._remediation_submitted:
+            return ""
+        # Map areas to relevant remediation keywords
+        area_keywords = {
+            "documentation": ["documentation", "annex_iv", "technical_doc", "document"],
+            "training_data": ["bias", "audit", "data_governance", "training", "demographic"],
+            "oversight": ["human_review", "oversight", "human_oversight", "monitor"],
+            "transparency": ["disclosure", "transparency", "labeling", "notification"],
+            "risk_management": ["conformity", "risk_management", "assessment", "risk"],
+            "logging": ["logging", "traceability", "audit_trail", "record"],
+        }
+        relevant_remediations = []
+        keywords = area_keywords.get(area, [])
+        for rem in self._remediation_submitted:
+            if any(kw in rem for kw in keywords):
+                relevant_remediations.append(rem)
+        if not relevant_remediations:
+            return ""
+        lines = ["\n\nREMEDIATION STATUS UPDATE:"]
+        lines.append("  The following remediation actions have been proposed for this area:")
+        for i, rem in enumerate(relevant_remediations, 1):
+            lines.append(f"  {i}. {rem}")
+        lines.append("  Status: PROPOSED (pending implementation)")
+        lines.append("  Note: These are recommendations only. Re-investigation reflects")
+        lines.append("  the current pre-remediation state of the system.")
+        return "\n".join(lines)
+    def _get_deep_content(self, area: str) -> str:
+        """Get deep-dive content for repeat investigation calls."""
+        deep_map = {
+            "documentation": self._scenario.deep_documentation,
+            "training_data": self._scenario.deep_training_data,
+            "oversight": self._scenario.deep_oversight,
+            "transparency": self._scenario.deep_transparency,
+            "risk_management": self._scenario.deep_risk_assessment,
+            "logging": self._scenario.deep_logging,
+        }
+        return deep_map.get(area, "")
+    def _investigation_response(self, doc_type: str, content: str, area: str = "") -> str:
+        """Standard response format for investigation tools with dynamic state.
+        Features adaptive depth: repeat calls to the same tool reveal deeper
+        forensic analysis, additional statistics, and drill-down detail that
+        wasn't visible on the first pass.
+        """
+        # Track call count for adaptive depth
+        tool_key = area or doc_type
+        self._tool_call_counts[tool_key] = self._tool_call_counts.get(tool_key, 0) + 1
+        call_count = self._tool_call_counts[tool_key]
+        rendered = self._render_doc(content)
+        # Adaptive depth: on repeat calls, append deep-dive content
+        if call_count >= 2 and area:
+            deep = self._get_deep_content(area)
+            if deep:
+                rendered += "\n\n" + self._render_doc(deep)
+        # Add remediation overlay if agent has proposed fixes for this area
+        overlay = self._remediation_overlay(area)
+        if overlay:
+            rendered += overlay
+        # Add audit progress section
+        progress = self._audit_progress_section()
+        if progress:
+            rendered += progress
+        result = {
+            "document_type": doc_type,
+            "content": rendered,
+            "queries_remaining": QUERY_BUDGET - self._queries_used,
+        }
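+        # Only claim a deep dive when deep-dive content was actually appended
+        if call_count >= 2 and area and self._get_deep_content(area):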
+            result["note"] = "DEEP DIVE: Additional forensic detail revealed on re-investigation."
+        return json.dumps(result, indent=2)
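The adaptive-depth behaviour described in the docstring above (first call returns the base document, repeat calls append deep-dive content) can be sketched on its own. `DepthTracker` and its sample documents are hypothetical stand-ins for the environment and scenario objects; attaching the note only when deep content exists is a choice made in this sketch.

```python
import json
from typing import Dict


class DepthTracker:
    """Hypothetical stand-in for the environment's per-tool call counter."""

    def __init__(self, base: Dict[str, str], deep: Dict[str, str]) -> None:
        self.base = base
        self.deep = deep
        self.calls: Dict[str, int] = {}

    def investigate(self, area: str) -> str:
        # Count this call, then decide whether the deep-dive layer is unlocked.
        self.calls[area] = self.calls.get(area, 0) + 1
        content = self.base[area]
        deep = self.deep.get(area, "")
        revealed = self.calls[area] >= 2 and bool(deep)
        if revealed:
            content += "\n\n" + deep
        result = {"content": content}
        if revealed:
            result["note"] = "DEEP DIVE: additional detail revealed on re-investigation."
        return json.dumps(result)


tracker = DepthTracker(
    base={"logging": "Log retention summary", "oversight": "Reviewer roster"},
    deep={"logging": "Per-service retention table"},
)
first_pass = json.loads(tracker.investigate("logging"))
second_pass = json.loads(tracker.investigate("logging"))
tracker.investigate("oversight")
no_deep_again = json.loads(tracker.investigate("oversight"))
```

Here `second_pass` carries the appended table and the note, while `no_deep_again` stays unchanged because no deep content is registered for that area.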
     def _tool_check_documentation(self) -> str:
         budget_err = self._use_query()
         if budget_err:
             return budget_err
         self._discovered_info["documentation"] = True
         self._observation_after_investigation += 1
+        self._advance_state("check_documentation")
+        return self._investigation_response("Technical Documentation Review", self._scenario.documentation_data, "documentation")
     def _tool_audit_training_data(self) -> str:
         budget_err = self._use_query()
             return budget_err
         self._discovered_info["training_data"] = True
         self._observation_after_investigation += 1
+        self._advance_state("audit_training_data")
+        return self._investigation_response("Training Data Audit Report", self._scenario.training_data_info, "training_data")
     def _tool_verify_human_oversight(self) -> str:
         budget_err = self._use_query()
             return budget_err
         self._discovered_info["oversight"] = True
         self._observation_after_investigation += 1
+        self._advance_state("verify_human_oversight")
+        return self._investigation_response("Human Oversight Assessment", self._scenario.oversight_info, "oversight")
     def _tool_check_transparency(self) -> str:
         budget_err = self._use_query()
             return budget_err
         self._discovered_info["transparency"] = True
         self._observation_after_investigation += 1
+        self._advance_state("check_transparency")
+        return self._investigation_response("Transparency & Disclosure Review", self._scenario.transparency_info, "transparency")
     def _tool_assess_risk_management(self) -> str:
         budget_err = self._use_query()
             return budget_err
         self._discovered_info["risk_management"] = True
         self._observation_after_investigation += 1
+        self._advance_state("assess_risk_management")
+        return self._investigation_response("Risk Management & Conformity Assessment", self._scenario.risk_assessment_info, "risk_management")
     def _tool_check_logging(self) -> str:
         budget_err = self._use_query()
             return budget_err
         self._discovered_info["logging"] = True
         self._observation_after_investigation += 1
+        self._advance_state("check_logging")
+        return self._investigation_response("Logging & Traceability Review", self._scenario.logging_info, "logging")
     def _tool_submit_finding(self, finding: str, severity: str = "high") -> str:
         budget_err = self._use_query()
             return budget_err
         self._findings_submitted.append(finding.lower().strip())
         outcome = self._advance_state("submit_finding")
+        # Evidence chain validation — check if agent investigated relevant areas
+        evidence_warnings = []
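+        # Normalize separators so free-text findings ("Article 14", "human oversight")
+        # match the snake_case keywords in EVIDENCE_MAP
+        finding_lower = finding.lower().replace(" ", "_").replace("-", "_")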
+        EVIDENCE_MAP = {
+            "bias": "training_data",
+            "discrimination": "training_data",
+            "data_governance": "training_data",
+            "callback": "training_data",
+            "demographic": "training_data",
+            "oversight": "oversight",
+            "human_review": "oversight",
+            "human_oversight": "oversight",
+            "article_14": "oversight",
+            "documentation": "documentation",
+            "annex_iv": "documentation",
+            "technical_doc": "documentation",
+            "transparency": "transparency",
+            "disclosure": "transparency",
+            "article_50": "transparency",
+            "labeling": "transparency",
+            "watermark": "transparency",
+            "risk_management": "risk_management",
+            "conformity": "risk_management",
+            "article_9": "risk_management",
+            "logging": "logging",
+            "traceability": "logging",
+            "article_12": "logging",
+            "audit_trail": "logging",
+        }
+        relevant_areas = set()
+        for keyword, area in EVIDENCE_MAP.items():
+            if keyword in finding_lower:
+                relevant_areas.add(area)
+        uninvestigated = [a for a in relevant_areas if not self._discovered_info.get(a)]
+        if uninvestigated:
+            evidence_warnings.append(
+                f"Note: Finding references {', '.join(uninvestigated)} "
+                f"but you have not investigated {'this area' if len(uninvestigated) == 1 else 'these areas'} yet. "
+                f"Findings are stronger when supported by evidence from investigation tools."
+            )
+        result = {
             "finding_recorded": finding,
             "severity": severity,
             "total_findings": len(self._findings_submitted),
             "queries_remaining": QUERY_BUDGET - self._queries_used,
+        }
+        if evidence_warnings:
+            result["evidence_warnings"] = evidence_warnings
+        return json.dumps(result)
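The evidence-chain check above reduces to a keyword lookup followed by a set difference. A minimal sketch under stated assumptions: the map is a trimmed subset of the one in the diff, `uninvestigated_areas` is a hypothetical name, and normalizing spaces and hyphens to underscores is an assumption added here so that free-text findings can match snake_case keywords.

```python
from typing import Dict, List

# Trimmed subset of the keyword -> investigation-area map shown in the diff.
EVIDENCE_MAP: Dict[str, str] = {
    "bias": "training_data",
    "oversight": "oversight",
    "article_14": "oversight",
    "logging": "logging",
}


def uninvestigated_areas(finding: str, discovered: Dict[str, bool]) -> List[str]:
    """Return the areas a finding refers to that have not been investigated yet."""
    normalized = finding.lower().replace(" ", "_").replace("-", "_")
    areas = {area for keyword, area in EVIDENCE_MAP.items() if keyword in normalized}
    return sorted(area for area in areas if not discovered.get(area))


gaps = uninvestigated_areas(
    "Gender bias with no human oversight (Article 14)",
    {"training_data": True},
)
```

In this example the finding touches both training data and oversight, but only oversight is reported because the training data audit has already been run.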
     def _tool_recommend_fix(self, finding: str, remediation: str, priority: int = 1) -> str:
         budget_err = self._use_query()
         self._reward = breakdown.total()
+        # Build detailed audit report showing what was found vs missed
+        ground_truth = self._scenario.ground_truth_findings
+        found_count = 0
+        missed = []
+        for gt in ground_truth:
+            gt_lower = gt.lower()
+            gt_tokens = set(gt_lower.replace("-", "_").split("_")) - {""}
+            matched = False
+            for sub in self._findings_submitted:
+                sub_tokens = set(sub.replace("-", "_").split("_")) - {""}
+                overlap = len(gt_tokens & sub_tokens)
+                if overlap >= 2 or (gt_tokens and overlap / len(gt_tokens) >= 0.4):
+                    matched = True
+                    break
+            if matched:
+                found_count += 1
+            else:
+                missed.append(gt)
+        # Classification feedback
+        correct_class = self._scenario.correct_classification.lower()
+        class_correct = self._classification_submitted == correct_class
+        audit_report = {
             "done": True,
             "reward": self._reward,
             "reward_breakdown": breakdown.to_dict(),
+            "audit_summary": {
+                "classification": {
+                    "submitted": self._classification_submitted or "(none)",
+                    "correct": correct_class,
+                    "match": "exact" if class_correct else "partial" if breakdown.classification > 0 else "wrong",
+                },
+                "findings": {
+                    "submitted": len(self._findings_submitted),
+                    "ground_truth_total": len(ground_truth),
+                    "matched": found_count,
+                    "missed": missed,
+                },
+                "remediation_count": len(self._remediation_submitted),
+                "areas_investigated": [k for k, v in self._discovered_info.items() if v],
+                "tool_calls_used": self._queries_used,
+                "episode_duration_seconds": round(time.time() - self._start_time, 1),
+            },
+        }
+        return json.dumps(audit_report, indent=2)
     def close(self) -> None:
         pass
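
The found-versus-missed report above matches each ground-truth finding against submissions by token overlap: a match needs at least two shared tokens, or shared tokens covering at least 40% of the ground-truth label. A minimal sketch of that rule, assuming snake_case labels; the function names and the example labels are hypothetical.

```python
from typing import Set


def tokens(label: str) -> Set[str]:
    """Split a snake_case or hyphenated label into lowercase tokens."""
    return set(label.lower().replace("-", "_").split("_")) - {""}


def matches(ground_truth: str, submitted: str) -> bool:
    """Apply the overlap rule: >= 2 shared tokens, or >= 40% of ground-truth tokens."""
    gt, sub = tokens(ground_truth), tokens(submitted)
    overlap = len(gt & sub)
    return overlap >= 2 or (bool(gt) and overlap / len(gt) >= 0.4)


two_shared = matches("insufficient_human_oversight", "human_oversight_gap")
ratio_hit = matches("no_logging", "logging")
ratio_miss = matches("missing_annex_iv_documentation", "documentation_gap")
```

Note that splitting only on underscores means a submission written with spaces stays a single token, so this rule assumes findings are phrased as snake_case labels.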

server/gradio_landing.py CHANGED Viewed

@@ -270,8 +270,8 @@ def _audit_flow_html(scenario_id: str) -> str:
 def _hero_html() -> str:
     stats = [
-        ("SCENARIOS", "8"), ("MCP TOOLS", "11"), ("REWARD COMPS", "6"),
-        ("TIERS", "3"), ("QUERY BUDGET", "100"), ("EU DEADLINE", "Aug '26"),
     ]
     stat_boxes = "".join(
         f'<div class="stat"><div class="val">{v}</div><div class="label">{k}</div></div>'
@@ -281,9 +281,11 @@ def _hero_html() -> str:
     <div class="hero">
         <div><span class="accent-bar"></span><h1 style="display:inline;vertical-align:middle;">EU AI Act Compliance Auditor</h1></div>
         <p class="subtitle">
-            An MCP-based environment where LLM agents audit AI systems for EU AI Act compliance.
-            8 scenarios from chatbot transparency to prohibited social scoring.
-            Parameter randomization on every reset prevents memorization &mdash; agents must learn the <em>audit process</em>, not specific answers.
         </p>
         <div class="stats">{stat_boxes}</div>
     </div>"""
@@ -291,12 +293,12 @@ def _hero_html() -> str:
 def _design_cards_html() -> str:
     cards_data = [
-        ("\u00A7", "Real Regulatory Scenarios", "Based on actual EU AI Act articles: prohibited social scoring (Art. 5), high-risk hiring (Annex III), deepfake transparency (Art. 50), medical device audits. Not toy problems."),
-        ("\u2699", "Full Audit Toolkit", "11 MCP tools mirror a compliance auditor's workflow: system overview, risk classification, documentation review, bias audit, oversight verification, transparency check, risk assessment, logging verification."),
-        ("\u25C8", "State-Graph Audit Process", "Each scenario is a directed graph with progress / no_effect / worsened transitions. Partial credit via BFS depth along the optimal path. Wrong audit steps waste your query budget."),
-        ("\u25C9", "6-Component Reward", "Classification accuracy (20%), finding completeness (25%), finding precision (15%), remediation quality (15%), methodology adherence (15%), efficiency (10%). Anti-exploit design."),
-        ("\u27F3", "Parameter Randomization", "Company names, deployment dates, regions, and system versions re-rolled on every reset. 65K+ unique instances per scenario. Agents must generalize."),
-        ("\u23F1", "Enforcement: Aug 2026", "EU AI Act enforcement begins August 2, 2026. Fines up to EUR 35M or 7% of global revenue. Every company deploying AI in Europe needs compliance auditing."),
     ]
     cards = ""
     for icon, title, desc in cards_data:
@@ -427,6 +429,71 @@ def _leaderboard_html() -> str:
     return f'<table class="lb">{header}{rows}</table>'
 def _architecture_html() -> str:
     reward_items = [
         ("Classification Accuracy", "20%", "Correct risk category (prohibited / high_risk / limited_risk / minimal_risk)"),
@@ -500,6 +567,59 @@ def _architecture_html() -> str:
     </div>"""
 def _try_it_html() -> str:
     return f"""
     <div style="display:grid;grid-template-columns:1fr 1fr;gap:16px;">
@@ -550,18 +670,18 @@ def _pg_reset(difficulty: str) -> Tuple:
 def _pg_call(sid: str, tool_name: str, args_str: str) -> Tuple:
     if not sid:
-        return "Click Reset first", {"error": "No session"}
     with _pg_lock:
         env = _pg_sessions.get(sid)
     if not env:
-        return "Session expired", {"error": "Session not found"}
     fn = env._tool_fns.get(tool_name)
     if not fn:
-        return f"Unknown tool: {tool_name}", {"error": "Unknown tool"}
     try:
         kwargs = json.loads(args_str) if args_str and args_str.strip() else {}
     except json.JSONDecodeError:
-        return "Invalid JSON", {"error": "Bad JSON in arguments"}
     try:
         result = fn(**kwargs)
         parsed = json.loads(result) if isinstance(result, str) else result
@@ -571,9 +691,32 @@ def _pg_call(sid: str, tool_name: str, args_str: str) -> Tuple:
         status = f"Queries: {queries}/100 | Findings: {len(env._findings_submitted)} | Done: {done}"
         if done:
             status += f" | REWARD: {reward:.4f}"
-        return status, parsed
     except Exception as e:
-        return f"Error: {e}", {"error": str(e)}
 # ── Build the Gradio app ────────────────────────────────────────
@@ -620,13 +763,13 @@ def create_landing_app() -> gr.Blocks:
             # ── TAB 4: Playground ──
             with gr.Tab("Playground"):
                 gr.HTML(f"<h2>Interactive Audit</h2>")
-                gr.HTML(f'<p style="color:{MUTED};margin-bottom:12px;">Reset to start a session, then call tools in sequence. The environment tracks your audit state and scores your methodology.</p>')
                 session_state = gr.State(value=None)
                 pg_status = gr.Textbox(label="Status", interactive=False, value="Click Reset to begin")
                 with gr.Row(elem_classes="pg-row"):
-                    pg_diff = gr.Dropdown(choices=["easy", "medium", "hard"], value="easy", label="Difficulty")
                     pg_reset_btn = gr.Button("Reset", variant="primary", min_width=120)
                 with gr.Row(elem_classes="pg-row"):
@@ -634,25 +777,37 @@ def create_landing_app() -> gr.Blocks:
                     pg_args = gr.Textbox(label="Arguments (JSON)", placeholder='{"risk_category": "high_risk"}')
                     pg_call_btn = gr.Button("Call Tool", variant="secondary", min_width=120)
-                pg_result = gr.JSON(label="Result")
                 def _on_reset(diff):
                     sid, status, obs = _pg_reset(diff)
-                    return sid, status, obs
                 def _on_call(sid, tool, args):
-                    status, result = _pg_call(sid, tool, args)
-                    return status, result
-                pg_reset_btn.click(_on_reset, [pg_diff], [session_state, pg_status, pg_result])
-                pg_call_btn.click(_on_call, [session_state, pg_tool, pg_args], [pg_status, pg_result])
             # ── TAB 5: Architecture ──
             with gr.Tab("Architecture"):
                 gr.HTML(f"<h2>Environment Architecture</h2>")
                 gr.HTML(_architecture_html())
-            # ── TAB 6: Try It ──
             with gr.Tab("Try It"):
                 gr.HTML(f"<h2>Run the baseline yourself</h2>")
                 gr.HTML(_try_it_html())

 def _hero_html() -> str:
     stats = [
+        ("FIXED SCENARIOS", "9"), ("PROCEDURAL", "\u221E"), ("MCP TOOLS", "11"),
+        ("REWARD COMPS", "6"), ("TESTS", "74"), ("EU DEADLINE", "Aug '26"),
     ]
     stat_boxes = "".join(
         f'<div class="stat"><div class="val">{v}</div><div class="label">{k}</div></div>'
     <div class="hero">
         <div><span class="accent-bar"></span><h1 style="display:inline;vertical-align:middle;">EU AI Act Compliance Auditor</h1></div>
         <p class="subtitle">
+            An MCP environment where LLM agents audit AI systems for EU AI Act compliance.
+            Tools return investigation-grade regulatory documents &mdash; statistical tables, documentation inventories,
+            operational procedures &mdash; that require genuine analysis to identify violations.
+            No pre-digested verdicts. The agent must reason about evidence across 9 scenarios spanning
+            prohibited social scoring, high-risk hiring bias, medical device compliance, and multi-system corporate audits.
         </p>
         <div class="stats">{stat_boxes}</div>
     </div>"""
 def _design_cards_html() -> str:
     cards_data = [
+        ("\u00A7", "Investigation-Grade Documents", "Tools return 30-70 line regulatory documents: Annex IV cross-reference tables, demographic callback rate matrices, operational procedure extracts. No labels like 'COMPLIANT' or 'FAILED' &mdash; the agent must analyze the evidence and reason about violations."),
+        ("\u2699", "Dynamic Audit State", "The environment responds to the agent's actions in real time. After findings are submitted, subsequent tool calls show audit progress. After classification, investigation tools reflect the current audit context. The environment feels alive, not static."),
+        ("\u25C8", "6 Unique Graph Topologies", "Scenarios use 6 distinct state graph topologies. Prohibited systems have short detection paths (5 steps). Full high-risk audits require 11 steps across all investigation tools. Wrong tool order triggers worsened transitions. BFS-based partial credit."),
+        ("\u25C9", "12 Anti-Gaming Tests", "Adversarial test suite proves the reward can't be gamed: skip investigation, spam findings, red herring bait, hallucinated findings, wrong classification isolation, fewer-than-optimal rushing, and 6 more exploit strategies. All proven ineffective."),
+        ("\u27F3", "Cross-Document Reasoning", "Findings require correlating evidence across multiple tools. Hiring bias: training data shows 23% callback gap (audit_training_data) while only 5% of rejections reviewed (verify_human_oversight). Social scoring: 'wellness app' framing (overview) vs. public service access impact (check_transparency)."),
+        ("\u221E", "Procedural Scenario Generator", "Beyond the 9 fixed scenarios, a seed-based procedural generator combines 5 system types &times; 16 violation templates &times; 5 red herrings to produce <strong>91K+ unique scenario combinations</strong>, with seed-based noise varying the numbers in every episode. Use <code>procedural_medium_42</code> as scenario ID &mdash; every seed creates a different audit, so memorizing answers does not help."),
     ]
     cards = ""
     for icon, title, desc in cards_data:
     return f'<table class="lb">{header}{rows}</table>'
+def _investigation_depth_html() -> str:
+    """Show the before/after of investigation-grade tool responses."""
+    return f"""
+    <div class="arch-box" style="margin-bottom:16px;">
+        <h3 style="color:{GOLD};">Investigation-Grade Tool Responses</h3>
+        <p style="color:{MUTED};font-size:13px;margin-bottom:14px;">Tools return realistic regulatory documents requiring analysis — not pre-digested answers.</p>
+        <div style="display:grid;grid-template-columns:1fr 1fr;gap:16px;">
+            <div>
+                <h4 style="color:{ROSE};font-size:11px;letter-spacing:0.05em;margin-bottom:8px;">TYPICAL ENV (pre-digested)</h4>
+                <div class="code-block" style="font-size:11px;color:{MUTED};border-color:{ROSE}40;">{{"bias_assessment": "FAILED",
+ "callback_rate_gap": "23%",
+ "article_14_compliance": "NON-COMPLIANT",
+ "human_oversight": "INSUFFICIENT"}}</div>
+            </div>
+            <div>
+                <h4 style="color:{EMERALD};font-size:11px;letter-spacing:0.05em;margin-bottom:8px;">THIS ENV (investigation-grade)</h4>
+                <div class="code-block" style="font-size:11px;color:{EMERALD};border-color:{EMERALD}40;">CALLBACK RATES BY DEMOGRAPHIC:
+  Group             Rate     vs Baseline
+  Male applicants   34.2%    (baseline)
+  Female applicants 26.3%    -23.1%
+  Eastern EU        27.4%    -19.9%
+REVIEW STATISTICS (Q4 2025):
+  Auto-rejected:    208,375  (60.0%)
+  QA sample:         10,419  (5.0%)
+  QA overrides:         312  (3.0%)</div>
+            </div>
+        </div>
+        <p style="color:{MUTED};font-size:12px;margin-top:10px;">The agent must identify the 23% callback disparity from the table, recognize that 95% of rejections have no human review,
+        and correlate these across documents to form findings. No verdict is pre-computed.</p>
+    </div>"""
+def _antigaming_html() -> str:
+    """Anti-gaming test showcase."""
+    tests = [
+        ("Skip Investigation", "Submit correct findings without reading documents", "methodology = 0.0"),
+        ("Spam Findings", "Flood 16 findings hoping to hit ground truth", "precision < 0.50"),
+        ("Red Herring Bait", "Submit red herrings as violations", "precision = 0.0, completeness = 0.0"),
+        ("Immediate Verify", "Call verify_compliance with empty inputs", "total < 0.05"),
+        ("Wrong Classification", "Everything correct except risk category", "loses &ge; 10% gap"),
+        ("Skip Remediation", "Find all violations but propose no fixes", "remediation = 0.0"),
+        ("Classify Before Overview", "Skip system understanding", "methodology < 0.50"),
+        ("Rush (Fewer Steps)", "Game efficiency by taking fewer steps", "efficiency penalized"),
+        ("Hallucinate Findings", "Submit plausible-sounding false findings", "completeness < 0.40"),
+        ("Wrong Class on Prohibited", "Call prohibited system high_risk", "classification = 0.40"),
+        ("Perfect Run Sanity", "Legitimate perfect audit", "total > 0.85"),
+        ("Bounds Check", "All scenarios x all inputs", "reward in (0.001, 0.999)"),
+    ]
+    rows = ""
+    for name, strategy, result in tests:
+        rows += f'<tr><td style="color:{TEXT};font-weight:500;">{name}</td><td style="color:{MUTED};font-size:12px;">{strategy}</td><td style="color:{ROSE};font-family:monospace;font-size:12px;">{result}</td></tr>'
+    return f"""
+    <div class="arch-box" style="margin-bottom:16px;">
+        <h3 style="color:{GOLD};">12 Anti-Gaming Tests</h3>
+        <p style="color:{MUTED};font-size:13px;margin-bottom:10px;">Adversarial test suite proving the reward function is robust against common exploits.</p>
+        <table style="width:100%;border-collapse:collapse;font-size:13px;">
+            <tr><th style="text-align:left;color:{MUTED};font-size:10px;padding:6px 8px;border-bottom:1px solid {BORDER};">EXPLOIT</th>
+                <th style="text-align:left;color:{MUTED};font-size:10px;padding:6px 8px;border-bottom:1px solid {BORDER};">STRATEGY</th>
+                <th style="text-align:left;color:{MUTED};font-size:10px;padding:6px 8px;border-bottom:1px solid {BORDER};">RESULT</th></tr>
+            {rows}
+        </table>
+    </div>"""
 def _architecture_html() -> str:
     reward_items = [
         ("Classification Accuracy", "20%", "Correct risk category (prohibited / high_risk / limited_risk / minimal_risk)"),
     </div>"""
+def _compliance_map_html() -> str:
+    """EU AI Act article coverage matrix — unique to compliance audit domain."""
+    mappings = [
+        ("Article 5", "Prohibited Practices", "classify_system", ["hard_social_scoring"]),
+        ("Article 6 + Annex III", "High-Risk Classification", "classify_system, assess_risk_management", ["medium_hiring", "medium_credit", "medium_medical", "hard_multi_system"]),
+        ("Article 9", "Risk Management", "assess_risk_management", ["medium_hiring", "medium_credit", "medium_medical"]),
+        ("Article 10", "Data Governance", "audit_training_data", ["medium_hiring", "medium_credit", "medium_medical", "hard_multi_system"]),
+        ("Article 12", "Record-Keeping", "check_logging", ["medium_hiring", "medium_medical", "hard_deepfake", "hard_multi_system"]),
+        ("Article 13", "Transparency (Deployers)", "check_transparency, check_documentation", ["medium_hiring", "medium_credit", "medium_medical"]),
+        ("Article 14", "Human Oversight", "verify_human_oversight", ["medium_hiring", "medium_credit", "medium_medical", "hard_multi_system"]),
+        ("Article 50", "Transparency (All AI)", "check_transparency", ["easy_chatbot", "hard_deepfake"]),
+        ("Annex IV", "Technical Documentation", "check_documentation", ["medium_hiring", "medium_credit", "medium_medical", "hard_deepfake"]),
+        ("MDR + AI Act", "Medical Device Dual-Regulation", "check_documentation, assess_risk_management", ["medium_medical"]),
+    ]
+    rows = ""
+    for article, title, tools_str, scenarios in mappings:
+        tool_badges = " ".join(
+            f'<span style="background:{AMBER}15;color:{AMBER};padding:2px 8px;border-radius:4px;font-size:10px;font-family:monospace;">{t.strip()}</span>'
+            for t in tools_str.split(",")
+        )
+        scenario_badges = " ".join(
+            f'<span style="background:{BLUE}15;color:{BLUE};padding:2px 6px;border-radius:4px;font-size:10px;">{s}</span>'
+            for s in scenarios
+        )
+        rows += f'''<tr>
+            <td style="padding:10px 8px;border-bottom:1px solid {BORDER}10;white-space:nowrap;">
+                <strong style="color:{GOLD};">{article}</strong><br/>
+                <span style="color:{MUTED};font-size:11px;">{title}</span>
+            </td>
+            <td style="padding:10px 8px;border-bottom:1px solid {BORDER}10;">{tool_badges}</td>
+            <td style="padding:10px 8px;border-bottom:1px solid {BORDER}10;">{scenario_badges}</td>
+        </tr>'''
+    return f"""<table style="width:100%;border-collapse:collapse;">
+        <tr>
+            <th style="text-align:left;color:{MUTED};font-size:10px;letter-spacing:0.06em;padding:8px;border-bottom:1px solid {BORDER};">ARTICLE</th>
+            <th style="text-align:left;color:{MUTED};font-size:10px;letter-spacing:0.06em;padding:8px;border-bottom:1px solid {BORDER};">INVESTIGATION TOOLS</th>
+            <th style="text-align:left;color:{MUTED};font-size:10px;letter-spacing:0.06em;padding:8px;border-bottom:1px solid {BORDER};">SCENARIOS</th>
+        </tr>
+        {rows}
+    </table>
+    <div style="margin-top:16px;padding:16px;background:{CARD};border:1px solid {BORDER};border-radius:10px;">
+        <h4 style="color:{GOLD};font-size:13px;margin-bottom:8px;">Cross-Document Reasoning Requirements</h4>
+        <div style="color:{MUTED};font-size:12px;line-height:1.8;">
+            <strong style="color:{TEXT};">Hiring Bias (5 findings):</strong> audit_training_data reveals 23% callback gap &rarr; verify_human_oversight shows only 5% review rate &rarr; check_documentation confirms missing FRIA &rarr; agent must connect all three<br/>
+            <strong style="color:{TEXT};">Social Scoring (5 findings):</strong> get_system_overview frames as "wellness app" &rarr; check_transparency reveals service access impact &rarr; verify_human_oversight shows municipal integration &rarr; agent must recognize Art. 5 violation<br/>
+            <strong style="color:{TEXT};">Multi-System (6 findings):</strong> audit_training_data reveals cross-system data flows &rarr; check_documentation shows missing combined DPIA &rarr; verify_human_oversight reveals no unified oversight &rarr; compound risk emerges across documents<br/>
+            <strong style="color:{TEXT};">Medical Triage (4 findings):</strong> audit_training_data shows age-bias in 75+ cohort &rarr; check_documentation confirms retrospective-only validation &rarr; check_logging reveals no real-time monitoring &rarr; safety gap pattern
+        </div>
+    </div>"""
 def _try_it_html() -> str:
     return f"""
     <div style="display:grid;grid-template-columns:1fr 1fr;gap:16px;">
 def _pg_call(sid: str, tool_name: str, args_str: str) -> Tuple:
     if not sid:
+        return "Click Reset first", "(no session)", {"error": "No session"}
     with _pg_lock:
         env = _pg_sessions.get(sid)
     if not env:
+        return "Session expired", "(expired)", {"error": "Session not found"}
     fn = env._tool_fns.get(tool_name)
     if not fn:
+        return f"Unknown tool: {tool_name}", "(error)", {"error": "Unknown tool"}
     try:
         kwargs = json.loads(args_str) if args_str and args_str.strip() else {}
     except json.JSONDecodeError:
+        return "Invalid JSON", "(error)", {"error": "Bad JSON in arguments"}
     try:
         result = fn(**kwargs)
         parsed = json.loads(result) if isinstance(result, str) else result
         status = f"Queries: {queries}/100 | Findings: {len(env._findings_submitted)} | Done: {done}"
         if done:
             status += f" | REWARD: {reward:.4f}"
+        # Extract document content for rich display
+        doc_content = parsed.get("content", "")
+        if not doc_content and "audit_summary" in parsed:
+            # Verify compliance result — format nicely
+            summary = parsed["audit_summary"]
+            lines = [f"AUDIT COMPLETE — Reward: {parsed.get('reward', 0):.4f}"]
+            lines.append(f"\nClassification: {summary['classification']['submitted']} "
+                        f"({'correct' if summary['classification']['match'] == 'exact' else summary['classification']['match']})")
+            lines.append(f"Correct answer: {summary['classification']['correct']}")
+            lines.append(f"\nFindings: {summary['findings']['matched']}/{summary['findings']['ground_truth_total']} matched")
+            if summary["findings"]["missed"]:
+                lines.append("Missed:")
+                for m in summary["findings"]["missed"]:
+                    lines.append(f"  - {m}")
+            lines.append(f"\nAreas investigated: {', '.join(summary.get('areas_investigated', []))}")
+            lines.append("\nReward breakdown:")
+            for k, v in parsed.get("reward_breakdown", {}).items():
+                lines.append(f"  {k}: {v}")
+            doc_content = "\n".join(lines)
+        elif not doc_content:
+            doc_content = json.dumps(parsed, indent=2)
+        return status, doc_content, parsed
     except Exception as e:
+        return f"Error: {e}", str(e), {"error": str(e)}
 # ── Build the Gradio app ────────────────────────────────────────
             # ── TAB 4: Playground ──
             with gr.Tab("Playground"):
                 gr.HTML(f"<h2>Interactive Audit</h2>")
+                gr.HTML(f'<p style="color:{MUTED};margin-bottom:12px;">Reset to start a session, then call tools in sequence. The environment tracks your audit state and scores your methodology. Documents render below — this is what the agent sees.</p>')
                 session_state = gr.State(value=None)
                 pg_status = gr.Textbox(label="Status", interactive=False, value="Click Reset to begin")
                 with gr.Row(elem_classes="pg-row"):
+                    pg_diff = gr.Dropdown(choices=["easy", "medium", "hard"], value="medium", label="Difficulty")
                     pg_reset_btn = gr.Button("Reset", variant="primary", min_width=120)
                 with gr.Row(elem_classes="pg-row"):
                     pg_args = gr.Textbox(label="Arguments (JSON)", placeholder='{"risk_category": "high_risk"}')
                     pg_call_btn = gr.Button("Call Tool", variant="secondary", min_width=120)
+                pg_doc = gr.Textbox(label="Document Content (what the agent sees)", lines=20, interactive=False)
+                with gr.Accordion("Raw JSON Response", open=False):
+                    pg_result = gr.JSON(label="Raw")
                 def _on_reset(diff):
                     sid, status, obs = _pg_reset(diff)
+                    initial_doc = obs.get("message", "Session started. Call get_system_overview to begin.")
+                    return sid, status, initial_doc, obs
                 def _on_call(sid, tool, args):
+                    status, doc_content, result = _pg_call(sid, tool, args)
+                    return status, doc_content, result
+                pg_reset_btn.click(_on_reset, [pg_diff], [session_state, pg_status, pg_doc, pg_result])
+                pg_call_btn.click(_on_call, [session_state, pg_tool, pg_args], [pg_status, pg_doc, pg_result])
             # ── TAB 5: Architecture ──
             with gr.Tab("Architecture"):
                 gr.HTML(f"<h2>Environment Architecture</h2>")
+                gr.HTML(_investigation_depth_html())
+                gr.HTML(_antigaming_html())
                 gr.HTML(_architecture_html())
+            # ── TAB 6: Compliance Map ──
+            with gr.Tab("Compliance Map"):
+                gr.HTML(f"<h2>EU AI Act Article Coverage</h2>")
+                gr.HTML(f'<p style="color:{MUTED};margin-bottom:16px;">How each investigation tool maps to EU AI Act provisions, and which scenarios test each article.</p>')
+                gr.HTML(_compliance_map_html())
+            # ── TAB 7: Try It ──
             with gr.Tab("Try It"):
                 gr.HTML(f"<h2>Run the baseline yourself</h2>")
                 gr.HTML(_try_it_html())
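
The figures in the investigation-grade sample rendered by `_investigation_depth_html` can be checked by hand. The sketch below assumes the "vs Baseline" column is a relative difference against the male callback rate, that the QA sample percentage is taken over auto-rejected applications, and that the override percentage is taken over the QA sample. The helper name `relative_gap` is illustrative and does not appear in the repository.

```python
def relative_gap(rate: float, baseline: float) -> float:
    """Relative difference of a group's rate versus the baseline, in percent."""
    return (rate - baseline) / baseline * 100.0


# Values copied from the sample table; the interpretation is an assumption.
baseline = 34.2  # male callback rate (%)
female_gap = relative_gap(26.3, baseline)
eastern_eu_gap = relative_gap(27.4, baseline)

auto_rejected = 208_375
qa_sample = 10_419
qa_overrides = 312

reviewed_share = qa_sample / auto_rejected * 100.0   # share of rejections reviewed
unreviewed_share = 100.0 - reviewed_share            # share with no human review
override_share = qa_overrides / qa_sample * 100.0    # overrides within the QA sample

print(f"female gap:       {female_gap:.1f}%")
print(f"eastern EU gap:   {eastern_eu_gap:.1f}%")
print(f"reviewed share:   {reviewed_share:.1f}%")
print(f"unreviewed share: {unreviewed_share:.1f}%")
print(f"override share:   {override_share:.1f}%")
```

Under these assumptions the table is internally consistent, which is the kind of arithmetic an agent has to do itself because no verdict is pre-computed.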

tests/test_difficulty_calibration.py ADDED

	@@ -0,0 +1,106 @@

+"""Difficulty calibration tests.
+Checks the difficulty tier design: hard scenarios carry more ground-truth
+findings than easy ones, a naive high_risk classification loses points on
+the prohibited scenario, and medium scenarios score differently from each other.
+"""
+import json
+from server.environment import ComplianceAuditorEnvironment
+from scenarios.registry import SCENARIO_LIST
+def _naive_audit(scenario_id: str) -> float:
+    """Run a naive audit strategy — call all tools in order, submit generic findings."""
+    env = ComplianceAuditorEnvironment()
+    env.reset(seed=42, scenario_id=scenario_id)
+    # Naive strategy: call everything, classify as high_risk, submit generic findings
+    env._tool_fns["get_system_overview"]()
+    env._tool_fns["classify_system"](risk_category="high_risk")
+    env._tool_fns["check_documentation"]()
+    env._tool_fns["audit_training_data"]()
+    env._tool_fns["verify_human_oversight"]()
+    env._tool_fns["check_transparency"]()
+    env._tool_fns["assess_risk_management"]()
+    env._tool_fns["check_logging"]()
+    env._tool_fns["submit_finding"](finding="documentation_gaps", severity="high")
+    env._tool_fns["submit_finding"](finding="bias_concern", severity="high")
+    env._tool_fns["submit_finding"](finding="insufficient_oversight", severity="medium")
+    env._tool_fns["recommend_fix"](finding="gaps", remediation="improve_documentation")
+    env._tool_fns["recommend_fix"](finding="bias", remediation="conduct_bias_audit")
+    result = json.loads(env._tool_fns["verify_compliance"](
+        risk_classification="high_risk",
+        overall_assessment="Multiple compliance gaps identified",
+        key_findings_summary="Documentation, bias, and oversight issues"
+    ))
+    return result["reward"]
+def test_hard_scenarios_have_more_findings_than_easy():
+    """Hard scenarios require identifying more ground truth findings.
+    This validates difficulty calibration — easy scenarios have 1-2 findings
+    while hard scenarios have 5-6, making them harder to get perfect on.
+    """
+    from scenarios.registry import get_scenario
+    easy_findings = []
+    hard_findings = []
+    for sc_info in SCENARIO_LIST:
+        sc = get_scenario(sc_info["id"], 42)
+        count = len(sc.ground_truth_findings)
+        if sc_info["difficulty"] == "easy":
+            easy_findings.append(count)
+        elif sc_info["difficulty"] == "hard":
+            hard_findings.append(count)
+    avg_easy = sum(easy_findings) / len(easy_findings)
+    avg_hard = sum(hard_findings) / len(hard_findings)
+    assert avg_hard > avg_easy * 2, \
+        f"Hard scenarios ({avg_hard:.1f} avg findings) should have more than 2x the findings of easy ({avg_easy:.1f})"
+def test_prohibited_scenario_punishes_wrong_classification():
+    """Classifying a prohibited system as high_risk should lose significant points.
+    The prohibited scenario is the hardest because the agent must see through
+    the deployer's framing to correctly identify it as prohibited.
+    """
+    # Naive agent classifies as high_risk (wrong for prohibited)
+    prohibited_score = _naive_audit("hard_social_scoring_prohibited_001")
+    # Perfect classification on the same scenario
+    env = ComplianceAuditorEnvironment()
+    env.reset(seed=42, scenario_id="hard_social_scoring_prohibited_001")
+    env._tool_fns["get_system_overview"]()
+    env._tool_fns["classify_system"](risk_category="prohibited")
+    env._tool_fns["submit_finding"](finding="prohibited_social_scoring_system")
+    env._tool_fns["recommend_fix"](finding="prohibited", remediation="immediate_system_shutdown")
+    result = json.loads(env._tool_fns["verify_compliance"](
+        risk_classification="prohibited",
+        overall_assessment="Prohibited system",
+        key_findings_summary="Social scoring"
+    ))
+    correct_score = result["reward"]
+    assert correct_score > prohibited_score, \
+        f"Correct prohibited ({correct_score:.3f}) should beat naive high_risk ({prohibited_score:.3f})"
+def test_medium_scenarios_spread_across_difficulty():
+    """Medium scenarios should produce different scores with the naive agent,
+    showing that they test different compliance challenges.
+    """
+    medium_scores = {}
+    for sc_info in SCENARIO_LIST:
+        if sc_info["difficulty"] == "medium":
+            medium_scores[sc_info["id"]] = _naive_audit(sc_info["id"])
+    scores = list(medium_scores.values())
+    spread = max(scores) - min(scores)
+    assert spread > 0.02, \
+        f"Medium scenarios should have score variance. Spread: {spread:.3f}, scores: {medium_scores}"
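
The findings-count check in `test_hard_scenarios_have_more_findings_than_easy` reduces to a grouped average. Here is a minimal sketch using made-up finding counts. The sample data and the helper `average_by_difficulty` are hypothetical and are not taken from `scenarios.registry`.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def average_by_difficulty(scenarios: List[dict]) -> Dict[str, float]:
    """Average the number of ground-truth findings per difficulty tier."""
    grouped = defaultdict(list)
    for sc in scenarios:
        grouped[sc["difficulty"]].append(sc["findings"])
    return {tier: mean(counts) for tier, counts in grouped.items()}


# Hypothetical scenario metadata, for illustration only.
sample = [
    {"id": "easy_a", "difficulty": "easy", "findings": 1},
    {"id": "easy_b", "difficulty": "easy", "findings": 2},
    {"id": "hard_a", "difficulty": "hard", "findings": 5},
    {"id": "hard_b", "difficulty": "hard", "findings": 6},
]

averages = average_by_difficulty(sample)
# Same threshold as the test: the hard tier must exceed twice the easy tier.
calibrated = averages["hard"] > averages["easy"] * 2
print(averages, calibrated)
```

Grouping by tier rather than hard-coding tier sizes keeps the check valid when scenarios are added to or removed from the registry.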

tests/test_evidence_chain.py ADDED

	@@ -0,0 +1,156 @@

+"""Evidence chain validation tests.
+Checks that findings are backed by actual investigation: the environment
+warns when the agent has not read the documents a finding relies on.
+Tests:
+  1. Finding without investigation → warning
+  2. Finding after investigation → no warning
+  3. Oversight and transparency findings warn when their area was skipped
+  4. Finding with no recognized keywords → no warning
+  5. Verify_compliance shows missed findings
+  6. Verify_compliance shows classification accuracy
+  7. Verify_compliance shows areas investigated
+"""
+import json
+from server.environment import ComplianceAuditorEnvironment
+def test_finding_without_investigation_warns():
+    """Submitting a bias finding without auditing training data triggers a warning."""
+    env = ComplianceAuditorEnvironment()
+    env.reset(seed=42, scenario_id="medium_hiring_bias_001")
+    result = json.loads(env._tool_fns["submit_finding"](
+        finding="gender_bias_in_training_data", severity="critical"
+    ))
+    assert "evidence_warnings" in result, "Should warn about missing investigation"
+    assert any("training_data" in w for w in result["evidence_warnings"]), \
+        "Warning should mention training_data area"
+def test_finding_after_investigation_no_warning():
+    """Submitting a bias finding after auditing training data has no warning."""
+    env = ComplianceAuditorEnvironment()
+    env.reset(seed=42, scenario_id="medium_hiring_bias_001")
+    env._tool_fns["get_system_overview"]()
+    env._tool_fns["audit_training_data"]()
+    result = json.loads(env._tool_fns["submit_finding"](
+        finding="gender_bias_in_training_data", severity="critical"
+    ))
+    assert "evidence_warnings" not in result, \
+        "No warning when area was investigated"
+def test_oversight_finding_warns_without_oversight_check():
+    """Submitting an oversight finding without verify_human_oversight warns."""
+    env = ComplianceAuditorEnvironment()
+    env.reset(seed=42, scenario_id="medium_hiring_bias_001")
+    result = json.loads(env._tool_fns["submit_finding"](
+        finding="insufficient_human_oversight", severity="high"
+    ))
+    assert "evidence_warnings" in result
+    assert any("oversight" in w for w in result["evidence_warnings"])
+def test_transparency_finding_warns_without_transparency_check():
+    """Submitting a transparency finding without check_transparency warns."""
+    env = ComplianceAuditorEnvironment()
+    env.reset(seed=42, scenario_id="easy_chatbot_transparency_001")
+    result = json.loads(env._tool_fns["submit_finding"](
+        finding="missing_ai_disclosure_transparency", severity="high"
+    ))
+    assert "evidence_warnings" in result
+    assert any("transparency" in w for w in result["evidence_warnings"])
+def test_generic_finding_no_keyword_match_no_warning():
+    """A finding with no recognizable keywords produces no evidence warning."""
+    env = ComplianceAuditorEnvironment()
+    env.reset(seed=42, scenario_id="medium_hiring_bias_001")
+    result = json.loads(env._tool_fns["submit_finding"](
+        finding="general_compliance_concern", severity="medium"
+    ))
+    assert "evidence_warnings" not in result, \
+        "Generic findings without keyword matches should not warn"
+def test_verify_shows_missed_findings():
+    """Verify_compliance response shows which ground truth findings were missed."""
+    env = ComplianceAuditorEnvironment()
+    env.reset(seed=42, scenario_id="medium_hiring_bias_001")
+    env._tool_fns["get_system_overview"]()
+    env._tool_fns["classify_system"](risk_category="high_risk")
+    env._tool_fns["submit_finding"](finding="gender_bias_in_technical_screening")
+    result = json.loads(env._tool_fns["verify_compliance"](
+        risk_classification="high_risk",
+        overall_assessment="Bias found",
+        key_findings_summary="Gender bias"
+    ))
+    summary = result["audit_summary"]
+    assert summary["findings"]["matched"] == 1
+    assert summary["findings"]["ground_truth_total"] == 5
+    assert len(summary["findings"]["missed"]) == 4
+    assert "insufficient_human_oversight" in summary["findings"]["missed"]
+def test_verify_shows_classification_accuracy():
+    """Verify_compliance response shows classification match status."""
+    env = ComplianceAuditorEnvironment()
+    env.reset(seed=42, scenario_id="hard_social_scoring_prohibited_001")
+    env._tool_fns["get_system_overview"]()
+    # Wrong classification
+    result = json.loads(env._tool_fns["verify_compliance"](
+        risk_classification="high_risk",
+        overall_assessment="High risk system",
+        key_findings_summary="Various issues"
+    ))
+    assert result["audit_summary"]["classification"]["correct"] == "prohibited"
+    assert result["audit_summary"]["classification"]["match"] == "partial"
+    # Correct classification in a new episode
+    env2 = ComplianceAuditorEnvironment()
+    env2.reset(seed=42, scenario_id="hard_social_scoring_prohibited_001")
+    env2._tool_fns["get_system_overview"]()
+    result2 = json.loads(env2._tool_fns["verify_compliance"](
+        risk_classification="prohibited",
+        overall_assessment="Prohibited system",
+        key_findings_summary="Social scoring"
+    ))
+    assert result2["audit_summary"]["classification"]["match"] == "exact"
+def test_verify_shows_areas_investigated():
+    """Verify response shows which investigation areas were actually explored."""
+    env = ComplianceAuditorEnvironment()
+    env.reset(seed=42, scenario_id="medium_hiring_bias_001")
+    env._tool_fns["get_system_overview"]()
+    env._tool_fns["classify_system"](risk_category="high_risk")
+    env._tool_fns["check_documentation"]()
+    env._tool_fns["audit_training_data"]()
+    result = json.loads(env._tool_fns["verify_compliance"](
+        risk_classification="high_risk",
+        overall_assessment="Partial audit",
+        key_findings_summary="Documentation and data issues"
+    ))
+    areas = result["audit_summary"]["areas_investigated"]
+    assert "overview" in areas
+    assert "documentation" in areas
+    assert "training_data" in areas
+    # These were NOT investigated
+    assert "oversight" not in areas
+    assert "transparency" not in areas
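
One way the keyword-to-area check exercised by these tests could work is sketched below. The `KEYWORD_AREAS` map, the `evidence_warnings` helper, and the warning wording are assumptions for illustration. The actual logic lives in `server/environment.py`, which is not shown in this excerpt, and may differ.

```python
from typing import Dict, List, Set

# Hypothetical keyword -> investigation-area map (not the repository's own).
KEYWORD_AREAS: Dict[str, str] = {
    "bias": "training_data",
    "training_data": "training_data",
    "oversight": "oversight",
    "transparency": "transparency",
    "disclosure": "transparency",
}


def evidence_warnings(finding: str, investigated: Set[str]) -> List[str]:
    """Return one warning per area the finding relies on but the agent skipped."""
    text = finding.lower()
    required = {area for keyword, area in KEYWORD_AREAS.items() if keyword in text}
    return [
        f"Finding references '{area}' but that area has not been investigated"
        for area in sorted(required - investigated)
    ]


# No investigation yet: the bias finding is flagged.
unsupported = evidence_warnings("gender_bias_in_training_data", set())
# After auditing training data: no warning.
supported = evidence_warnings(
    "gender_bias_in_training_data", {"overview", "training_data"}
)
# No recognized keyword: nothing to check, so no warning.
generic = evidence_warnings("general_compliance_concern", set())
print(unsupported, supported, generic)
```

A set difference keeps the check order-independent: it only matters whether an area was visited before the finding was submitted, not when.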

tests/test_investigation_depth.py ADDED

	@@ -0,0 +1,236 @@

+"""Investigation depth tests.
+Verifies that tool responses contain investigation-grade content requiring
+genuine analysis — not pre-digested verdicts.
+Tests prove:
+  1. Documents contain statistical evidence the agent must interpret
+  2. Red herrings are embedded naturally in the evidence
+  3. Cross-document reasoning is required to form findings
+  4. Document length scales with difficulty tier
+  5. Randomization changes parameterized content but not violations
+  6. Dynamic audit progress appears after findings
+"""
+import json
+from server.environment import ComplianceAuditorEnvironment
+from scenarios.registry import get_scenario, SCENARIO_LIST
+# ── Test 1: No pre-digested verdicts in tool responses ────────────
+def test_no_predigested_verdicts_in_documents():
+    """Investigation documents must NOT contain explicit compliance verdicts.
+    Labels like 'NON-COMPLIANT', 'FAILED', 'VIOLATION' hand the answer
+    to the agent. Documents should contain evidence, not conclusions.
+    """
+    env = ComplianceAuditorEnvironment()
+    env.reset(seed=42, scenario_id="medium_hiring_bias_001")
+    predigested_labels = [
+        "NON-COMPLIANT", "NON_COMPLIANT", "FAILED", "VIOLATION FOUND",
+        "COMPLIANCE VIOLATION", "DOES NOT COMPLY",
+    ]
+    for tool_name in ["check_documentation", "audit_training_data",
+                      "verify_human_oversight", "check_transparency",
+                      "assess_risk_management", "check_logging"]:
+        result = json.loads(env._tool_fns[tool_name]())
+        content = result.get("content", "")
+        for label in predigested_labels:
+            assert label not in content, \
+                f"Pre-digested verdict '{label}' found in {tool_name} response"
+# ── Test 2: Statistical evidence present in training data audit ───
+def test_training_data_contains_statistical_tables():
+    """Training data audit must contain numerical evidence the agent
+    must interpret to identify bias — not just 'bias found' labels.
+    """
+    env = ComplianceAuditorEnvironment()
+    env.reset(seed=42, scenario_id="medium_hiring_bias_001")
+    result = json.loads(env._tool_fns["audit_training_data"]())
+    content = result["content"]
+    # Must contain actual numbers (callback rates, percentages)
+    # Note: exact values vary due to seed-based noise injection
+    import re
+    pct_matches = re.findall(r'\d{1,2}\.\d%', content)
+    assert len(pct_matches) >= 4, \
+        f"Training data should contain multiple percentage figures, found {len(pct_matches)}"
+    # Must contain demographic categories
+    assert "Male" in content or "Female" in content, \
+        "Training data should reference demographic groups"
+    # Must NOT contain pre-computed verdict
+    assert "FAILED" not in content, \
+        "Training data should not contain pre-digested 'FAILED' verdict"
+# ── Test 3: Red herrings embedded naturally ───────────────────────
+def test_red_herrings_in_evidence():
+    """Red herring content should appear naturally in investigation documents,
+    not as separate labeled items the agent can trivially filter.
+    """
+    env = ComplianceAuditorEnvironment()
+    env.reset(seed=42, scenario_id="medium_hiring_bias_001")
+    # The hiring scenario has red herrings: "prohibited_social_scoring" and "biometric_processing"
+    # The training data document should mention the separate fraud detection system
+    # (which is compliant and unrelated) as a natural red herring
+    result = json.loads(env._tool_fns["audit_training_data"]())
+    content = result["content"].lower()
+    assert "fraud" in content, \
+        "Red herring (compliant fraud system) should appear naturally in training data doc"
+# ── Test 4: Document length scales with difficulty ────────────────
+def test_document_length_scales_with_difficulty():
+    """Hard scenarios should have longer, more complex documents than easy ones."""
+    easy_lengths = []
+    hard_lengths = []
+    for sc_info in SCENARIO_LIST:
+        sc = get_scenario(sc_info["id"], 42)
+        total_len = sum(len(getattr(sc, field, "")) for field in [
+            "documentation_data", "training_data_info", "oversight_info",
+            "transparency_info", "risk_assessment_info", "logging_info",
+        ])
+        if sc_info["difficulty"] == "easy":
+            easy_lengths.append(total_len)
+        elif sc_info["difficulty"] == "hard":
+            hard_lengths.append(total_len)
+    # Average over the scenarios actually registered, not hard-coded tier sizes
+    avg_easy = sum(easy_lengths) / len(easy_lengths)
+    avg_hard = sum(hard_lengths) / len(hard_lengths)
+    assert avg_hard > avg_easy * 1.3, \
+        f"Hard scenarios ({avg_hard:.0f} chars avg) should be significantly larger than easy ({avg_easy:.0f} chars avg)"
+# ── Test 5: Randomization changes params but not violations ───────
+def test_randomization_preserves_violations():
+    """Different seeds should change surface parameters (company, date) but
+    the same ground truth findings should remain discoverable.
+    """
+    sc1 = get_scenario("medium_hiring_bias_001", seed=42)
+    sc2 = get_scenario("medium_hiring_bias_001", seed=12345)
+    # Ground truth findings must be identical
+    assert sc1.ground_truth_findings == sc2.ground_truth_findings
+    assert sc1.correct_classification == sc2.correct_classification
+    # At least some parameters must differ across seeds
+    params_differ = (
+        sc1.get_param("company") != sc2.get_param("company")
+        or sc1.get_param("version") != sc2.get_param("version")
+        or sc1.get_param("date") != sc2.get_param("date")
+        or sc1.get_param("usercount") != sc2.get_param("usercount")
+    )
+    assert params_differ, "Randomized parameters should differ across seeds"
+# ── Test 6: Randomization appears in rendered documents ───────────
+def test_randomization_in_rendered_documents():
+    """Rendered documents should contain randomized parameters, not placeholders."""
+    env = ComplianceAuditorEnvironment()
+    env.reset(seed=42, scenario_id="medium_hiring_bias_001")
+    result = json.loads(env._tool_fns["get_system_overview"]())
+    content = result["content"]
+    # Should NOT contain raw placeholders
+    assert "__COMPANY__" not in content, "Placeholder __COMPANY__ should be replaced"
+    assert "__VERSION__" not in content, "Placeholder __VERSION__ should be replaced"
+    # Should contain actual randomized values
+    assert "v" in content.lower(), "Should contain version number"
+# ── Test 7: Dynamic audit progress appears after findings ─────────
+def test_dynamic_audit_progress():
+    """After submitting findings, subsequent tool calls should include
+    audit progress section showing what's been found.
+    """
+    env = ComplianceAuditorEnvironment()
+    env.reset(seed=42, scenario_id="medium_hiring_bias_001")
+    # Before any findings — no progress section
+    r1 = json.loads(env._tool_fns["get_system_overview"]())
+    assert "AUDIT PROGRESS" not in r1["content"], \
+        "No progress section before any actions"
+    # After classification and a finding
+    env._tool_fns["classify_system"](risk_category="high_risk")
+    env._tool_fns["check_documentation"]()
+    env._tool_fns["submit_finding"](finding="test_finding", severity="high")
+    r2 = json.loads(env._tool_fns["audit_training_data"]())
+    assert "AUDIT PROGRESS" in r2["content"], \
+        "Progress section should appear after findings submitted"
+    assert "test_finding" in r2["content"], \
+        "Progress should show submitted findings"
+    assert "High Risk" in r2["content"], \
+        "Progress should show submitted classification"
+# ── Test 8: Scenarios span several distinct graph topologies ──────
+def test_graph_diversity():
+    """At least 4 distinct graph topologies across all scenarios."""
+    sigs = set()
+    for sc_info in SCENARIO_LIST:
+        sc = get_scenario(sc_info["id"], 42)
+        sig = tuple(sorted(
+            (t.from_state, t.to_state, t.tool_name, t.outcome)
+            for t in sc.graph.transitions
+        ))
+        sigs.add(sig)
+    assert len(sigs) >= 4, f"Only {len(sigs)} unique graph topologies — need at least 4"
+# ── Test 9: Prohibited scenario does not reveal classification ────
+def test_prohibited_scenario_concealment():
+    """The prohibited system's overview should NOT reveal it's prohibited.
+    The deployer frames it as a wellness tool — agent must discover the truth.
+    """
+    env = ComplianceAuditorEnvironment()
+    env.reset(seed=42, scenario_id="hard_social_scoring_prohibited_001")
+    result = json.loads(env._tool_fns["get_system_overview"]())
+    content = result["content"].lower()
+    assert "prohibited" not in content, \
+        "Overview should not reveal the system is prohibited — that's what the agent must discover"
+    assert "wellness" in content or "civic" in content or "engagement" in content, \
+        "Overview should use the deployer's framing (wellness/civic engagement)"
+# ── Test 10: All scenarios produce valid tool responses ───────────
+def test_all_scenarios_produce_rich_responses():
+    """Every scenario's investigation tools must return non-trivial content."""
+    for sc_info in SCENARIO_LIST:
+        env = ComplianceAuditorEnvironment()
+        env.reset(seed=42, scenario_id=sc_info["id"])
+        env._tool_fns["get_system_overview"]()
+        env._tool_fns["classify_system"](risk_category="high_risk")
+        for tool in ["check_documentation", "audit_training_data",
+                     "verify_human_oversight", "check_transparency",
+                     "assess_risk_management", "check_logging"]:
+            result = json.loads(env._tool_fns[tool]())
+            content = result.get("content", "")
+            assert len(content) > 100, \
+                f"{sc_info['id']}/{tool}: document too short ({len(content)} chars)"
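The graph-diversity check above relies on an order-independent fingerprint of each state graph. A minimal sketch of that idea follows; the `Transition` dataclass and the state names are hypothetical stand-ins, since the real transition type lives in the environment and is not shown in this diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    # Hypothetical stand-in for the environment's transition records.
    from_state: str
    to_state: str
    tool_name: str
    outcome: str


def topology_signature(transitions):
    """Return a fingerprint that ignores the order transitions are listed in."""
    return tuple(sorted(
        (t.from_state, t.to_state, t.tool_name, t.outcome)
        for t in transitions
    ))


graph_a = [
    Transition("start", "overview", "get_system_overview", "ok"),
    Transition("overview", "classified", "classify_system", "ok"),
]
# Same edges in a different order: should count as the same topology.
graph_a_shuffled = list(reversed(graph_a))
# One extra edge: should count as a different topology.
graph_b = graph_a + [
    Transition("classified", "docs", "check_documentation", "ok"),
]

signatures = {
    topology_signature(g) for g in (graph_a, graph_a_shuffled, graph_b)
}
distinct_topologies = len(signatures)
```

Sorting before building the tuple is what makes two graphs with identical edges compare equal regardless of declaration order, so the set size counts genuinely different topologies.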

tests/test_procedural.py ADDED

	@@ -0,0 +1,125 @@

+"""Procedural scenario generator tests.
+Proves the generator produces valid, diverse, and unique scenarios
+from any seed. This is the feature that makes the environment INFINITE.
+"""
+import json
+import pytest
+from scenarios.procedural import generate_procedural_scenario, SYSTEM_TEMPLATES, VIOLATION_TEMPLATES
+from scenarios.registry import get_scenario
+from server.environment import ComplianceAuditorEnvironment
+def test_generates_valid_scenario():
+    """Basic generation produces a well-formed AuditScenario."""
+    sc = generate_procedural_scenario(seed=42, difficulty="medium")
+    assert sc.scenario_id.startswith("procedural_")
+    assert sc.correct_classification in ("prohibited", "high_risk", "limited_risk", "minimal_risk")
+    assert len(sc.ground_truth_findings) >= 1
+    assert len(sc.graph.nodes) >= 6
+    assert sc.graph.optimal_path_length() >= 3
+def test_difficulty_controls_violation_count():
+    """Easy has fewer violations than hard."""
+    easy_counts = []
+    hard_counts = []
+    for seed in range(20):
+        easy = generate_procedural_scenario(seed, "easy")
+        hard = generate_procedural_scenario(seed, "hard")
+        easy_counts.append(len(easy.ground_truth_findings))
+        hard_counts.append(len(hard.ground_truth_findings))
+    assert sum(easy_counts) / len(easy_counts) < sum(hard_counts) / len(hard_counts), \
+        "Hard scenarios should have more violations on average"
+def test_different_seeds_produce_different_scenarios():
+    """50 seeds should yield at least 10 distinct (system, findings) combinations."""
+    scenarios = {}
+    for seed in range(50):
+        sc = generate_procedural_scenario(seed, "medium")
+        key = (sc.system_name, tuple(sc.ground_truth_findings))
+        scenarios[seed] = key
+    unique = len(set(scenarios.values()))
+    assert unique >= 10, f"Only {unique} unique scenarios from 50 seeds — too little diversity"
+def test_prohibited_systems_in_hard_mode():
+    """Hard difficulty should sometimes generate prohibited systems."""
+    has_prohibited = False
+    for seed in range(100):
+        sc = generate_procedural_scenario(seed, "hard")
+        if sc.correct_classification == "prohibited":
+            has_prohibited = True
+            break
+    assert has_prohibited, "Hard mode should occasionally generate prohibited systems"
+def test_procedural_works_in_environment():
+    """Procedural scenarios work end-to-end through the environment."""
+    for seed in [1, 42, 100]:
+        env = ComplianceAuditorEnvironment()
+        obs = env.reset(seed=seed, scenario_id=f"procedural_medium_{seed}")
+        assert not env._done
+        assert env._scenario is not None
+        # Run basic audit
+        r = json.loads(env._tool_fns["get_system_overview"]())
+        assert "content" in r
+        assert len(r["content"]) > 100
+        env._tool_fns["classify_system"](risk_category="high_risk")
+        env._tool_fns["check_documentation"]()
+        result = json.loads(env._tool_fns["verify_compliance"](
+            risk_classification="high_risk",
+            overall_assessment="test",
+            key_findings_summary="test"
+        ))
+        assert 0.0 < result["reward"] < 1.0
+def test_procedural_via_get_scenario():
+    """Procedural IDs work through the standard get_scenario interface."""
+    sc = get_scenario("procedural_easy_42")
+    assert sc.scenario_id.startswith("procedural_")
+    assert sc.difficulty == "easy"
+    sc2 = get_scenario("procedural_hard_999")
+    assert sc2.difficulty == "hard"
+    assert len(sc2.ground_truth_findings) >= len(sc.ground_truth_findings)
+def test_all_system_types_reachable():
+    """Every system type template should be reachable from some seed."""
+    seen_systems = set()
+    for seed in range(200):
+        for diff in ["easy", "medium", "hard"]:
+            sc = generate_procedural_scenario(seed, diff)
+            # Key on the last two words of the system name (the system type).
+            seen_systems.add(" ".join(sc.system_name.split(" ")[-2:]))
+    assert len(seen_systems) >= len(SYSTEM_TEMPLATES), \
+        f"Only {len(seen_systems)} system types reached from 200 seeds — some unreachable"
+def test_reward_bounds_procedural():
+    """All procedural scenarios produce rewards strictly inside (0, 1); the intended range is (0.001, 0.999)."""
+    for seed in range(30):
+        for diff in ["easy", "medium", "hard"]:
+            env = ComplianceAuditorEnvironment()
+            env.reset(seed=seed, scenario_id=f"procedural_{diff}_{seed}")
+            env._tool_fns["get_system_overview"]()
+            env._tool_fns["classify_system"](risk_category="high_risk")
+            result = json.loads(env._tool_fns["verify_compliance"](
+                risk_classification="high_risk",
+                overall_assessment="test",
+                key_findings_summary="test"
+            ))
+            assert 0.0 < result["reward"] < 1.0, \
+                f"Reward {result['reward']} out of bounds @ seed={seed} diff={diff}"
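For readers without `scenarios/procedural.py` open, here is a minimal sketch of seed-deterministic generation and of parsing `procedural_{difficulty}_{seed}` IDs. Everything in it is an assumption made for illustration: the template pools, the violation ranges per difficulty, and the `generate` and `parse_procedural_id` helpers are hypothetical and not the generator's actual API.

```python
import random
from dataclasses import dataclass

# Hypothetical template pools; the real SYSTEM_TEMPLATES and
# VIOLATION_TEMPLATES may be structured differently.
SYSTEMS = ["hiring screener", "credit scorer", "chat assistant"]
VIOLATIONS = [
    "missing_documentation", "biased_training_data",
    "no_human_oversight", "no_logging", "opaque_outputs",
]
# Assumed (min, max) violation counts per difficulty.
VIOLATION_RANGE = {"easy": (1, 2), "medium": (2, 3), "hard": (3, 5)}


@dataclass(frozen=True)
class Scenario:
    scenario_id: str
    difficulty: str
    system_name: str
    findings: tuple


def generate(seed: int, difficulty: str) -> Scenario:
    """Build a scenario from a private RNG so the same inputs always match."""
    rng = random.Random(f"{difficulty}:{seed}")
    low, high = VIOLATION_RANGE[difficulty]
    count = rng.randint(low, high)
    return Scenario(
        scenario_id=f"procedural_{difficulty}_{seed}",
        difficulty=difficulty,
        system_name=rng.choice(SYSTEMS),
        findings=tuple(sorted(rng.sample(VIOLATIONS, count))),
    )


def parse_procedural_id(scenario_id: str):
    """Split 'procedural_<difficulty>_<seed>' into (difficulty, seed)."""
    prefix, difficulty, seed = scenario_id.split("_")
    if prefix != "procedural" or difficulty not in VIOLATION_RANGE:
        raise ValueError(f"not a procedural scenario id: {scenario_id}")
    return difficulty, int(seed)


same_a = generate(42, "medium")
same_b = generate(42, "medium")
```

Using a dedicated `random.Random` instance instead of the module-level functions keeps generation reproducible even when other code draws from the global RNG.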

tests/test_reward_hacking.py ADDED

	@@ -0,0 +1,339 @@

+"""Adversarial reward-hacking tests.
+Verifies the 6-component reward function cannot be gamed by common exploits:
+  1. Skip investigation — submit findings without reading documents
+  2. Spam findings — flood with every possible finding string
+  3. Red herring bait — submit red herrings as findings
+  4. Skip to verify — call verify_compliance immediately
+  5. Wrong classification — still score well on other components
+  6. Skip remediation — submit findings but no fixes
+  7. Classify without overview — skip get_system_overview
+  8. Fewer steps than optimal — game efficiency by rushing
+  9. Hallucinate findings — submit plausible-sounding false findings
+ 10. Perfect findings but wrong classification — test component isolation
+If any exploit scores above the expected cap, the reward function has a hole.
+"""
+from server.engine import compute_reward, safe_reward, RewardBreakdown
+from scenarios.registry import get_scenario
+# ── Test helpers ──────────────────────────────────────────────────
+def _score_scenario(
+    scenario_id: str = "medium_hiring_bias_001",
+    seed: int = 42,
+    classification: str = "",
+    findings: list = None,
+    remediation: list = None,
+    tool_sequence: list = None,
+    steps: int = 10,
+) -> RewardBreakdown:
+    sc = get_scenario(scenario_id, seed)
+    return compute_reward(
+        scenario=sc,
+        classification_submitted=classification,
+        findings_submitted=findings or [],
+        remediation_submitted=remediation or [],
+        tool_sequence=tool_sequence or [],
+        steps_taken=steps,
+    )
+def _total(scenario_id="medium_hiring_bias_001", **kwargs):
+    return _score_scenario(scenario_id, **kwargs).total()
+# ── Exploit 1: Skip investigation, submit correct findings ────────
+def test_skip_investigation_penalizes_methodology():
+    """Agent submits perfect findings without calling any investigation tools.
+    Must lose methodology score (15%) — no investigation tools in sequence.
+    """
+    r = _score_scenario(
+        classification="high_risk",
+        findings=[
+            "gender_bias_in_technical_screening",
+            "insufficient_human_oversight",
+            "incomplete_technical_documentation",
+            "missing_fundamental_rights_impact_assessment",
+            "inadequate_data_governance",
+        ],
+        remediation=[
+            "conduct_bias_audit_and_mitigation",
+            "implement_human_review_for_all_rejections",
+            "complete_annex_iv_technical_documentation",
+            "perform_fundamental_rights_impact_assessment",
+            "establish_data_governance_framework",
+        ],
+        tool_sequence=["submit_finding"] * 5 + ["recommend_fix"] * 5 + ["verify_compliance"],
+        steps=11,
+    )
+    assert r.methodology == 0.0, f"methodology should be 0 without investigation tools: {r.methodology}"
+    # Even with perfect findings, losing methodology (15%) is significant
+    assert r.total() < 0.86, f"skip-investigation still scores too high: {r.total()}"
+# ── Exploit 2: Spam every possible finding string ─────────────────
+def test_spam_findings_penalizes_precision():
+    """Agent submits 16 findings hoping to hit all ground truth by volume.
+    False positives should destroy finding_precision (15%).
+    """
+    r = _score_scenario(
+        classification="high_risk",
+        findings=[
+            "gender_bias_in_technical_screening",  # real
+            "insufficient_human_oversight",  # real
+            "incomplete_technical_documentation",  # real
+            "missing_fundamental_rights_impact_assessment",  # real
+            "inadequate_data_governance",  # real
+            # False positives below
+            "prohibited_social_scoring",  # red herring
+            "biometric_processing",  # red herring
+            "gdpr_violation", "data_breach", "missing_ce_marking",
+            "environmental_impact", "accessibility_violation",
+            "patent_infringement", "labor_law_violation",
+            "antitrust_concern", "tax_evasion",
+        ],
+        tool_sequence=["get_system_overview", "classify_system"] + ["submit_finding"] * 16,
+        steps=18,
+    )
+    # 5 true positives out of 16 = low precision
+    assert r.finding_precision < 0.5, f"spam precision too high: {r.finding_precision}"
+# ── Exploit 3: Submit red herrings as findings ────────────────────
+def test_red_herring_submission_penalized():
+    """Agent falls for red herrings and submits them as findings.
+    Must be penalized via finding_precision component.
+    """
+    r = _score_scenario(
+        classification="high_risk",
+        findings=[
+            "prohibited_social_scoring",  # red herring
+            "biometric_processing",  # red herring
+        ],
+        tool_sequence=["get_system_overview", "classify_system", "submit_finding", "submit_finding"],
+        steps=4,
+    )
+    # Two false positives, zero true positives
+    assert r.finding_precision == 0.0, f"red herrings should give 0 precision: {r.finding_precision}"
+    assert r.finding_completeness == 0.0, f"no real findings means 0 completeness: {r.finding_completeness}"
+    assert r.total() < 0.35, f"red-herring-only agent scores too high: {r.total()}"
+# ── Exploit 4: Skip straight to verify_compliance ─────────────────
+def test_immediate_verify_scores_near_floor():
+    """Agent calls verify_compliance with empty inputs on step 1.
+    Should score near the minimum (only safe_reward floor).
+    """
+    r = _score_scenario(
+        classification="",
+        findings=[],
+        remediation=[],
+        tool_sequence=["verify_compliance"],
+        steps=1,
+    )
+    assert r.classification == 0.0
+    assert r.finding_completeness == 0.0
+    assert r.finding_precision == 0.0
+    assert r.remediation == 0.0
+    assert r.methodology == 0.0
+    assert r.total() < 0.05, f"empty verify should be near floor: {r.total()}"
+# ── Exploit 5: Wrong classification but perfect everything else ────
+def test_wrong_classification_costs_20_percent():
+    """Agent gets everything right except classification.
+    Classification carries a 20% weight; the total must drop by at least 0.10.
+    """
+    r_correct = _score_scenario(
+        classification="high_risk",
+        findings=["gender_bias_in_technical_screening", "insufficient_human_oversight",
+                  "incomplete_technical_documentation", "missing_fundamental_rights_impact_assessment",
+                  "inadequate_data_governance"],
+        remediation=["conduct_bias_audit_and_mitigation", "implement_human_review_for_all_rejections",
+                     "complete_annex_iv_technical_documentation", "perform_fundamental_rights_impact_assessment",
+                     "establish_data_governance_framework"],
+        tool_sequence=["get_system_overview", "classify_system", "check_documentation",
+                      "audit_training_data", "verify_human_oversight", "check_transparency",
+                      "assess_risk_management", "check_logging", "submit_finding", "submit_finding",
+                      "submit_finding", "submit_finding", "submit_finding",
+                      "recommend_fix", "recommend_fix", "recommend_fix",
+                      "recommend_fix", "recommend_fix", "verify_compliance"],
+        steps=19,
+    )
+    r_wrong = _score_scenario(
+        classification="minimal_risk",  # WRONG — should be high_risk
+        findings=["gender_bias_in_technical_screening", "insufficient_human_oversight",
+                  "incomplete_technical_documentation", "missing_fundamental_rights_impact_assessment",
+                  "inadequate_data_governance"],
+        remediation=["conduct_bias_audit_and_mitigation", "implement_human_review_for_all_rejections",
+                     "complete_annex_iv_technical_documentation", "perform_fundamental_rights_impact_assessment",
+                     "establish_data_governance_framework"],
+        tool_sequence=["get_system_overview", "classify_system", "check_documentation",
+                      "audit_training_data", "verify_human_oversight", "check_transparency",
+                      "assess_risk_management", "check_logging", "submit_finding", "submit_finding",
+                      "submit_finding", "submit_finding", "submit_finding",
+                      "recommend_fix", "recommend_fix", "recommend_fix",
+                      "recommend_fix", "recommend_fix", "verify_compliance"],
+        steps=19,
+    )
+    gap = r_correct.total() - r_wrong.total()
+    assert gap >= 0.10, f"wrong classification gap too small: {gap:.4f} (correct={r_correct.total():.4f}, wrong={r_wrong.total():.4f})"
+# ── Exploit 6: Perfect findings but zero remediation ──────────────
+def test_no_remediation_loses_15_percent():
+    """Agent identifies all findings but proposes no remediation.
+    Must lose the full 15% remediation weight.
+    """
+    r = _score_scenario(
+        classification="high_risk",
+        findings=["gender_bias_in_technical_screening", "insufficient_human_oversight",
+                  "incomplete_technical_documentation", "missing_fundamental_rights_impact_assessment",
+                  "inadequate_data_governance"],
+        remediation=[],  # no remediation!
+        tool_sequence=["get_system_overview", "classify_system", "check_documentation",
+                      "audit_training_data", "verify_human_oversight", "check_transparency",
+                      "assess_risk_management", "check_logging",
+                      "submit_finding", "submit_finding", "submit_finding",
+                      "submit_finding", "submit_finding", "verify_compliance"],
+        steps=14,
+    )
+    assert r.remediation == 0.0, f"no remediation should give 0: {r.remediation}"
+# ── Exploit 7: Classify without overview ──────────────────────────
+def test_classify_before_overview_penalizes_methodology():
+    """Agent classifies before gathering system overview.
+    Investigation order should be penalized in methodology.
+    """
+    r = _score_scenario(
+        classification="high_risk",
+        findings=["gender_bias_in_technical_screening"],
+        tool_sequence=["classify_system", "get_system_overview", "submit_finding"],
+        steps=3,
+    )
+    # classify_system before get_system_overview is an order violation
+    assert r.methodology < 0.5, f"wrong order methodology too high: {r.methodology}"
+# ── Exploit 8: Fewer steps than optimal games efficiency ──────────
+def test_fewer_steps_than_optimal_penalized():
+    """Agent takes fewer steps than the optimal path.
+    This means skipping investigation — efficiency should be penalized.
+    """
+    r_rushed = _score_scenario(
+        classification="high_risk",
+        findings=["gender_bias_in_technical_screening"],
+        tool_sequence=["verify_compliance"],
+        steps=2,  # way fewer than optimal
+    )
+    r_proper = _score_scenario(
+        classification="high_risk",
+        findings=["gender_bias_in_technical_screening"],
+        tool_sequence=["get_system_overview", "classify_system", "check_documentation",
+                      "audit_training_data", "verify_human_oversight",
+                      "check_transparency", "assess_risk_management", "check_logging",
+                      "submit_finding", "verify_compliance"],
+        steps=12,
+    )
+    assert r_proper.efficiency > r_rushed.efficiency, \
+        f"rushed ({r_rushed.efficiency}) should not beat proper ({r_proper.efficiency})"
+# ── Exploit 9: Hallucinate plausible-sounding findings ────────────
+def test_hallucinated_findings_low_precision():
+    """Agent submits plausible-sounding but wrong findings.
+    Token-based matching should not match these.
+    """
+    r = _score_scenario(
+        classification="high_risk",
+        findings=[
+            "ai_model_lacks_interpretability",
+            "no_audit_trail_for_decisions",
+            "potential_discrimination_in_outputs",
+            "insufficient_testing_methodology",
+        ],
+        tool_sequence=["get_system_overview", "classify_system", "check_documentation",
+                      "submit_finding", "submit_finding", "submit_finding", "submit_finding"],
+        steps=7,
+    )
+    # These don't token-match the ground truth findings
+    assert r.finding_completeness < 0.4, f"hallucinated findings match too well: {r.finding_completeness}"
+# ── Exploit 10: Perfect on prohibited scenario with wrong class ───
+def test_prohibited_classified_as_high_risk():
+    """Agent correctly finds violations but classifies prohibited as high_risk.
+    Partial classification match should give 40% credit, not full.
+    """
+    r = _score_scenario(
+        scenario_id="hard_social_scoring_prohibited_001",
+        classification="high_risk",  # wrong — should be prohibited
+        findings=["prohibited_social_scoring_system", "disguised_as_voluntary_wellness",
+                  "affects_access_to_public_services"],
+        tool_sequence=["get_system_overview", "classify_system", "submit_finding",
+                      "submit_finding", "submit_finding", "verify_compliance"],
+        steps=6,
+    )
+    assert r.classification == 0.4, f"adjacent classification should be 0.4: {r.classification}"
+# ── Sanity: perfect run on medium hiring ──────────────────────────
+def test_perfect_run_scores_high():
+    """A perfect audit should score above 0.85."""
+    r = _score_scenario(
+        classification="high_risk",
+        findings=["gender_bias_in_technical_screening", "insufficient_human_oversight",
+                  "incomplete_technical_documentation", "missing_fundamental_rights_impact_assessment",
+                  "inadequate_data_governance"],
+        remediation=["conduct_bias_audit_and_mitigation", "implement_human_review_for_all_rejections",
+                     "complete_annex_iv_technical_documentation",
+                     "perform_fundamental_rights_impact_assessment",
+                     "establish_data_governance_framework"],
+        tool_sequence=["get_system_overview", "classify_system", "check_documentation",
+                      "audit_training_data", "verify_human_oversight", "check_transparency",
+                      "assess_risk_management", "check_logging",
+                      "submit_finding", "submit_finding", "submit_finding",
+                      "submit_finding", "submit_finding",
+                      "recommend_fix", "recommend_fix", "recommend_fix",
+                      "recommend_fix", "recommend_fix",
+                      "verify_compliance"],
+        steps=19,
+    )
+    assert r.total() > 0.85, f"perfect run too low: {r.total()}"
+    assert r.classification == 1.0
+    assert r.methodology > 0.8
+# ── Bounds: all rewards strictly in (0, 1) ────────────────────────
+def test_reward_bounds_all_scenarios():
+    """Every scenario × various inputs must produce a reward strictly inside (0, 1); the intended range is (0.001, 0.999)."""
+    from scenarios.registry import SCENARIO_LIST
+    for sc_info in SCENARIO_LIST:
+        for cls in ["", "prohibited", "high_risk", "limited_risk", "minimal_risk", "garbage"]:
+            for findings in [[], ["some_finding"], ["a", "b", "c", "d", "e", "f"]]:
+                r = _total(
+                    scenario_id=sc_info["id"],
+                    classification=cls,
+                    findings=findings,
+                    tool_sequence=["verify_compliance"],
+                    steps=1,
+                )
+                assert 0.0 < r < 1.0, \
+                    f"out of range: {r} @ {sc_info['id']} cls={cls} findings={len(findings)}"
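The spam and bounds tests above depend on two ideas: precision falls as false positives pile up, and the final reward is clamped away from 0 and 1. The sketch below illustrates both under stated assumptions. It uses exact string matching, whereas the tests mention token-based matching, and `clamp_reward` is a hypothetical stand-in for `safe_reward`, whose real signature is not shown in this diff.

```python
def precision_and_completeness(submitted, ground_truth):
    """Score submitted findings against ground truth by exact match.

    Assumption: exact string equality; the real scorer reportedly
    uses token-based matching.
    """
    submitted_set, truth = set(submitted), set(ground_truth)
    true_positives = len(submitted_set & truth)
    precision = true_positives / len(submitted_set) if submitted_set else 0.0
    completeness = true_positives / len(truth) if truth else 0.0
    return precision, completeness


def clamp_reward(raw, floor=0.001, ceiling=0.999):
    """Hypothetical clamp keeping the reward strictly inside (0, 1)."""
    return max(floor, min(ceiling, raw))


truth = [
    "gender_bias_in_technical_screening",
    "insufficient_human_oversight",
    "incomplete_technical_documentation",
    "missing_fundamental_rights_impact_assessment",
    "inadequate_data_governance",
]
# Five real findings plus eleven made-up ones, mirroring the spam exploit.
spam = truth + [f"false_positive_{i}" for i in range(11)]
spam_precision, spam_completeness = precision_and_completeness(spam, truth)
```

In this sketch the spam run reaches full completeness but only 5/16 precision, which is why a separate precision component is needed to make volume-based guessing unprofitable.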

tests/test_stress.py ADDED

	@@ -0,0 +1,94 @@

+"""Stress tests — prove robustness across many random seeds and scenarios.
+Runs 50 seeds × 9 scenarios = 450 episodes to verify:
+  1. Every scenario instantiates without error
+  2. Every tool returns valid JSON with content
+  3. Every reward is strictly in (0.001, 0.999)
+  4. Different seeds produce varied documents (randomization works)
+  5. State graphs are consistent across seeds
+  6. Adaptive depth works on the flagship scenario
+"""
+import json
+import pytest
+from server.environment import ComplianceAuditorEnvironment
+from scenarios.registry import get_scenario, SCENARIO_LIST
+SEEDS = list(range(1, 51))  # 50 seeds
+@pytest.mark.parametrize("scenario_info", SCENARIO_LIST, ids=lambda s: s["id"])
+def test_all_seeds_produce_valid_episodes(scenario_info):
+    """Every seed × scenario produces valid tool responses and bounded reward."""
+    sid = scenario_info["id"]
+    seen_overviews = set()
+    for seed in SEEDS:
+        env = ComplianceAuditorEnvironment()
+        env.reset(seed=seed, scenario_id=sid)
+        # Call overview
+        overview = json.loads(env._tool_fns["get_system_overview"]())
+        assert "content" in overview, f"seed={seed}: overview missing content"
+        assert len(overview["content"]) > 50, f"seed={seed}: overview too short"
+        # Use a section that contains randomized params (company, version appear in middle)
+        seen_overviews.add(overview["content"][50:200])
+        # Classify
+        env._tool_fns["classify_system"](risk_category="high_risk")
+        # Call one investigation tool
+        doc = json.loads(env._tool_fns["check_documentation"]())
+        assert "content" in doc, f"seed={seed}: doc missing content"
+        assert len(doc["content"]) > 100, f"seed={seed}: doc too short"
+        # Submit finding + verify
+        env._tool_fns["submit_finding"](finding="test_finding")
+        result = json.loads(env._tool_fns["verify_compliance"](
+            risk_classification="high_risk",
+            overall_assessment="test",
+            key_findings_summary="test"
+        ))
+        reward = result["reward"]
+        assert 0.0 < reward < 1.0, f"seed={seed}: reward {reward} out of bounds"
+    # Randomization: across 50 seeds, we should see at least 3 unique overviews
+    assert len(seen_overviews) >= 3, \
+        f"Only {len(seen_overviews)} unique overviews across 50 seeds — randomization may be broken"
+@pytest.mark.parametrize("scenario_info", SCENARIO_LIST, ids=lambda s: s["id"])
+def test_graph_consistency_across_seeds(scenario_info):
+    """State graph topology must be identical regardless of seed."""
+    sid = scenario_info["id"]
+    base_graph = None
+    for seed in [1, 42, 100, 999]:
+        sc = get_scenario(sid, seed)
+        sig = tuple(sorted(
+            (t.from_state, t.to_state, t.tool_name, t.outcome)
+            for t in sc.graph.transitions
+        ))
+        if base_graph is None:
+            base_graph = sig
+        else:
+            assert sig == base_graph, f"Graph differs at seed={seed}"
+def test_adaptive_depth_on_medium_hiring():
+    """Repeat calls reveal deeper content on the flagship scenario."""
+    env = ComplianceAuditorEnvironment()
+    env.reset(seed=42, scenario_id="medium_hiring_bias_001")
+    env._tool_fns["get_system_overview"]()
+    env._tool_fns["classify_system"](risk_category="high_risk")
+    r1 = json.loads(env._tool_fns["audit_training_data"]())
+    r2 = json.loads(env._tool_fns["audit_training_data"]())
+    assert len(r2["content"]) > len(r1["content"]), \
+        "Second call should reveal deeper content"
+    assert "DEEP DIVE" in r2["content"], \
+        "Second call should contain forensic deep dive"
+    assert "note" in r2, \
+        "Second call should have a note about deep dive"
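The adaptive-depth test above expects a second call to the same tool to return longer content with a deep-dive section and a `note` field. A minimal sketch of one way to get that behavior is shown here; the `AdaptiveDocumentStore` class and its document strings are hypothetical and not taken from `server/environment.py`.

```python
from collections import Counter


class AdaptiveDocumentStore:
    """Hypothetical store that reveals more detail on repeat reads."""

    def __init__(self, summaries, deep_dives):
        self._summaries = summaries
        self._deep_dives = deep_dives
        self._calls = Counter()  # per-tool call counts

    def read(self, tool_name):
        self._calls[tool_name] += 1
        response = {"content": self._summaries[tool_name]}
        # From the second call onward, append the deep-dive section.
        if self._calls[tool_name] > 1 and tool_name in self._deep_dives:
            response["content"] += (
                "\n\n=== DEEP DIVE ===\n" + self._deep_dives[tool_name]
            )
            response["note"] = "Repeat call: deep-dive content included."
        return response


store = AdaptiveDocumentStore(
    summaries={"audit_training_data": "Training data summary."},
    deep_dives={"audit_training_data": "Per-group selection rates."},
)
first = store.read("audit_training_data")
second = store.read("audit_training_data")
```

Keeping the counter per tool name means repeating one investigation does not unlock deep-dive content for the others.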