Itachi-1824 committed on
Commit
107f92d
·
1 Parent(s): 9109173

feat: investigation-grade overhaul + procedural generation


- investigation-grade documents: 92KB of regulatory text requiring genuine analysis
- 9 hand-crafted scenarios + infinite procedural generation (91K+ unique combos)
- adaptive document depth: repeat tool calls reveal forensic deep-dive content
- dynamic audit state: environment responds to findings and remediations
- evidence chain validation: warns when findings lack supporting investigation
- post-remediation overlays: environment reacts to proposed fixes
- seed-based noise injection: every episode has unique numbers/percentages
- 6 unique state graph topologies across scenarios
- 7-tab gradio dashboard with compliance map and anti-gaming showcase
- 74 tests across 8 files (anti-gaming, evidence chain, stress, procedural)
- 10-model benchmark runner for leaderboard generation

README.md CHANGED
@@ -11,72 +11,127 @@ tags:
11
 
12
  # EU AI Act Compliance Auditor
13
 
14
- An MCP-based environment where LLM agents audit AI systems for EU AI Act compliance from risk classification to violation identification to remediation planning. Scenarios based on real regulatory articles. Parameter randomization on every reset prevents memorization; agents must learn the **audit process**, not specific answers.
15
 
16
- ## Why This Environment
17
 
18
- The EU AI Act's major enforcement deadline is **August 2, 2026** — less than 4 months away. Every company deploying AI in Europe faces fines up to **EUR 35 million or 7% of global revenue**. Yet no automated compliance auditing benchmark exists. This environment fills that gap with 8 realistic scenarios across the full spectrum of EU AI Act risk categories.
 
 
 
 
 
 
 
 
 
 
19
 
20
  ## Stats
21
 
22
  | Metric | Value |
23
  |--------|-------|
24
- | Scenarios | 8 |
25
- | MCP Tools | 11 |
26
- | Reward Components | 6 |
27
- | Difficulty Tiers | 3 (easy / medium / hard) |
28
- | State Graph Nodes | 12 per scenario |
 
 
 
 
 
 
29
  | Parameter Randomization | Company, region, version, dates per reset |
30
 
31
- ## Tools (MCP Interface)
32
 
33
- ### Investigation
34
- | Tool | Description |
35
- |------|-------------|
36
- | `get_system_overview` | Gather system description, deployer info, deployment context |
37
- | `classify_system` | Classify risk level (prohibited / high_risk / limited_risk / minimal_risk) |
38
- | `check_documentation` | Review Annex IV technical documentation completeness |
39
- | `audit_training_data` | Check bias, representativeness, data governance (Article 10) |
40
- | `verify_human_oversight` | Verify Article 14 human-in-the-loop mechanisms |
41
- | `check_transparency` | Check Article 50 transparency obligations |
42
- | `assess_risk_management` | Review risk management system (Article 9) |
43
- | `check_logging` | Verify automatic logging and traceability (Article 12) |
44
 
45
- ### Resolution
46
- | Tool | Description |
47
- |------|-------------|
48
- | `submit_finding` | Report a compliance violation (call per finding) |
49
- | `recommend_fix` | Propose remediation with priority |
50
- | `verify_compliance` | Final determination — triggers terminal reward |
51
 
52
- ## Scenarios
 
 
 
53
 
54
- ### Easy
55
- - **Customer Service Chatbot** — Limited-risk system missing AI disclosure (Article 50)
56
- - **Music Recommendation Engine** — Minimal-risk system needing voluntary code of conduct
57
 
58
- ### Medium
59
- - **AI Resume Screener** — High-risk hiring AI (Annex III) with gender bias, missing oversight, incomplete documentation
60
- - **Credit Scoring Model** — High-risk fintech system with opaque features and no right to human review
61
- - **Emergency Triage AI** — Medical device with age bias and no prospective clinical validation
62
 
63
- ### Hard
64
- **Citizen Wellness App** — **PROHIBITED** social scoring system disguised as a voluntary wellness tool. Must identify it as prohibited under Article 5(1)(c)
65
- **AI Content Studio** — Deepfake generation platform missing all Article 50 transparency obligations
66
- **Corporate AI Portfolio** — Multi-system audit with 4 interconnected AI systems sharing a data lake. Must identify compound risks and cross-system data flow issues
67
 
68
  ## 6-Component Reward
69
 
70
- | Component | Weight | Description |
71
  |-----------|--------|-------------|
72
- | Classification | 20% | Correct risk category identification |
73
- | Finding Completeness | 25% | Recall of ground-truth violations |
74
- | Finding Precision | 15% | Penalty for false positives / red herring findings |
75
- | Remediation Quality | 15% | Correct fixes in priority order |
76
- | Methodology | 15% | Followed correct audit sequence (overview → classify → investigate → find → fix → verify) |
77
- | Efficiency | 10% | Queries used vs optimal path |
 
 
78
 
79
- All rewards clamped to (0.01, 0.99) for OpenEnv validator compliance.
80
 
81
  ## Quick Start
82
 
@@ -87,67 +142,26 @@ pip install "openenv-core[core]" fastmcp gradio httpx openai
87
  # Run locally
88
  uvicorn server.app:app --host 0.0.0.0 --port 7860
89
 
90
- # Run inference
91
  export API_BASE_URL="https://integrate.api.nvidia.com/v1"
92
- export MODEL_NAME="google/gemma-4-31b-it"
93
- export HF_TOKEN="your-key"
94
  python inference.py --space https://Itachi1824-compliance-auditor-env.hf.space
95
 
96
  # Docker
97
  docker build -t compliance-env . && docker run -p 7860:7860 compliance-env
98
- ```
99
-
100
- ## API
101
-
102
- ### Standard OpenEnv
103
- - `POST /reset` — Start new episode
104
- - `POST /step` — Execute action
105
- - `GET /state` — Get episode state
106
- - `GET /health` — Health check
107
 
108
- ### Custom HTTP Session API
109
- - `POST /api/reset` — Create session, returns tools + observation
110
- - `POST /api/call_tool` — Call an audit tool in a session
111
- - `POST /api/close` — End session
112
-
113
- ## Architecture
114
-
115
- ```
116
- compliance_env/
117
- ├── server/
118
- │ ├── app.py # FastAPI + sessions + Gradio UI
119
- │ ├── environment.py # MCP environment with 11 tools
120
- │ └── engine.py # State graph + 6-component reward
121
- ├── scenarios/
122
- │ └── registry.py # 8 scenarios with state graphs
123
- ├── client.py # HTTP client for inference
124
- ├── inference.py # OpenAI function-calling agent
125
- ├── models.py # Pydantic observation/state models
126
- ├── Dockerfile # Port 7860, python:3.11-slim
127
- └── openenv.yaml # OpenEnv manifest with tasks
128
  ```
129
 
130
- ## Baseline Scores
131
-
132
- Tested against live HF Space with NVIDIA NIM models:
133
-
134
- | Rank | Model | Easy | Medium | Hard | Overall |
135
- |------|-------|------|--------|------|---------|
136
- | 1 | stepfun-ai/step-3.5-flash | 0.473 | 0.425 | 0.404 | **0.434** |
137
- | 2 | mistralai/mistral-small-4-119b | 0.457 | 0.425 | 0.348 | **0.410** |
138
- | 3 | deepseek-ai/deepseek-v3.1 | 0.442 | 0.425 | 0.348 | **0.405** |
139
 
140
- Hard scenarios genuinely challenge frontier models — the prohibited social scoring detection requires the agent to see through deliberate misdirection ("wellness app" that's actually social scoring affecting public service access).
141
-
142
- ## Sample Output
143
-
144
- ```
145
- [START] task=easy_chatbot_transparency_001 env=compliance_auditor_env model=google/gemma-4-31b-it
146
- [STEP] step=1 action=get_system_overview reward=0.00 done=false error=null
147
- [STEP] step=2 action=classify_system reward=0.00 done=false error=null
148
- [STEP] step=3 action=check_documentation reward=0.00 done=false error=null
149
- [STEP] step=4 action=check_transparency reward=0.00 done=false error=null
150
- [STEP] step=5 action=submit_finding reward=0.00 done=false error=null
151
- [STEP] step=6 action=verify_compliance reward=0.46 done=true error=null
152
- [END] success=true steps=6 score=0.457 rewards=0.00,0.00,0.00,0.00,0.00,0.46
153
- ```
 
11
 
12
  # EU AI Act Compliance Auditor
13
 
14
+ An MCP environment where LLM agents audit AI systems for EU AI Act compliance. Tools return **investigation-grade regulatory documents** (statistical tables, documentation inventories, operational procedures) that require genuine analysis to identify violations. No pre-digested verdicts. The agent must reason about evidence across documents to find compliance gaps.
15
 
16
+ ## What Makes This Different
17
 
18
+ Most compliance environments hand the agent pre-labeled answers: `"bias_assessment": "FAILED"`. This environment returns the **raw evidence**:
19
+
20
+ ```
21
+ CALLBACK RATES BY DEMOGRAPHIC (Technical Roles Only):
22
+ Group Rate vs Baseline
23
+ Male applicants 34.2% (baseline)
24
+ Female applicants 26.3% -23.1%
25
+ Eastern EU 27.4% -19.9%
26
+ ```
27
+
28
+ The agent must identify the 23% callback disparity from the table, recognize it as gender bias, cross-reference with the oversight document showing only 5% of rejections are reviewed, and connect these into actionable findings.
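The "vs Baseline" column in the table above is a relative difference from the male callback rate, which can be checked by hand. The sketch below recomputes it from the three rates shown; the function and variable names are illustrative and are not taken from the repository.

```python
def relative_gap(rate: float, baseline: float) -> float:
    """Relative difference from the baseline rate, in percent."""
    return (rate - baseline) / baseline * 100.0


# Callback rates copied from the README table (percent).
callback_rates = {
    "Male applicants": 34.2,
    "Female applicants": 26.3,
    "Eastern EU": 27.4,
}

baseline = callback_rates["Male applicants"]
gaps = {
    group: round(relative_gap(rate, baseline), 1)
    for group, rate in callback_rates.items()
    if group != "Male applicants"
}
print(gaps)
```

The female-applicant gap works out to roughly 23% relative to the baseline, which is the figure the agent is expected to derive from the raw table rather than read from a label.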
29
 
30
  ## Stats
31
 
32
  | Metric | Value |
33
  |--------|-------|
34
+ | Fixed Scenarios | 9 across 3 difficulty tiers |
35
+ | Procedural Scenarios | Infinite (seed-based generation) |
36
+ | MCP Tools | 11 (8 investigation + 3 resolution) |
37
+ | Reward Components | 6 (weighted, anti-gaming) |
38
+ | Graph Topologies | 6 unique per-scenario |
39
+ | Document Depth | 500-3,275 chars per tool response |
40
+ | Total Document Content | 77K+ chars across all scenarios |
41
+ | Anti-Gaming Tests | 12 adversarial exploits proven ineffective |
42
+ | Test Suite | 74 tests across 8 files |
43
+ | Adaptive Depth | Repeat tool calls reveal forensic deep-dive |
44
+ | Dynamic State | Environment reacts to findings and remediations |
45
  | Parameter Randomization | Company, region, version, dates per reset |
46
 
47
+ ## Scenarios
48
 
49
+ ### Easy (2) — Clear-cut systems, focused investigation
50
+ - **Customer Service Chatbot** — Limited-risk. Missing AI disclosure under Article 50. Agent checks transparency and oversight.
51
+ - **Music Recommendation Engine** — Minimal-risk. Voluntary code of conduct recommended. Short investigation path.
 
 
 
 
 
 
 
 
52
 
53
+ ### Medium (4) — Statistical evidence, red herrings, multi-article violations
54
+ - **AI Resume Screener** — High-risk hiring AI (Annex III). 5 findings: gender bias (23% callback gap), insufficient oversight (5% review rate), missing FRIA, incomplete Annex IV docs, data governance gaps.
55
+ - **Credit Scoring Model** — High-risk fintech. Opaque alternative data features (social media, device metadata), no right to human review, missing conformity assessment.
56
- **Emergency Triage AI** — Medical device under dual regulation (MDR + AI Act). Age bias in 75+ cohort (76.3% sensitivity), retrospective-only validation, no real-time monitoring.
57
- **Workplace Emotion Recognition** — **PROHIBITED** under Article 5(1)(f). Webcam-based "engagement analytics" that's actually emotion recognition. Deployer frames it as a productivity tool — agent must recognize it processes biometric data (facial action units, micro-expressions) without medical/safety exception.
 
58
 
59
+ ### Hard (3) — Disguised systems, compound risks, multi-system dependencies
60
+ - **Citizen Wellness App** — **PROHIBITED** social scoring disguised as voluntary wellness tool. Deployer frames it as gamification, but investigation reveals it controls access to public services based on social behavior scores. Agent must see through the framing.
61
+ - **AI Content Studio** — Deepfake generation platform. Missing all Article 50 content labeling, no C2PA watermarking, no content provenance. Political content generated without disclosure.
62
+ - **Corporate AI Portfolio** — 4 interconnected AI systems sharing a data lake. Agent must identify cross-system data flows amplifying risks, recognize employee sentiment analysis as high-risk, and spot biometric categorization in safety monitoring.
63
 
64
+ ## Procedural Scenario Generator
 
 
65
 
66
+ Beyond the 9 hand-crafted scenarios, a seed-based procedural generator produces **infinite unique scenarios** by combining:
 
 
 
67
 
68
+ - **5 system types**: Drone delivery (critical infrastructure), exam proctoring (education), insurance adjudication (essential services), legal research (limited risk), predictive policing (prohibited)
69
+ - **16 violation templates**: Gender bias, age discrimination, data governance gaps, missing conformity, logging inadequacies, and more
70
+ - **5 red herring templates**: GDPR confusion, compliant sibling systems, ISO certifications, voluntary ethics boards
71
+
72
+ ```python
73
+ # Any seed produces a unique, coherent scenario
74
+ env.reset(scenario_id="procedural_medium_42") # Seed 42, medium difficulty
75
+ env.reset(scenario_id="procedural_hard_12345") # Seed 12345, hard difficulty
76
+ ```
77
+
78
+ Each generated scenario has proper ground truth findings, matching state graph, violation-specific documents, and is fully compatible with the 6-component reward function.
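As a rough illustration of how a scenario ID such as `procedural_medium_42` could map to a reproducible scenario, the sketch below parses the ID and drives every random choice from a private `random.Random(seed)`. It is not the generator in `scenarios/procedural.py`: the pool contents (apart from the `drone_delivery` id), the per-difficulty violation counts, and the function names are all assumptions made for the example.

```python
import random
from typing import Dict, Tuple

# Hypothetical pools. Only "drone_delivery" appears as an id in the commit;
# the other names are stand-ins for the templates the README describes.
SYSTEM_TYPES = ["drone_delivery", "exam_proctoring", "insurance_adjudication",
                "legal_research", "predictive_policing"]
VIOLATIONS = ["gender_bias", "age_discrimination", "data_governance_gap",
              "missing_conformity", "logging_inadequate"]
RED_HERRINGS = ["gdpr_confusion", "compliant_sibling_system",
                "iso_certification", "voluntary_ethics_board"]
# Assumed mapping from difficulty to the number of injected violations.
VIOLATIONS_PER_DIFFICULTY = {"easy": 1, "medium": 3, "hard": 4}


def parse_scenario_id(scenario_id: str) -> Tuple[str, int]:
    """Split 'procedural_<difficulty>_<seed>' into (difficulty, seed)."""
    prefix, difficulty, seed_text = scenario_id.split("_", 2)
    if prefix != "procedural" or difficulty not in VIOLATIONS_PER_DIFFICULTY:
        raise ValueError(f"not a procedural scenario id: {scenario_id!r}")
    return difficulty, int(seed_text)


def sketch_scenario(scenario_id: str) -> Dict[str, object]:
    """Build a toy scenario; the same ID always yields the same result."""
    difficulty, seed = parse_scenario_id(scenario_id)
    rng = random.Random(seed)  # private RNG, independent of global state
    count = VIOLATIONS_PER_DIFFICULTY[difficulty]
    return {
        "difficulty": difficulty,
        "system_type": rng.choice(SYSTEM_TYPES),
        "violations": sorted(rng.sample(VIOLATIONS, count)),
        "red_herring": rng.choice(RED_HERRINGS),
        # Seed-based noise so reported percentages differ per episode.
        "noise_pct": round(rng.uniform(-2.0, 2.0), 1),
    }


print(sketch_scenario("procedural_medium_42"))
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps a scenario reproducible even when other code consumes random numbers in between.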
79
+
80
+ ## Tools
81
+
82
+ ### Investigation
83
+ | Tool | Returns |
84
+ |------|---------|
85
+ | `get_system_overview` | Formal audit assignment brief with system description and deployment context |
86
+ | `classify_system` | Records risk classification (prohibited / high_risk / limited_risk / minimal_risk) |
87
+ | `check_documentation` | Annex IV cross-reference table with per-section compliance status |
88
+ | `audit_training_data` | Demographic statistics tables, data governance assessment, bias indicators |
89
+ | `verify_human_oversight` | Operational procedures extract with review statistics and override capabilities |
90
+ | `check_transparency` | User-facing UI/ToS text analysis with Article 50 compliance indicators |
91
+ | `assess_risk_management` | Risk register, conformity assessment tracker, Annex III classification analysis |
92
+ | `check_logging` | Audit log schema, Article 12 requirements gap analysis |
93
+
94
+ ### Resolution
95
+ | Tool | Purpose |
96
+ |------|---------|
97
+ | `submit_finding` | Report a compliance violation (call once per finding) |
98
+ | `recommend_fix` | Propose remediation with priority |
99
+ | `verify_compliance` | Final determination — triggers terminal 6-component reward |
100
 
101
  ## 6-Component Reward
102
 
103
+ | Component | Weight | Anti-Gaming |
104
  |-----------|--------|-------------|
105
+ | Classification | 20% | Adjacent-category partial credit (40%). Wrong by 2+ categories = 0. |
106
+ | Finding Completeness | 25% | Token-based fuzzy matching (Jaccard 40%, min 2 tokens). Prevents keyword stuffing. |
107
+ | Finding Precision | 15% | Red herring submissions penalized 15% each. False positives reduce score. |
108
+ | Remediation Quality | 15% | Presence (70%) + priority ordering (30%). Missing remediation = 0. |
109
+ | Methodology | 15% | Order violations penalized. Skipping investigation tools = 0. |
110
+ | Efficiency | 10% | Fewer steps than optimal = penalty (skipping investigation). More steps = diminishing returns. |
111
+
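The Finding Completeness row mentions token-based Jaccard matching with a 40% threshold and a two-token minimum. One plausible reading of that rule is sketched below; the tokenizer, the function names, and the exact way the two conditions combine are assumptions, since the commit page does not show `server/engine.py`.

```python
import re
from typing import Set


def tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens; underscores and spaces both split."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def fuzzy_match(submitted: str, truth: str,
                threshold: float = 0.4, min_shared: int = 2) -> bool:
    """True when the finding shares enough tokens with the ground truth.

    Jaccard similarity = |A & B| / |A | B|. Padding a submission with
    unrelated keywords grows the union and pushes the ratio down.
    """
    a, b = tokens(submitted), tokens(truth)
    union = a | b
    if not union:
        return False
    shared = a & b
    return len(shared) >= min_shared and len(shared) / len(union) >= threshold


print(fuzzy_match("gender bias in callbacks", "gender_bias"))
print(fuzzy_match("gender bias age data logging oversight transparency risk",
                  "gender_bias"))
```

Under these assumptions a focused finding matches, while a keyword-stuffed one does not, because the extra tokens enlarge the union without adding shared tokens.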
112
+ All rewards clamped to (0.001, 0.999). 12 adversarial tests prove robustness.
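To make the weighting concrete, the sketch below combines six component scores with the weights from the table and clamps the total to the stated range. The component keys, the category ordering used for "adjacent" partial credit, and the function names are assumptions for illustration; the actual computation lives in `server/engine.py`, which this page does not show.

```python
from typing import Dict

# Weights as listed in the reward table.
WEIGHTS: Dict[str, float] = {
    "classification": 0.20,
    "finding_completeness": 0.25,
    "finding_precision": 0.15,
    "remediation_quality": 0.15,
    "methodology": 0.15,
    "efficiency": 0.10,
}

# Assumed ordering for deciding which risk categories count as adjacent.
RISK_ORDER = ["prohibited", "high_risk", "limited_risk", "minimal_risk"]


def classification_score(predicted: str, actual: str) -> float:
    """Full credit for exact match, 40% for adjacent, 0 for two or more apart."""
    distance = abs(RISK_ORDER.index(predicted) - RISK_ORDER.index(actual))
    return {0: 1.0, 1: 0.4}.get(distance, 0.0)


def total_reward(components: Dict[str, float]) -> float:
    """Weighted sum of component scores in [0, 1], clamped to [0.001, 0.999]."""
    raw = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return max(0.001, min(0.999, raw))


example = {
    "classification": classification_score("high_risk", "prohibited"),
    "finding_completeness": 0.6,
    "finding_precision": 1.0,
    "remediation_quality": 0.7,
    "methodology": 1.0,
    "efficiency": 0.5,
}
print(round(total_reward(example), 3))
```

Because the clamp is applied after the weighted sum, a perfect episode still reports 0.999 and a fully failed one reports 0.001.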
113
 
114
+ ## Architecture
115
+
116
+ ```
117
+ compliance_env/
118
+ server/
119
+ environment.py # MCP environment, 11 tools, dynamic audit state
120
+ engine.py # State graph + 6-component reward computation
121
+ app.py # FastAPI + HTTP session API + Gradio UI
122
+ gradio_landing.py # 7-tab dashboard with investigation depth showcase
123
+ scenarios/
124
+ registry.py # 8 scenarios with 77K+ chars of investigation documents
125
+ tests/
126
+ test_environment.py # 14 environment + API tests
127
+ test_reward_hacking.py # 12 adversarial anti-gaming tests
128
+ test_investigation_depth.py # 10 investigation quality tests
129
+ inference.py # OpenAI function-calling baseline agent
130
+ client.py # Zero-dependency HTTP client
131
+ models.py # Pydantic observation/state models
132
+ Dockerfile # python:3.11-slim, port 7860
133
+ openenv.yaml # OpenEnv manifest with tasks
134
+ ```
135
 
136
  ## Quick Start
137
 
 
142
  # Run locally
143
  uvicorn server.app:app --host 0.0.0.0 --port 7860
144
 
145
+ # Run inference (NVIDIA NIM)
146
  export API_BASE_URL="https://integrate.api.nvidia.com/v1"
147
+ export MODEL_NAME="stepfun-ai/step-3.5-flash"
148
+ export HF_TOKEN="nvapi-..."
149
  python inference.py --space https://Itachi1824-compliance-auditor-env.hf.space
150
 
151
  # Docker
152
  docker build -t compliance-env . && docker run -p 7860:7860 compliance-env
 
 
 
 
 
 
 
 
 
153
 
154
+ # Tests
155
+ pytest tests/ -v
156
  ```
157
 
158
+ ## API Endpoints
 
 
 
 
 
 
 
 
159
 
160
+ | Endpoint | Method | Description |
161
+ |----------|--------|-------------|
162
+ | `/api/reset` | POST | Create session, returns tools + initial observation |
163
+ | `/api/call_tool` | POST | Call an audit tool in an active session |
164
+ | `/api/close` | POST | End session and cleanup |
165
+ | `/tasks` | GET | List available scenarios |
166
+ | `/grader` | POST | Grade a completed episode |
167
+ | `/health` | GET | Health check |
 
 
 
 
 
 
benchmark_leaderboard.py ADDED
@@ -0,0 +1,204 @@
+"""
+Leaderboard benchmark runner — 10 models across 3 NIM API keys.
+
+Distributes models across keys to maximize throughput (40 RPM per key).
+Runs all 9 fixed scenarios per model. Saves results to outputs/leaderboard/scores.json.
+
+Usage:
+    set NVIDIA_API_KEY_1=nvapi-...
+    set NVIDIA_API_KEY_2=nvapi-...
+    set NVIDIA_API_KEY_3=nvapi-...
+    python benchmark_leaderboard.py --space https://Itachi1824-compliance-auditor-env.hf.space
+"""
+
+import argparse
+import asyncio
+import json
+import os
+import sys
+import time
+from pathlib import Path
+from typing import Dict, List
+
+from openai import OpenAI
+
+# Import from our inference module
+from inference import run_episode, mcp_tools_to_openai
+from client import ComplianceAuditorHTTP
+from scenarios.registry import SCENARIO_LIST
+
+# ---------------------------------------------------------------------------
+# Configuration
+# ---------------------------------------------------------------------------
+
+API_BASE = "https://integrate.api.nvidia.com/v1"
+
+# 10 models distributed across 3 API keys for parallel execution
+MODEL_GROUPS = [
+    # Key 1: Tier S + A models (4 models)
+    {
+        "key_env": "NVIDIA_API_KEY_1",
+        "models": [
+            "deepseek-ai/deepseek-v3.1",
+            "stepfun-ai/step-3.5-flash",
+            "qwen/qwen3.5-122b-a10b",
+            "meta/llama-4-scout-17b-16e-instruct",
+        ],
+    },
+    # Key 2: Tier A models (3 models)
+    {
+        "key_env": "NVIDIA_API_KEY_2",
+        "models": [
+            "mistralai/mistral-large-3-675b-instruct-2512",
+            "google/gemma-4-31b-it",
+            "meta/llama-4-maverick-17b-128e-instruct",
+        ],
+    },
+    # Key 3: Tier A/B models (3 models)
+    {
+        "key_env": "NVIDIA_API_KEY_3",
+        "models": [
+            "nvidia/llama-3.1-nemotron-ultra-253b-v1",
+            "nvidia/nemotron-3-super-120b-a12b",
+            "meta/llama-3.3-70b-instruct",
+        ],
+    },
+]
+
+SCENARIOS = [s["id"] for s in SCENARIO_LIST if not s["id"].startswith("procedural")]
+
+
+async def benchmark_model(
+    model: str,
+    api_key: str,
+    base_url: str,
+    tools: List[Dict],
+) -> Dict:
+    """Run all scenarios for a single model."""
+    llm = OpenAI(base_url=API_BASE, api_key=api_key, timeout=120.0)
+    results = {}
+
+    for sid in SCENARIOS:
+        difficulty = next(s["difficulty"] for s in SCENARIO_LIST if s["id"] == sid)
+        try:
+            async with ComplianceAuditorHTTP(base_url=base_url) as env:
+                result = await run_episode(env, llm, model, tools, difficulty, sid)
+                score = max(0.001, min(0.999, result.get("reward", 0.01)))
+                results[sid] = {"score": round(score, 4), "steps": result.get("steps", 0)}
+                print(f" {model:50s} | {sid:50s} | score={score:.4f} | steps={result.get('steps', 0)}", flush=True)
+        except Exception as e:
+            err_msg = str(e)[:80]
+            print(f" {model:50s} | {sid:50s} | ERROR: {err_msg}", flush=True)
+            results[sid] = {"score": 0.01, "steps": 0, "error": err_msg}
+
+        # Rate limit: ~2s between episodes to stay under 40 RPM
+        await asyncio.sleep(2)
+
+    return results
+
+
+async def benchmark_group(
+    group: Dict,
+    base_url: str,
+    tools: List[Dict],
+) -> List[Dict]:
+    """Run all models in a key group sequentially (same API key)."""
+    key = os.environ.get(group["key_env"], "")
+    if not key:
+        print(f"WARNING: {group['key_env']} not set — skipping {len(group['models'])} models", flush=True)
+        return []
+
+    entries = []
+    for model in group["models"]:
+        print(f"\n{'='*60}", flush=True)
+        print(f"BENCHMARKING: {model}", flush=True)
+        print(f" Key: {group['key_env']} | Scenarios: {len(SCENARIOS)}", flush=True)
+        print(f"{'='*60}", flush=True)
+
+        start = time.time()
+        scores = await benchmark_model(model, key, base_url, tools)
+        elapsed = time.time() - start
+
+        # Compute averages
+        all_scores = [v["score"] for v in scores.values() if "error" not in v]
+        avg = sum(all_scores) / len(all_scores) if all_scores else 0.0
+
+        tier_avgs = {}
+        for tier in ["easy", "medium", "hard"]:
+            tier_scores = [
+                v["score"] for sid, v in scores.items()
+                if next((s["difficulty"] for s in SCENARIO_LIST if s["id"] == sid), "") == tier
+                and "error" not in v
+            ]
+            tier_avgs[tier] = sum(tier_scores) / len(tier_scores) if tier_scores else 0.0
+
+        entry = {
+            "model": model,
+            "scores": scores,
+            "overall": round(avg, 4),
+            "tier_averages": {k: round(v, 4) for k, v in tier_avgs.items()},
+            "elapsed_seconds": round(elapsed, 1),
+        }
+        entries.append(entry)
+
+        print(f"\n RESULT: {model}", flush=True)
+        print(f" Overall: {avg:.4f}", flush=True)
+        for tier, tavg in tier_avgs.items():
+            print(f" {tier}: {tavg:.4f}", flush=True)
+        print(f" Time: {elapsed:.0f}s", flush=True)
+
+    return entries
+
+
+async def main():
+    parser = argparse.ArgumentParser(description="Leaderboard benchmark — 10 models")
+    parser.add_argument("--space", required=True, help="HF Space URL")
+    parser.add_argument("--output", default="outputs/leaderboard/scores.json")
+    args = parser.parse_args()
+
+    base_url = args.space.rstrip("/")
+    print(f"Benchmarking against: {base_url}", flush=True)
+    print(f"Scenarios: {len(SCENARIOS)}", flush=True)
+    print(f"Model groups: {len(MODEL_GROUPS)} ({sum(len(g['models']) for g in MODEL_GROUPS)} total models)", flush=True)
+
+    # Discover tools from the environment
+    async with ComplianceAuditorHTTP(base_url=base_url) as env:
+        await env.reset(difficulty="easy")
+        tools_raw = await env.list_tools()
+        tools = mcp_tools_to_openai(tools_raw)
+        print(f"Tools discovered: {len(tools)}", flush=True)
+
+    # Run all groups in parallel (one per API key)
+    tasks = [benchmark_group(g, base_url, tools) for g in MODEL_GROUPS]
+    group_results = await asyncio.gather(*tasks)
+
+    # Flatten and save
+    all_entries = []
+    for group_entries in group_results:
+        all_entries.extend(group_entries)
+
+    # Sort by overall score descending
+    all_entries.sort(key=lambda e: e["overall"], reverse=True)
+
+    # Save
+    output_path = Path(args.output)
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    with open(output_path, "w") as f:
+        json.dump(all_entries, f, indent=2)
+
+    print(f"\n{'='*60}", flush=True)
+    print("LEADERBOARD RESULTS", flush=True)
+    print(f"{'='*60}", flush=True)
+    for i, entry in enumerate(all_entries, 1):
+        m = entry["model"].split("/")[-1][:30]
+        print(f" {i:2d}. {m:30s} | overall={entry['overall']:.4f} | "
+              f"easy={entry['tier_averages'].get('easy', 0):.4f} | "
+              f"medium={entry['tier_averages'].get('medium', 0):.4f} | "
+              f"hard={entry['tier_averages'].get('hard', 0):.4f}",
+              flush=True)
+
+    print(f"\nSaved to {output_path}", flush=True)
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
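The runner above spaces episodes by a fixed two-second sleep, while its docstring states the budget as 40 requests per minute per key. If the spacing had to be enforced per request rather than per episode, one option is a small minimum-interval limiter like the sketch below. This is not part of the commit: the class name and its use are hypothetical, and it assumes each model call can be preceded by an `await limiter.wait()`.

```python
import asyncio
import time


class MinIntervalLimiter:
    """Spaces calls so that at most `max_per_minute` start per minute."""

    def __init__(self, max_per_minute: float) -> None:
        self.interval = 60.0 / max_per_minute  # 40 RPM -> 1.5 s between calls
        self._lock = asyncio.Lock()
        self._next_allowed = 0.0

    async def wait(self) -> None:
        """Sleep until the next call slot is free, then reserve the one after."""
        async with self._lock:
            now = time.monotonic()
            delay = self._next_allowed - now
            if delay > 0:
                await asyncio.sleep(delay)
            # The slot just used starts at max(now, previous reservation).
            self._next_allowed = max(now, self._next_allowed) + self.interval


async def demo() -> float:
    """Make three paced calls at a fast test rate and return the elapsed time."""
    limiter = MinIntervalLimiter(max_per_minute=6000)  # 10 ms spacing
    start = time.monotonic()
    for _ in range(3):
        await limiter.wait()
    return time.monotonic() - start


if __name__ == "__main__":
    print(f"elapsed: {asyncio.run(demo()):.3f}s")
```

One limiter instance per API key would match the grouping in `MODEL_GROUPS`, since the stated limit applies per key.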
inference.py CHANGED
@@ -36,24 +36,30 @@ ENV_BASE_URL = os.getenv("OPENENV_BASE_URL") or os.getenv("ENV_URL") or os.geten
36
  MAX_STEPS = 100
37
  CONTEXT_CHAR_LIMIT = 100000
38
 
39
- SYSTEM_PROMPT = """You are an EU AI Act compliance auditor. Complete your audit in UNDER 25 tool calls.
40
-
41
- WORKFLOW (follow this EXACT sequence):
42
- 1. get_system_overview — understand the system
43
- 2. classify_system — set risk_category to: prohibited, high_risk, limited_risk, or minimal_risk
44
- 3. check_documentation — review technical docs
45
- 4. audit_training_data — check for bias
46
- 5. verify_human_oversight — check Article 14
47
- 6. check_transparency — check Article 50
48
- 7. assess_risk_management — check Article 9
49
- 8. check_logging — check Article 12
50
- 9. submit_finding — report EACH violation you found (one per call)
51
- 10. recommend_fix — propose fix for each finding
52
- 11. verify_compliance — MUST BE YOUR FINAL CALL with risk_classification, overall_assessment, key_findings_summary
53
-
54
- CRITICAL: You MUST call verify_compliance as your last tool call. Without it, you get zero score.
55
- CRITICAL: Always use tool calls, never output plain text.
56
- CRITICAL: For PROHIBITED systems classify as prohibited, submit finding, recommend shutdown, then verify_compliance immediately.
 
 
 
 
 
 
57
  """
58
 
59
 
@@ -346,7 +352,7 @@ async def run_episode(
346
 
347
  BASELINE_SCENARIOS = {
348
  "easy": ["easy_chatbot_transparency_001", "easy_recommendation_minimal_001"],
349
- "medium": ["medium_hiring_bias_001", "medium_credit_scoring_001", "medium_medical_triage_001"],
350
  "hard": ["hard_social_scoring_prohibited_001", "hard_deepfake_generation_001", "hard_multi_system_corporate_001"],
351
  }
352
 
 
36
  MAX_STEPS = 100
37
  CONTEXT_CHAR_LIMIT = 100000
38
 
39
+ SYSTEM_PROMPT = """You are an EU AI Act compliance auditor conducting a formal audit.
40
+
41
+ Your task: Investigate the AI system, identify all compliance violations, and submit a final compliance determination.
42
+
43
+ APPROACH:
44
+ - Start by understanding what the system does and who deploys it
45
+ - Classify the system's risk level under the EU AI Act (prohibited / high_risk / limited_risk / minimal_risk)
46
+ - Investigate relevant areas: documentation, training data, human oversight, transparency, risk management, logging
47
+ - Read the investigation documents CAREFULLY — violations are embedded in the evidence, not labeled for you
48
+ - Submit each violation you discover as a separate finding
49
+ - Recommend remediation for each finding
50
+ - End with verify_compliance to submit your final determination
51
+
52
+ KEY PRINCIPLES:
53
+ - PROHIBITED systems (Art. 5): social scoring, real-time biometric ID, manipulation — if detected, classify immediately and recommend shutdown
54
+ - HIGH-RISK systems (Annex III): employment, credit, healthcare, law enforcement — require full investigation of all compliance areas
55
+ - LIMITED-RISK (Art. 50): transparency obligations for chatbots, deepfakes — focus on disclosure and labeling
56
+ - MINIMAL-RISK: voluntary code of conduct only
57
+
58
+ IMPORTANT:
59
+ - You MUST call verify_compliance as your final action. Without it, you receive no score.
60
+ - Always use tool calls. Never output plain text responses.
61
+ - Red herrings exist in the evidence — not every concern is a real violation.
62
+ - Budget: aim to complete within 25 tool calls.
63
  """
64
 
65
 
 
352
 
353
  BASELINE_SCENARIOS = {
354
  "easy": ["easy_chatbot_transparency_001", "easy_recommendation_minimal_001"],
355
+ "medium": ["medium_hiring_bias_001", "medium_credit_scoring_001", "medium_medical_triage_001", "medium_emotion_recognition_workplace_001"],
356
  "hard": ["hard_social_scoring_prohibited_001", "hard_deepfake_generation_001", "hard_multi_system_corporate_001"],
357
  }
358
 
openenv.yaml CHANGED
@@ -6,11 +6,23 @@ app: server.app:app
6
  port: 7860
7
  tasks:
8
  - id: easy
9
- name: "Easy — Chatbot & Recommendation compliance"
10
  grader: server.engine.compute_reward
 
 
 
11
  - id: medium
12
- name: "Medium — Hiring AI, Credit Scoring, Medical Triage"
13
  grader: server.engine.compute_reward
 
 
 
 
 
14
  - id: hard
15
- name: "Hard — Prohibited Systems, Deepfake, Multi-System Audit"
16
  grader: server.engine.compute_reward
 
 
 
 
 
6
  port: 7860
7
  tasks:
8
  - id: easy
9
+ name: "Easy — Chatbot Transparency & Recommendation Classification"
10
  grader: server.engine.compute_reward
11
+ scenarios:
12
+ - easy_chatbot_transparency_001
13
+ - easy_recommendation_minimal_001
14
  - id: medium
15
+ name: "Medium — Hiring Bias, Credit Scoring, Medical Triage, Emotion Recognition"
16
  grader: server.engine.compute_reward
17
+ scenarios:
18
+ - medium_hiring_bias_001
19
+ - medium_credit_scoring_001
20
+ - medium_medical_triage_001
21
+ - medium_emotion_recognition_workplace_001
22
  - id: hard
23
+ name: "Hard — Prohibited Social Scoring, Deepfake Compliance, Multi-System Audit"
24
  grader: server.engine.compute_reward
25
+ scenarios:
26
+ - hard_social_scoring_prohibited_001
27
+ - hard_deepfake_generation_001
28
+ - hard_multi_system_corporate_001
scenarios/procedural.py ADDED
@@ -0,0 +1,693 @@
1
+ """
2
+ Procedural scenario generator — infinite unique compliance audit scenarios.
3
+
4
+ Combines system type templates, violation templates, and red herring templates
5
+ using seed-based randomization to produce coherent, graded scenarios that are
6
+ unique for every seed. Impossible to memorize.
7
+
8
+ Architecture:
9
+ 1. SystemTemplate — defines a category of AI system (drone delivery, exam proctoring, etc.)
10
+ 2. ViolationTemplate — a specific compliance violation with document injection text
11
+ 3. RedHerringTemplate — misleading information that isn't a real violation
12
+ 4. generate_procedural_scenario(seed, difficulty) → AuditScenario
13
+ """
14
+
15
+ from __future__ import annotations
16
+
17
+ import random
18
+ from dataclasses import dataclass
19
+ from typing import Dict, List, Tuple
20
+
21
+ from server.engine import AuditScenario, StateGraph
22
+
23
+
24
+ # ---------------------------------------------------------------------------
25
+ # Templates
26
+ # ---------------------------------------------------------------------------
27
+
28
+ @dataclass(frozen=True)
29
+ class SystemTemplate:
30
+ id: str
31
+ name_template: str # e.g. "{company} DroneGuard"
32
+ category: str # prohibited, high_risk, limited_risk, minimal_risk
33
+ annex_ref: str # which Annex III category or article
34
+ description_template: str
35
+ deployer_template: str
36
+ domain_keywords: Tuple[str, ...] = ()
37
+
38
+
39
+ @dataclass(frozen=True)
40
+ class ViolationTemplate:
41
+ id: str
42
+ tool_area: str # documentation, training_data, oversight, transparency, risk_management, logging
43
+ finding_id: str # ground truth finding string
44
+ remediation_id: str # required remediation string
45
+ doc_injection: str # text injected into the relevant document section
46
+ severity: str = "high"
47
+
48
+
49
+ @dataclass(frozen=True)
50
+ class RedHerringTemplate:
51
+ id: str
52
+ tool_area: str # which document section contains it
53
+ doc_injection: str # misleading text
54
+
55
+
56
+ # ---------------------------------------------------------------------------
57
+ # System type pool (5 types covering different AI Act categories)
58
+ # ---------------------------------------------------------------------------
59
+
60
+ SYSTEM_TEMPLATES: List[SystemTemplate] = [
61
+ SystemTemplate(
62
+ id="drone_delivery",
63
+ name_template="{company} SkyRoute Delivery AI",
64
+ category="high_risk",
65
+ annex_ref="Annex III Category 2 — Critical Infrastructure",
66
+ description_template=(
67
+ "Autonomous drone delivery system operating in urban areas across {region}. "
68
+ "AI controls flight path planning, obstacle avoidance, and delivery routing "
69
+ "for {user_count} packages per month. System makes real-time autonomous "
70
+ "decisions affecting public safety in shared airspace."
71
+ ),
72
+ deployer_template="{company} — logistics-tech startup, drone operator license in {region}.",
73
+ domain_keywords=("drone", "airspace", "safety", "autonomous", "delivery"),
74
+ ),
75
+ SystemTemplate(
76
+ id="exam_proctoring",
77
+ name_template="{company} ExamGuard AI",
78
+ category="high_risk",
79
+ annex_ref="Annex III Category 3 — Education and Vocational Training",
80
+ description_template=(
81
+ "AI-powered online exam proctoring system used by {user_count} students "
82
+ "across {region}. Monitors webcam feeds, screen activity, and audio to "
83
+ "detect cheating behavior. Automated flagging can result in exam "
84
+ "invalidation and academic disciplinary proceedings."
85
+ ),
86
+ deployer_template="{company} — EdTech company, serving 200+ universities in {region}.",
87
+ domain_keywords=("exam", "proctoring", "student", "cheating", "academic"),
88
+ ),
89
+ SystemTemplate(
90
+ id="insurance_claims",
91
+ name_template="{company} ClaimIQ Adjudicator",
92
+ category="high_risk",
93
+ annex_ref="Annex III Category 5(c) — Access to Essential Services (Insurance)",
94
+ description_template=(
95
+ "AI system that evaluates and adjudicates insurance claims for health, "
96
+ "property, and vehicle policies. Processes {user_count} claims annually "
97
+ "in {region}. Automated decisions include claim approval, denial, and "
98
+ "payout amount determination up to EUR 100,000."
99
+ ),
100
+ deployer_template="{company} — InsurTech firm, licensed in {region}, {user_count} policyholders.",
101
+ domain_keywords=("insurance", "claim", "adjudication", "payout", "policy"),
102
+ ),
103
+ SystemTemplate(
104
+ id="legal_research",
105
+ name_template="{company} LexAssist AI",
106
+ category="limited_risk",
107
+ annex_ref="Article 50 — Transparency obligations for AI interacting with persons",
108
+ description_template=(
109
+ "AI-powered legal research assistant used by law firms across {region}. "
110
+ "Analyzes case law, statutes, and regulatory documents to provide "
111
+ "research summaries and case strategy suggestions. Used by {user_count} "
112
+ "attorneys. Does not make legal decisions — advisory role only."
113
+ ),
114
+ deployer_template="{company} — LegalTech startup, SaaS platform for law firms in {region}.",
115
+ domain_keywords=("legal", "research", "case law", "attorney", "advisory"),
116
+ ),
117
+ SystemTemplate(
118
+ id="predictive_policing",
119
+ name_template="{company} SafeCity Predictor",
120
+ category="prohibited",
121
+ annex_ref="Article 5(1)(d) — Prohibited: individual predictive policing",
122
+ description_template=(
123
+ "AI system deployed by municipal police in {region} to predict individual "
124
+ "criminal behavior based on personal characteristics, social network "
125
+ "analysis, and neighborhood data. Generates 'risk scores' for {user_count} "
126
+ "residents used to allocate patrol resources and inform stop-and-search decisions."
127
+ ),
128
+ deployer_template="{company} — public safety technology vendor, contracted by 5 municipalities in {region}.",
129
+ domain_keywords=("policing", "prediction", "crime", "risk score", "patrol"),
130
+ ),
131
+ ]
132
+
133
+ # ---------------------------------------------------------------------------
134
+ # Violation pool (16 violations across all six tool areas)
135
+ # ---------------------------------------------------------------------------
136
+
137
+ VIOLATION_TEMPLATES: List[ViolationTemplate] = [
138
+ # --- Documentation violations ---
139
+ ViolationTemplate(
140
+ id="missing_annex_iv_sections",
141
+ tool_area="documentation",
142
+ finding_id="incomplete_annex_iv_documentation",
143
+ remediation_id="complete_missing_annex_iv_sections",
144
+ doc_injection=(
145
+ "ANNEX IV COMPLIANCE:\n"
146
+ " Section 2(b) Design specifications: Not documented\n"
147
+ " Section 2(d) Performance per group: Not documented\n"
148
+ " Section 2(g) Quality management: Not documented\n"
149
+ " 3 of 10 required sections are missing."
150
+ ),
151
+ ),
152
+ ViolationTemplate(
153
+ id="stale_documentation",
154
+ tool_area="documentation",
155
+ finding_id="outdated_technical_documentation",
156
+ remediation_id="update_documentation_to_current_version",
157
+ doc_injection=(
158
+ "NOTE: Core technical documentation was last updated 22 months ago\n"
159
+ "(prior to EU AI Act enforcement). It does not reference the AI Act,\n"
160
+ "harmonised standards, or current deployment configuration."
161
+ ),
162
+ ),
163
+ ViolationTemplate(
164
+ id="no_fria",
165
+ tool_area="documentation",
166
+ finding_id="missing_fundamental_rights_impact_assessment",
167
+ remediation_id="conduct_fundamental_rights_impact_assessment",
168
+ doc_injection="Fundamental Rights Impact Assessment: ABSENT — not conducted",
169
+ ),
170
+ # --- Training data violations ---
171
+ ViolationTemplate(
172
+ id="gender_bias",
173
+ tool_area="training_data",
174
+ finding_id="gender_bias_in_automated_decisions",
175
+ remediation_id="conduct_bias_audit_and_mitigation",
176
+ doc_injection=(
177
+ "OUTCOME RATES BY GENDER:\n"
178
+ " Group Rate Delta\n"
179
+ " Male 41.3% (baseline)\n"
180
+ " Female 29.7% -28.1%\n"
181
+ " Non-binary 31.2% -24.5%\n"
182
+ "\n"
183
+ " Statistically significant disparity detected (p < 0.001)."
184
+ ),
185
+ ),
186
+ ViolationTemplate(
187
+ id="age_bias",
188
+ tool_area="training_data",
189
+ finding_id="age_discrimination_in_model_outputs",
190
+ remediation_id="recalibrate_model_for_age_fairness",
191
+ doc_injection=(
192
+ "PERFORMANCE BY AGE GROUP:\n"
193
+ " Age 18-30: accuracy 94.2%\n"
194
+ " Age 31-50: accuracy 91.8%\n"
195
+ " Age 51-65: accuracy 83.4%\n"
196
+ " Age 65+: accuracy 71.9%\n"
197
+ "\n"
198
+ " Performance degrades significantly for older demographics."
199
+ ),
200
+ ),
201
+ ViolationTemplate(
202
+ id="no_data_governance",
203
+ tool_area="training_data",
204
+ finding_id="inadequate_data_governance_framework",
205
+ remediation_id="establish_article_10_data_governance",
206
+ doc_injection=(
207
+ "DATA GOVERNANCE (Article 10):\n"
208
+ " Data quality assessment: Not conducted\n"
209
+ " Bias testing protocol: Not established\n"
210
+ " Data provenance documentation: Incomplete (23 of 47 sources undocumented)\n"
211
+ " Personal data handling: No Article 10-specific provisions"
212
+ ),
213
+ ),
214
+ ViolationTemplate(
215
+ id="consent_issue",
216
+ tool_area="training_data",
217
+ finding_id="invalid_consent_for_training_data",
218
+ remediation_id="obtain_valid_consent_or_remove_data",
219
+ doc_injection=(
220
+ "CONSENT STATUS:\n"
221
+ " Data collected under employer/institutional agreement.\n"
222
+ " Individual subjects did not provide specific consent for AI\n"
223
+ " training. Under EU labor/education law, consent given as a\n"
224
+ " condition of employment/enrollment may not be freely given."
225
+ ),
226
+ ),
227
+ # --- Oversight violations ---
228
+ ViolationTemplate(
229
+ id="low_review_rate",
230
+ tool_area="oversight",
231
+ finding_id="insufficient_human_oversight_of_decisions",
232
+ remediation_id="implement_human_review_for_all_adverse_decisions",
233
+ doc_injection=(
234
+ "REVIEW STATISTICS:\n"
235
+ " Automated decisions: 482,917\n"
236
+ " Adverse outcomes: 144,875 (30.0%)\n"
237
+ " Human-reviewed: 7,244 (5.0% of adverse)\n"
238
+ " Review overrides: 362 (5.0% of reviews)\n"
239
+ "\n"
240
+ " 95% of adverse decisions receive no human review."
241
+ ),
242
+ ),
243
+ ViolationTemplate(
244
+ id="no_override",
245
+ tool_area="oversight",
246
+ finding_id="no_meaningful_override_capability",
247
+ remediation_id="implement_accessible_override_mechanism",
248
+ doc_injection=(
249
+ "OVERRIDE CAPABILITY:\n"
250
+ " Technical override exists in admin panel but is not accessible\n"
251
+ " to frontline operators. Override requires supervisor approval\n"
252
+ " and written justification. Average override processing time:\n"
253
+ " 3.2 business days. Affected individuals cannot request override."
254
+ ),
255
+ ),
256
+ ViolationTemplate(
257
+ id="no_bias_monitoring",
258
+ tool_area="oversight",
259
+ finding_id="no_ongoing_bias_monitoring",
260
+ remediation_id="implement_continuous_fairness_monitoring",
261
+ doc_injection=(
262
+ "BIAS MONITORING:\n"
263
+ " No automated fairness monitoring system in place.\n"
264
+ " No alerts configured for demographic drift.\n"
265
+ " Last manual fairness review: 14 months ago."
266
+ ),
267
+ ),
268
+ # --- Transparency violations ---
269
+ ViolationTemplate(
270
+ id="missing_ai_disclosure",
271
+ tool_area="transparency",
272
+ finding_id="missing_ai_system_disclosure",
273
+ remediation_id="implement_clear_ai_disclosure",
274
+ doc_injection=(
275
+ "USER-FACING DISCLOSURE AUDIT:\n"
276
+ " Application interface: No AI mention\n"
277
+ " Terms of Service: Generic 'automated tools' reference (Section 7)\n"
278
+ " Privacy Policy: No specific AI disclosure\n"
279
+ " Decision notifications: No mention of AI involvement\n"
280
+ "\n"
281
+ " Article 50(1) requires informing persons they interact with AI."
282
+ ),
283
+ ),
284
+ ViolationTemplate(
285
+ id="no_explanation",
286
+ tool_area="transparency",
287
+ finding_id="no_right_to_explanation_mechanism",
288
+ remediation_id="implement_individualized_explanations",
289
+ doc_injection=(
290
+ "RIGHT TO EXPLANATION:\n"
291
+ " No mechanism for affected individuals to request explanation\n"
292
+ " of AI-assisted decisions. Support team provides templated\n"
293
+ " responses listing generic factors, not individual-specific\n"
294
+ " reasoning."
295
+ ),
296
+ ),
297
+ # --- Risk management violations ---
298
+ ViolationTemplate(
299
+ id="no_conformity",
300
+ tool_area="risk_management",
301
+ finding_id="missing_conformity_assessment",
302
+ remediation_id="complete_conformity_assessment_procedure",
303
+ doc_injection=(
304
+ "CONFORMITY ASSESSMENT:\n"
305
+ " Internal assessment (Article 43): Not initiated\n"
306
+ " EU Declaration of Conformity: Not filed\n"
307
+ " CE marking: Not applied\n"
308
+ " Quality management system: Does not meet Article 17"
309
+ ),
310
+ ),
311
+ ViolationTemplate(
312
+ id="no_post_market",
313
+ tool_area="risk_management",
314
+ finding_id="missing_post_market_monitoring",
315
+ remediation_id="establish_post_market_monitoring_plan",
316
+ doc_injection=(
317
+ "POST-MARKET MONITORING (Article 72):\n"
318
+ " Post-market monitoring plan: ABSENT\n"
319
+ " Incident reporting procedures: Not established\n"
320
+ " Corrective action criteria: Not defined"
321
+ ),
322
+ ),
323
+ # --- Logging violations ---
324
+ ViolationTemplate(
325
+ id="incomplete_logging",
326
+ tool_area="logging",
327
+ finding_id="inadequate_automatic_logging",
328
+ remediation_id="implement_article_12_compliant_logging",
329
+ doc_injection=(
330
+ "ARTICLE 12 COMPLIANCE:\n"
331
+ " Input features logged: No\n"
332
+ " Confidence scores logged: No\n"
333
+ " Decision rationale logged: No\n"
334
+ " Demographic tracking: No\n"
335
+ "\n"
336
+ " Only final decisions and timestamps are recorded.\n"
337
+ " Cannot reconstruct why specific decisions were made."
338
+ ),
339
+ ),
340
+ ViolationTemplate(
341
+ id="short_retention",
342
+ tool_area="logging",
343
+ finding_id="insufficient_log_retention_period",
344
+ remediation_id="extend_retention_to_regulatory_minimum",
345
+ doc_injection=(
346
+ "LOG RETENTION:\n"
347
+ " Decision logs: 30 days (regulatory minimum varies by sector)\n"
348
+ " Audit trails: 30 days\n"
349
+ " Model versions: 90 days\n"
350
+ "\n"
351
+ " For high-risk systems, retention periods should support\n"
352
+ " retrospective audits and regulatory investigations."
353
+ ),
354
+ ),
355
+ ]
356
+
357
+ # ---------------------------------------------------------------------------
358
+ # Red herring pool
359
+ # ---------------------------------------------------------------------------
360
+
361
+ RED_HERRING_TEMPLATES: List[RedHerringTemplate] = [
362
+ RedHerringTemplate(
363
+ id="gdpr_separate",
364
+ tool_area="training_data",
365
+ doc_injection=(
366
+ "NOTE: A separate GDPR compliance audit was conducted in 2025-Q2\n"
367
+ "and identified no critical findings. GDPR compliance is assessed\n"
368
+ "independently from EU AI Act obligations."
369
+ ),
370
+ ),
371
+ RedHerringTemplate(
372
+ id="separate_compliant_system",
373
+ tool_area="training_data",
374
+ doc_injection=(
375
+ "ADDITIONAL CONTEXT: The organization operates a separate AI system\n"
376
+ "for internal reporting that shares the same data infrastructure.\n"
377
+ "That system passed its own compliance audit (2025-08) with no findings.\n"
378
+ "Its compliance has no bearing on this system's obligations."
379
+ ),
380
+ ),
381
+ RedHerringTemplate(
382
+ id="iso_certification",
383
+ tool_area="documentation",
384
+ doc_injection=(
385
+ "The deployer holds ISO 27001 and ISO 42001 certifications.\n"
386
+ "These certifications cover information security and AI management\n"
387
+ "respectively, but do not constitute EU AI Act conformity assessment."
388
+ ),
389
+ ),
390
+ RedHerringTemplate(
391
+ id="voluntary_ethics_board",
392
+ tool_area="oversight",
393
+ doc_injection=(
394
+ "The deployer has established a voluntary AI Ethics Advisory Board\n"
395
+ "that meets quarterly to review ethical considerations.\n"
396
+ "The board's recommendations are non-binding and do not substitute\n"
397
+ "for the mandatory human oversight requirements of Article 14."
398
+ ),
399
+ ),
400
+ RedHerringTemplate(
401
+ id="high_accuracy_claim",
402
+ tool_area="risk_management",
403
+ doc_injection=(
404
+ "The deployer emphasizes that the system achieves 96.3% overall\n"
405
+ "accuracy on the test benchmark, exceeding industry standards.\n"
406
+ "However, aggregate accuracy does not address per-group performance\n"
407
+ "or the specific risk management requirements of Article 9."
408
+ ),
409
+ ),
410
+ ]
411
+
412
+
413
+ # ---------------------------------------------------------------------------
414
+ # Document templates per tool area
415
+ # ---------------------------------------------------------------------------
416
+
417
+ def _base_doc_template(area: str) -> str:
418
+ """Base document structure for each investigation tool area."""
419
+ templates = {
420
+ "documentation": (
421
+ "TECHNICAL DOCUMENTATION INVENTORY\n"
422
+ "System: {system_name} {version}\n"
423
+ "Deployer: {deployer}\n"
424
+ "Audit Date: {date}\n\n"
425
+ "DOCUMENT STATUS:\n"
426
+ " System Architecture Present (updated 2025-10-15)\n"
427
+ " Algorithm Description Present (updated 2025-09-22)\n"
428
+ " Input/Output Specification Present (updated 2025-10-15)\n"
429
+ " Performance Metrics (Aggregate) Present (updated 2025-11-01)\n"
430
+ " Data Source Inventory Present (updated 2025-07-18)\n"
431
+ " Change Management Log Present (updated 2025-11-28)\n\n"
432
+ "{violations}\n\n"
433
+ "{red_herrings}"
434
+ ),
435
+ "training_data": (
436
+ "TRAINING DATA AUDIT REPORT\n"
437
+ "System: {system_name} {version}\n"
438
+ "Records: {record_count} entries ({data_period})\n\n"
439
+ "DEMOGRAPHIC DISTRIBUTION:\n"
440
+ " Category Count Pct\n"
441
+ " Male {male_count} {male_pct}%\n"
442
+ " Female {female_count} {female_pct}%\n"
443
+ " Age 18-35 {young_count} {young_pct}%\n"
444
+ " Age 36-55 {mid_count} {mid_pct}%\n"
445
+ " Age 56+ {old_count} {old_pct}%\n\n"
446
+ "{violations}\n\n"
447
+ "DATA SOURCES:\n"
448
+ " {data_source_1}\n"
449
+ " {data_source_2}\n\n"
450
+ "{red_herrings}"
451
+ ),
452
+ "oversight": (
453
+ "HUMAN OVERSIGHT PROCEDURES\n"
454
+ "System: {system_name} {version}\n"
455
+ "Department: Operations\n\n"
456
+ "DECISION WORKFLOW:\n"
457
+ " 1. Input data received and preprocessed\n"
458
+ " 2. AI model generates recommendation/decision\n"
459
+ " 3. Output delivered to end-user or downstream system\n\n"
460
+ "{violations}\n\n"
461
+ "{red_herrings}"
462
+ ),
463
+ "transparency": (
464
+ "TRANSPARENCY & DISCLOSURE REVIEW\n"
465
+ "System: {system_name} {version}\n\n"
466
+ "USER-FACING COMMUNICATIONS:\n"
467
+ " The system's user interface and documentation were reviewed\n"
468
+ " for compliance with EU AI Act transparency obligations.\n\n"
469
+ "{violations}\n\n"
470
+ "{red_herrings}"
471
+ ),
472
+ "risk_management": (
473
+ "RISK MANAGEMENT & CONFORMITY ASSESSMENT\n"
474
+ "System: {system_name} {version}\n\n"
475
+ "ANNEX III CLASSIFICATION:\n"
476
+ " {annex_ref}\n\n"
477
+ "RISK LEVEL DETERMINATION: {risk_level}\n\n"
478
+ "{violations}\n\n"
479
+ "{red_herrings}"
480
+ ),
481
+ "logging": (
482
+ "LOGGING & TRACEABILITY REVIEW\n"
483
+ "System: {system_name} {version}\n\n"
484
+ "CURRENT LOGGING IMPLEMENTATION:\n"
485
+ " Event Type Logged Retention\n"
486
+ " Application received Yes {retention}\n"
487
+ " Decision generated Yes {retention}\n"
488
+ " Model version Yes Indefinite\n\n"
489
+ "{violations}\n\n"
490
+ "{red_herrings}"
491
+ ),
492
+ }
493
+ return templates.get(area, "")
494
+
495
+
496
+ # ---------------------------------------------------------------------------
497
+ # Procedural generator
498
+ # ---------------------------------------------------------------------------
499
+
500
+ # Difficulty → violation count range
501
+ DIFFICULTY_VIOLATION_RANGE = {
502
+ "easy": (1, 2),
503
+ "medium": (3, 5),
504
+ "hard": (4, 6),
505
+ }
506
+
507
+ DIFFICULTY_RED_HERRING_RANGE = {
508
+ "easy": (0, 1),
509
+ "medium": (1, 2),
510
+ "hard": (2, 3),
511
+ }
512
+
513
+
514
+ def _build_procedural_graph(
515
+ investigation_tools: List[str],
516
+ is_prohibited: bool = False,
517
+ ) -> StateGraph:
518
+ """Build state graph for a procedural scenario (same logic as registry)."""
519
+ # Import the shared graph builder
520
+ from scenarios.registry import _build_scenario_graph
521
+ return _build_scenario_graph(investigation_tools, is_prohibited)
522
+
523
+
524
+ def generate_procedural_scenario(
525
+ seed: int,
526
+ difficulty: str = "medium",
527
+ ) -> AuditScenario:
528
+ """Generate a unique compliance audit scenario from seed.
529
+
530
+ Each seed deterministically selects a combination of system type, violations,
531
+ red herrings, and document content. The ground truth, state graph, and
532
+ reward computation are all coherent and valid.
533
+
534
+ Args:
535
+ seed: Random seed for reproducible generation.
536
+ difficulty: "easy", "medium", or "hard".
537
+
538
+ Returns:
539
+ A fully populated AuditScenario ready for use.
540
+ """
541
+ rng = random.Random(seed)
542
+
543
+ # 1. Pick system type
544
+ if difficulty == "easy":
545
+ candidates = [s for s in SYSTEM_TEMPLATES if s.category in ("limited_risk", "minimal_risk")]
546
+ if not candidates:
547
+ candidates = list(SYSTEM_TEMPLATES)  # fall back to the full pool
548
+ elif difficulty == "hard":
549
+ candidates = [s for s in SYSTEM_TEMPLATES if s.category in ("prohibited", "high_risk")]
550
+ else:
551
+ candidates = list(SYSTEM_TEMPLATES)
552
+ system = rng.choice(candidates)
553
+
554
+ # 2. Pick violations
555
+ min_v, max_v = DIFFICULTY_VIOLATION_RANGE[difficulty]
556
+ n_violations = rng.randint(min_v, max_v)
557
+ available_violations = list(VIOLATION_TEMPLATES)
558
+ rng.shuffle(available_violations)
559
+ violations = available_violations[:n_violations]
560
+
561
+ # 3. Pick red herrings
562
+ min_r, max_r = DIFFICULTY_RED_HERRING_RANGE[difficulty]
563
+ n_red_herrings = rng.randint(min_r, max_r)
564
+ available_red_herrings = list(RED_HERRING_TEMPLATES)
565
+ rng.shuffle(available_red_herrings)
566
+ red_herrings = available_red_herrings[:n_red_herrings]
567
+
568
+ # 4. Generate randomized parameters
569
+ company_names = [
570
+ "TechNova Solutions", "QuantumLeap AI", "NeuralPath Inc",
571
+ "DataForge Systems", "CogniTech Labs", "AlphaWave AI",
572
+ "SynthMind Corp", "PrismAI Technologies", "Vertex Analytics",
573
+ "OmniSense AI", "DeepCurrent Inc", "StrataLogic Systems",
574
+ "AeroMind Labs", "CyberPulse Inc", "InnoVista AI",
575
+ ]
576
+ regions = ["EU-West (DE/FR/NL)", "EU-Central (DE/AT/CH)", "EU-North (SE/FI/DK)",
577
+ "EU-South (IT/ES/PT)", "EU-East (PL/CZ/RO)"]
578
+
579
+ company = rng.choice(company_names)
580
+ region = rng.choice(regions)
581
+ version = f"v{rng.randint(1,6)}.{rng.randint(0,9)}"
582
+ date = f"2026-{rng.randint(1,3):02d}-{rng.randint(1,28):02d}"
583
+ user_count = f"{rng.randint(10000, 5000000):,}"
584
+
585
+ system_name = system.name_template.format(company=company)
586
+ deployer = system.deployer_template.format(
587
+ company=company, region=region, user_count=user_count
588
+ )
589
+ description = system.description_template.format(
590
+ company=company, region=region, user_count=user_count
591
+ )
592
+
593
+ # 5. Group violations and red herrings by tool area
594
+ area_violations: Dict[str, List[str]] = {}
595
+ area_red_herrings: Dict[str, List[str]] = {}
596
+
597
+ for v in violations:
598
+ area_violations.setdefault(v.tool_area, []).append(v.doc_injection)
599
+ for r in red_herrings:
600
+ area_red_herrings.setdefault(r.tool_area, []).append(r.doc_injection)
601
+
602
+ # 6. Generate documents
603
+ fill_params = {
604
+ "system_name": system_name,
605
+ "version": version,
606
+ "deployer": deployer,
607
+ "date": date,
608
+ "annex_ref": system.annex_ref,
609
+ "risk_level": system.category.replace("_", " ").title(),
610
+ "record_count": f"{rng.randint(100000, 5000000):,}",
611
+ "data_period": f"20{rng.randint(19,23)}-2025",
612
+ "male_count": f"{rng.randint(400000, 800000):,}",
613
+ "male_pct": f"{rng.uniform(55, 68):.1f}",
614
+ "female_count": f"{rng.randint(200000, 500000):,}",
615
+ "female_pct": f"{rng.uniform(32, 45):.1f}",
616
+ "young_count": f"{rng.randint(200000, 400000):,}",
617
+ "young_pct": f"{rng.uniform(28, 40):.1f}",
618
+ "mid_count": f"{rng.randint(300000, 500000):,}",
619
+ "mid_pct": f"{rng.uniform(35, 48):.1f}",
620
+ "old_count": f"{rng.randint(50000, 200000):,}",
621
+ "old_pct": f"{rng.uniform(12, 25):.1f}",
622
+ "data_source_1": f"Primary: {rng.choice(['Enterprise API exports', 'Partner platform data', 'Direct user submissions'])}",
623
+ "data_source_2": f"Secondary: {rng.choice(['Public datasets (filtered)', 'Licensed commercial data', 'Internal test data'])}",
624
+ "retention": rng.choice(["5 years", "7 years", "3 years", "10 years"]),
625
+ }
626
+
627
+ def _build_doc(area: str) -> str:
628
+ template = _base_doc_template(area)
629
+ v_text = "\n\n".join(area_violations.get(area, ["(No issues identified in this area.)"]))
630
+ r_text = "\n\n".join(area_red_herrings.get(area, [""]))
631
+ filled = template.format(violations=v_text, red_herrings=r_text, **fill_params)
632
+ return filled
633
+
634
+ docs = {
635
+ "documentation_data": _build_doc("documentation"),
636
+ "training_data_info": _build_doc("training_data"),
637
+ "oversight_info": _build_doc("oversight"),
638
+ "transparency_info": _build_doc("transparency"),
639
+ "risk_assessment_info": _build_doc("risk_management"),
640
+ "logging_info": _build_doc("logging"),
641
+ }
642
+
643
+ # 7. Determine investigation tools (areas that have violations)
644
+ affected_areas = set(v.tool_area for v in violations)
645
+ tool_map = {
646
+ "documentation": "check_documentation",
647
+ "training_data": "audit_training_data",
648
+ "oversight": "verify_human_oversight",
649
+ "transparency": "check_transparency",
650
+ "risk_management": "assess_risk_management",
651
+ "logging": "check_logging",
652
+ }
653
+ investigation_tools = [tool_map[a] for a in [
654
+ "documentation", "training_data", "oversight",
655
+ "transparency", "risk_management", "logging"
656
+ ] if a in affected_areas]
657
+
658
+ # Ensure at least 2 investigation tools for meaningful audit
659
+ if len(investigation_tools) < 2:
660
+ extras = ["check_documentation", "check_transparency"]
661
+ for e in extras:
662
+ if e not in investigation_tools:
663
+ investigation_tools.append(e)
664
+ if len(investigation_tools) >= 2:
665
+ break
666
+
667
+ # 8. Build the scenario
668
+ scenario = AuditScenario(
669
+ scenario_id=f"procedural_{difficulty}_{seed:06d}",
670
+ title=f"Procedural: {system_name} ({difficulty.title()})",
671
+ difficulty=difficulty,
672
+ description=description,
673
+ system_name=system_name,
674
+ system_description=description,
675
+ system_category=system.category,
676
+ deployer_info=deployer,
677
+ correct_classification=system.category,
678
+ ground_truth_findings=[v.finding_id for v in violations],
679
+ required_remediation=[v.remediation_id for v in violations],
680
+ red_herrings=[r.id for r in red_herrings],
681
+ **docs,
682
+ )
683
+
684
+ # 9. Build state graph
685
+ scenario.graph = _build_procedural_graph(
686
+ investigation_tools=investigation_tools,
687
+ is_prohibited=(system.category == "prohibited"),
688
+ )
689
+
690
+ # 10. Randomize (adds company/region/version params)
691
+ scenario.randomize(seed)
692
+
693
+ return scenario
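Step 6 of `generate_procedural_scenario` draws every demographic count and percentage independently, so `male_pct` and `female_pct` need not sum to 100 and the counts need not add up to `record_count`. The following is a minimal sketch of one way to keep them consistent, assuming a hypothetical helper `demographic_params` (not part of this commit) that derives every figure from a single total. The share ranges are borrowed from the `rng.uniform` calls above, except that the oldest age band simply takes the remainder.

```python
import random
from typing import Dict


def demographic_params(rng: random.Random, record_count: int) -> Dict[str, str]:
    """Hypothetical helper: derive consistent demographic fill values.

    Counts within each breakdown sum exactly to record_count, and the
    percentages are computed from those counts rather than drawn separately.
    """

    def pct(n: int) -> str:
        return f"{100 * n / record_count:.1f}"

    # Gender split: draw one share, give the remainder to the other group.
    male = round(record_count * rng.uniform(0.55, 0.68))
    female = record_count - male

    # Age split: draw two shares; the oldest band takes what is left
    # (between 12% and 37% of records with these ranges).
    young = round(record_count * rng.uniform(0.28, 0.40))
    mid = round(record_count * rng.uniform(0.35, 0.48))
    old = record_count - young - mid

    counts = {"male": male, "female": female, "young": young, "mid": mid, "old": old}
    params: Dict[str, str] = {"record_count": f"{record_count:,}"}
    for name, n in counts.items():
        params[f"{name}_count"] = f"{n:,}"
        params[f"{name}_pct"] = pct(n)
    return params


rng = random.Random(42)
params = demographic_params(rng, rng.randint(100000, 5000000))
```

The returned keys match the placeholders in the `training_data` template, so the dictionary could be merged into `fill_params` with `fill_params.update(params)`. Because percentages are rounded to one decimal place, their sum can still differ from 100 by 0.1.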
scenarios/registry.py CHANGED
The diff for this file is too large to render. See raw diff
 
server/engine.py CHANGED
@@ -149,13 +149,21 @@ class AuditScenario:
  required_remediation: List[str] = field(default_factory=list)
  red_herrings: List[str] = field(default_factory=list)

- # Tool-specific data (returned when agent calls tools)
- documentation_data: Dict[str, Any] = field(default_factory=dict)
- training_data_info: Dict[str, Any] = field(default_factory=dict)
- oversight_info: Dict[str, Any] = field(default_factory=dict)
- transparency_info: Dict[str, Any] = field(default_factory=dict)
- risk_assessment_info: Dict[str, Any] = field(default_factory=dict)
- logging_info: Dict[str, Any] = field(default_factory=dict)
+ # Investigation documents (rich text requiring analysis — no pre-digested verdicts)
+ documentation_data: str = ""
+ training_data_info: str = ""
+ oversight_info: str = ""
+ transparency_info: str = ""
+ risk_assessment_info: str = ""
+ logging_info: str = ""
+
+ # Deep-dive documents (revealed on repeat tool calls — adaptive depth)
+ deep_documentation: str = ""
+ deep_training_data: str = ""
+ deep_oversight: str = ""
+ deep_transparency: str = ""
+ deep_risk_assessment: str = ""
+ deep_logging: str = ""

  # Randomization parameters (re-rolled on each reset)
  _rand_params: Dict[str, str] = field(default_factory=dict)
@@ -176,6 +184,9 @@ class AuditScenario:
  "company": rng.choice(company_names),
  "region": rng.choice(regions),
  "version": rng.choice(versions),
+ "date": f"2026-{rng.randint(1,3):02d}-{rng.randint(1,28):02d}",
+ "usercount": f"{rng.randint(10000, 5000000):,}",
+ # Keep old keys for backwards compat with get_param()
  "deployment_date": f"2026-{rng.randint(1,3):02d}-{rng.randint(1,28):02d}",
  "user_count": str(rng.randint(10000, 5000000)),
  }
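In the hunk above, `date` and `deployment_date` (and likewise `usercount` and `user_count`) each make their own `rng` calls, so the new and old keys can hold different values within one episode. If the old keys are meant as backwards-compatible aliases, one option is to draw each value once and reuse it. The sketch below assumes that intent; `build_rand_params` is a hypothetical stand-in for the relevant part of the randomization method, and the company, region, and version keys are omitted because their source lists are not shown in this diff.

```python
import random
from typing import Dict


def build_rand_params(seed: int) -> Dict[str, str]:
    """Hypothetical sketch: draw each value once, expose it under both keys."""
    rng = random.Random(seed)

    # One draw per underlying quantity.
    date = f"2026-{rng.randint(1, 3):02d}-{rng.randint(1, 28):02d}"
    users = rng.randint(10000, 5000000)

    return {
        # New keys used by the document templates.
        "date": date,
        "usercount": f"{users:,}",
        # Old keys kept for get_param() callers, now guaranteed to agree.
        "deployment_date": date,
        "user_count": str(users),
    }


params = build_rand_params(123)
```

With this shape, the overview tool and the generated documents would report the same date and user count for a given seed, differing only in number formatting.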
server/environment.py CHANGED
@@ -1,11 +1,18 @@
1
  """
2
  EU AI Act Compliance Auditor — MCP Environment.
3
 
4
- Registers 10 MCP tools that the agent uses to audit AI systems for EU AI Act
5
- compliance. State-graph tracks audit progress. Terminal reward computed on
6
- verify_compliance with 6-component scoring.
7
-
8
- Tools:
9
  Investigation: get_system_overview, classify_system, check_documentation,
10
  audit_training_data, verify_human_oversight, check_transparency,
11
  assess_risk_management, check_logging
@@ -73,6 +80,7 @@ class ComplianceAuditorEnvironment(Environment):
73
  self._findings_submitted: List[str] = []
74
  self._remediation_submitted: List[str] = []
75
  self._discovered_info: Dict[str, bool] = {}
 
76
 
77
  # Progress tracking for state graph
78
  self._max_progress_depth: int = 0
@@ -251,6 +259,87 @@ class ComplianceAuditorEnvironment(Environment):
251
  step_count=self._step_count,
252
  )
253

254
  # ------------------------------------------------------------------
255
  # Tool implementations
256
  # ------------------------------------------------------------------
@@ -313,19 +402,32 @@ class ComplianceAuditorEnvironment(Environment):
313
  outcome = self._advance_state("get_system_overview")
314
 
315
  s = self._scenario
316
- result = {
317
- "system_name": s.system_name,
318
- "description": s.system_description,
319
- "deployer": s.deployer_info,
320
- "category_claim": s.system_category if s.difficulty == "easy" else "To be determined by auditor",
321
- "deployment_date": s.get_param("deployment_date"),
322
- "region": s.get_param("region"),
323
- "user_count": s.get_param("user_count"),
324
- "company": s.get_param("company"),
325
- "version": s.get_param("version"),
326
  "queries_remaining": QUERY_BUDGET - self._queries_used,
327
- }
328
- return json.dumps(result, indent=2)
329
 
330
  def _tool_classify_system(self, risk_category: str) -> str:
331
  budget_err = self._use_query()
@@ -350,17 +452,105 @@ class ComplianceAuditorEnvironment(Environment):
350
  "queries_remaining": QUERY_BUDGET - self._queries_used,
351
  })
352
 
353
  def _tool_check_documentation(self) -> str:
354
  budget_err = self._use_query()
355
  if budget_err:
356
  return budget_err
357
  self._discovered_info["documentation"] = True
358
  self._observation_after_investigation += 1
359
- outcome = self._advance_state("check_documentation")
360
- return json.dumps({
361
- "documentation_review": self._scenario.documentation_data,
362
- "queries_remaining": QUERY_BUDGET - self._queries_used,
363
- }, indent=2)
364
 
365
  def _tool_audit_training_data(self) -> str:
366
  budget_err = self._use_query()
@@ -368,11 +558,8 @@ class ComplianceAuditorEnvironment(Environment):
368
  return budget_err
369
  self._discovered_info["training_data"] = True
370
  self._observation_after_investigation += 1
371
- outcome = self._advance_state("audit_training_data")
372
- return json.dumps({
373
- "training_data_audit": self._scenario.training_data_info,
374
- "queries_remaining": QUERY_BUDGET - self._queries_used,
375
- }, indent=2)
376
 
377
  def _tool_verify_human_oversight(self) -> str:
378
  budget_err = self._use_query()
@@ -380,11 +567,8 @@ class ComplianceAuditorEnvironment(Environment):
380
  return budget_err
381
  self._discovered_info["oversight"] = True
382
  self._observation_after_investigation += 1
383
- outcome = self._advance_state("verify_human_oversight")
384
- return json.dumps({
385
- "oversight_assessment": self._scenario.oversight_info,
386
- "queries_remaining": QUERY_BUDGET - self._queries_used,
387
- }, indent=2)
388
 
389
  def _tool_check_transparency(self) -> str:
390
  budget_err = self._use_query()
@@ -392,11 +576,8 @@ class ComplianceAuditorEnvironment(Environment):
392
  return budget_err
393
  self._discovered_info["transparency"] = True
394
  self._observation_after_investigation += 1
395
- outcome = self._advance_state("check_transparency")
396
- return json.dumps({
397
- "transparency_assessment": self._scenario.transparency_info,
398
- "queries_remaining": QUERY_BUDGET - self._queries_used,
399
- }, indent=2)
400
 
401
  def _tool_assess_risk_management(self) -> str:
402
  budget_err = self._use_query()
@@ -404,11 +585,8 @@ class ComplianceAuditorEnvironment(Environment):
404
  return budget_err
405
  self._discovered_info["risk_management"] = True
406
  self._observation_after_investigation += 1
407
- outcome = self._advance_state("assess_risk_management")
408
- return json.dumps({
409
- "risk_assessment": self._scenario.risk_assessment_info,
410
- "queries_remaining": QUERY_BUDGET - self._queries_used,
411
- }, indent=2)
412
 
413
  def _tool_check_logging(self) -> str:
414
  budget_err = self._use_query()
@@ -416,11 +594,8 @@ class ComplianceAuditorEnvironment(Environment):
416
  return budget_err
417
  self._discovered_info["logging"] = True
418
  self._observation_after_investigation += 1
419
- outcome = self._advance_state("check_logging")
420
- return json.dumps({
421
- "logging_assessment": self._scenario.logging_info,
422
- "queries_remaining": QUERY_BUDGET - self._queries_used,
423
- }, indent=2)
424
 
425
  def _tool_submit_finding(self, finding: str, severity: str = "high") -> str:
426
  budget_err = self._use_query()
@@ -428,12 +603,58 @@ class ComplianceAuditorEnvironment(Environment):
428
  return budget_err
429
  self._findings_submitted.append(finding.lower().strip())
430
  outcome = self._advance_state("submit_finding")
431
- return json.dumps({
432
  "finding_recorded": finding,
433
  "severity": severity,
434
  "total_findings": len(self._findings_submitted),
435
  "queries_remaining": QUERY_BUDGET - self._queries_used,
436
- })
 
 
 
437
 
438
  def _tool_recommend_fix(self, finding: str, remediation: str, priority: int = 1) -> str:
439
  budget_err = self._use_query()
@@ -473,16 +694,53 @@ class ComplianceAuditorEnvironment(Environment):
473
 
474
  self._reward = breakdown.total()
475
 
476
- return json.dumps({
477
  "done": True,
478
  "reward": self._reward,
479
- "assessment_recorded": overall_assessment[:200],
480
  "reward_breakdown": breakdown.to_dict(),
481
- "findings_submitted": len(self._findings_submitted),
482
- "remediations_submitted": len(self._remediation_submitted),
483
- "queries_used": self._queries_used,
484
- "episode_duration_seconds": round(time.time() - self._start_time, 1),
485
- }, indent=2)
486
 
487
  def close(self) -> None:
488
  pass
 
1
  """
2
  EU AI Act Compliance Auditor — MCP Environment.
3
 
4
+ Investigation-grade environment where LLM agents audit AI systems for EU AI Act
5
+ compliance. Tools return realistic regulatory documents (30-70 lines each) requiring
6
+ genuine analysis — no pre-digested verdicts.
7
+
8
+ Key features:
9
+ - Adaptive depth: repeat tool calls reveal forensic deep-dive content
10
+ - Dynamic state: environment responds to findings and remediation proposals
11
+ - Evidence chain validation: warns when findings lack supporting investigation
12
+ - 6-component terminal reward with anti-gaming safeguards (covered by 12 adversarial tests)
13
+ - 6 unique state graph topologies across 9 scenarios
14
+
15
+ Tools (11):
16
  Investigation: get_system_overview, classify_system, check_documentation,
17
  audit_training_data, verify_human_oversight, check_transparency,
18
  assess_risk_management, check_logging
 
80
  self._findings_submitted: List[str] = []
81
  self._remediation_submitted: List[str] = []
82
  self._discovered_info: Dict[str, bool] = {}
83
+ self._tool_call_counts: Dict[str, int] = {} # track repeat calls per tool
84
 
85
  # Progress tracking for state graph
86
  self._max_progress_depth: int = 0
 
259
  step_count=self._step_count,
260
  )
261
 
262
+ # ------------------------------------------------------------------
263
+ # Document rendering
264
+ # ------------------------------------------------------------------
265
+
266
+ def _render_doc(self, template: str) -> str:
267
+ """Replace __PLACEHOLDER__ tokens with randomized scenario params
268
+ and inject seed-based noise for truly unique documents per episode."""
269
+ result = template
270
+ if self._scenario and self._scenario._rand_params:
271
+ for key, val in self._scenario._rand_params.items():
272
+ result = result.replace(f"__{key.upper()}__", str(val))
273
+
274
+ # Seed-based noise injection: vary specific numbers slightly
275
+ # so no two episodes produce identical documents
276
+ if self._scenario:
277
+ rng = random.Random(str(self._episode_id))  # a str seed is reproducible across processes; hash() of a str is salted per interpreter run
278
+ result = self._inject_noise(result, rng)
279
+ return result
280
+
281
+ def _inject_noise(self, text: str, rng: random.Random) -> str:
282
+ """Inject seed-based perturbations into document text.
283
+
284
+ Varies percentages and counts slightly (within realistic ranges)
285
+ to ensure every episode is genuinely unique, not just parameter swaps.
286
+ The violations remain detectable but exact numbers change.
287
+ """
288
+ import re
289
+
290
+ def _perturb_pct(match: re.Match) -> str:
291
+ """Perturb a percentage value by +-2 percentage points."""
292
+ val = float(match.group(1))
293
+ delta = rng.uniform(-2.0, 2.0)
294
+ new_val = max(0.1, min(99.9, val + delta))
295
+ return f"{new_val:.1f}%"
296
+
297
+ def _perturb_count(match: re.Match) -> str:
298
+ """Perturb a large count by +-5%."""
299
+ val = int(match.group(1).replace(",", ""))
300
+ if val < 100:
301
+ return match.group(0) # don't perturb small numbers
302
+ delta = rng.uniform(-0.05, 0.05)
303
+ new_val = int(val * (1 + delta))
304
+ if val >= 1000:
305
+ return f"{new_val:,}"
306
+ return str(new_val)
307
+
308
+ # Perturb percentages (e.g., "34.2%" -> "35.1%")
309
+ text = re.sub(r'(\d{1,2}\.\d)%', _perturb_pct, text)
310
+
311
+ # Perturb large counts (e.g., "1,342,104" -> "1,378,921")
312
+ text = re.sub(r'(\d{1,3}(?:,\d{3})+)', _perturb_count, text)
313
+
314
+ return text
315
+
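The noise-injection step above can be sketched in isolation. This is a minimal illustration, not the environment's own method: `perturb_percentages`, its `seed` argument, and the sample sentence are hypothetical names and data chosen for the example. It assumes the goal is that the same episode seed always yields the same perturbed document.

```python
import random
import re


def perturb_percentages(text: str, seed: str, max_delta: float = 2.0) -> str:
    """Shift every 'NN.N%' value by at most +/- max_delta percentage points.

    Hypothetical standalone helper written for illustration only.
    """
    # Seeding with a str is deterministic across processes, whereas the
    # built-in hash() of a str can change between interpreter runs.
    rng = random.Random(seed)

    def _shift(match: re.Match) -> str:
        value = float(match.group(1))
        shifted = value + rng.uniform(-max_delta, max_delta)
        # Clamp so the result still looks like a plausible percentage.
        return f"{max(0.1, min(99.9, shifted)):.1f}%"

    return re.sub(r"(\d{1,2}\.\d)%", _shift, text)


sample = "Female applicants 26.3% vs. male applicants 34.2%"
print(perturb_percentages(sample, seed="episode-001"))
```

Because each value is perturbed independently, derived figures in a document (for example a relative gap computed from two rates) will generally no longer agree exactly with their inputs after perturbation.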
316
+ def _audit_progress_section(self) -> str:
317
+ """Dynamic audit progress appended to tool responses.
318
+
319
+ After the agent starts submitting findings, this section appears
320
+ in subsequent tool responses, showing what's been found so far.
321
+ Makes the environment feel responsive and alive.
322
+ """
323
+ parts = []
324
+ if self._classification_submitted:
325
+ parts.append(f" Classification submitted: {self._classification_submitted.replace('_', ' ').title()}")
326
+ if self._findings_submitted:
327
+ parts.append(f" Findings submitted: {len(self._findings_submitted)}")
328
+ for i, f in enumerate(self._findings_submitted[-3:], 1):
329
+ parts.append(f" {i}. {f[:80]}")
330
+ if self._remediation_submitted:
331
+ parts.append(f" Remediations proposed: {len(self._remediation_submitted)}")
332
+ areas = []
333
+ for area, checked in self._discovered_info.items():
334
+ if checked:
335
+ areas.append(area)
336
+ if areas:
337
+ parts.append(f" Areas investigated: {', '.join(areas)}")
338
+
339
+ if not parts:
340
+ return ""
341
+ return "\n\nAUDIT PROGRESS:\n" + "\n".join(parts)
342
+
343
  # ------------------------------------------------------------------
344
  # Tool implementations
345
  # ------------------------------------------------------------------
 
402
  outcome = self._advance_state("get_system_overview")
403
 
404
  s = self._scenario
405
+ # Build investigation brief
406
+ brief = (
407
+ f"COMPLIANCE AUDIT ASSIGNMENT\n"
408
+ f"{'=' * 60}\n"
409
+ f"System: {s.system_name} {s.get_param('version')}\n"
410
+ f"Deployer: {self._render_doc(s.deployer_info)}\n"
411
+ f"Region: {s.get_param('region')}\n"
412
+ f"Users: {s.get_param('user_count')}\n"
413
+ f"Deployed: {s.get_param('deployment_date')}\n"
414
+ f"{'=' * 60}\n\n"
415
+ f"SYSTEM DESCRIPTION:\n"
416
+ f" {self._render_doc(s.system_description)}\n\n"
417
+ f"DEPLOYER'S CLAIMED RISK CATEGORY: "
418
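+ f"{(chr(34) + s.system_category.replace('_', ' ').title() + chr(34)) if s.difficulty == 'easy' else 'Not provided — to be determined by auditor'}\n\n"  # chr(34) avoids a backslash inside the f-string expression, which is a SyntaxError before Python 3.12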
419
+ f"AUDIT SCOPE:\n"
420
+ f" Conduct a thorough compliance audit under the EU AI Act.\n"
421
+ f" Investigate documentation, training data, human oversight,\n"
422
+ f" transparency, risk management, and logging as applicable.\n"
423
+ f" Submit findings and remediation recommendations.\n\n"
424
+ f" Tools available: {QUERY_BUDGET - self._queries_used} queries remaining."
425
+ )
426
+ return json.dumps({
427
+ "document_type": "System Overview & Audit Assignment",
428
+ "content": brief,
429
  "queries_remaining": QUERY_BUDGET - self._queries_used,
430
+ }, indent=2)
 
431
 
432
  def _tool_classify_system(self, risk_category: str) -> str:
433
  budget_err = self._use_query()
 
452
  "queries_remaining": QUERY_BUDGET - self._queries_used,
453
  })
454
 
455
+ def _remediation_overlay(self, area: str) -> str:
456
+ """Generate post-remediation overlay content for a re-investigated area.
457
+
458
+ When the agent recommends fixes and then re-checks a tool, the
459
+ environment shows how the proposed remediation would affect the area.
460
+ This makes the environment feel like a living system that responds
461
+ to the agent's actions.
462
+ """
463
+ if not self._remediation_submitted:
464
+ return ""
465
+
466
+ # Map areas to relevant remediation keywords
467
+ area_keywords = {
468
+ "documentation": ["documentation", "annex_iv", "technical_doc", "document"],
469
+ "training_data": ["bias", "audit", "data_governance", "training", "demographic"],
470
+ "oversight": ["human_review", "oversight", "human_oversight", "monitor"],
471
+ "transparency": ["disclosure", "transparency", "labeling", "notification"],
472
+ "risk_management": ["conformity", "risk_management", "assessment", "risk"],
473
+ "logging": ["logging", "traceability", "audit_trail", "record"],
474
+ }
475
+
476
+ relevant_remediations = []
477
+ keywords = area_keywords.get(area, [])
478
+ for rem in self._remediation_submitted:
479
+ if any(kw in rem for kw in keywords):
480
+ relevant_remediations.append(rem)
481
+
482
+ if not relevant_remediations:
483
+ return ""
484
+
485
+ lines = ["\n\nREMEDIATION STATUS UPDATE:"]
486
+ lines.append(" The following remediation actions have been proposed for this area:")
487
+ for i, rem in enumerate(relevant_remediations, 1):
488
+ lines.append(f" {i}. {rem}")
489
+ lines.append(" Status: PROPOSED (pending implementation)")
490
+ lines.append(" Note: These are recommendations only. Re-investigation reflects")
491
+ lines.append(" the current pre-remediation state of the system.")
492
+ return "\n".join(lines)
493
+
494
+ def _get_deep_content(self, area: str) -> str:
495
+ """Get deep-dive content for repeat investigation calls."""
496
+ deep_map = {
497
+ "documentation": self._scenario.deep_documentation,
498
+ "training_data": self._scenario.deep_training_data,
499
+ "oversight": self._scenario.deep_oversight,
500
+ "transparency": self._scenario.deep_transparency,
501
+ "risk_management": self._scenario.deep_risk_assessment,
502
+ "logging": self._scenario.deep_logging,
503
+ }
504
+ return deep_map.get(area, "")
505
+
506
+ def _investigation_response(self, doc_type: str, content: str, area: str = "") -> str:
507
+ """Standard response format for investigation tools with dynamic state.
508
+
509
+ Features adaptive depth: repeat calls to the same tool reveal deeper
510
+ forensic analysis, additional statistics, and drill-down detail that
511
+ wasn't visible on the first pass.
512
+ """
513
+ # Track call count for adaptive depth
514
+ tool_key = area or doc_type
515
+ self._tool_call_counts[tool_key] = self._tool_call_counts.get(tool_key, 0) + 1
516
+ call_count = self._tool_call_counts[tool_key]
517
+
518
+ rendered = self._render_doc(content)
519
+
520
+ # Adaptive depth: on repeat calls, append deep-dive content
521
+ if call_count >= 2 and area:
522
+ deep = self._get_deep_content(area)
523
+ if deep:
524
+ rendered += "\n\n" + self._render_doc(deep)
525
+
526
+ # Add remediation overlay if agent has proposed fixes for this area
527
+ overlay = self._remediation_overlay(area)
528
+ if overlay:
529
+ rendered += overlay
530
+
531
+ # Add audit progress section
532
+ progress = self._audit_progress_section()
533
+ if progress:
534
+ rendered += progress
535
+
536
+ result = {
537
+ "document_type": doc_type,
538
+ "content": rendered,
539
+ "queries_remaining": QUERY_BUDGET - self._queries_used,
540
+ }
541
+ if call_count >= 2:
542
+ result["note"] = "DEEP DIVE: Additional forensic detail revealed on re-investigation."
543
+
544
+ return json.dumps(result, indent=2)
545
+
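The adaptive-depth behaviour described above (a first call returns the base document, repeat calls append deep-dive content) can be sketched as a small standalone class. `AdaptiveDocs`, `fetch`, and the sample documents are hypothetical names and data for illustration. Unlike the method above, this variant sets the deep-dive flag only when extra content actually exists for the area.

```python
from collections import Counter
from typing import Dict


class AdaptiveDocs:
    """Return a base document first, then append deep-dive text on repeats."""

    def __init__(self, base: Dict[str, str], deep: Dict[str, str]) -> None:
        self._base = base
        self._deep = deep
        self._calls: Counter = Counter()  # per-area call counts

    def fetch(self, area: str) -> Dict[str, object]:
        self._calls[area] += 1
        count = self._calls[area]
        content = self._base[area]
        deep = self._deep.get(area, "")
        # Reveal deep content only from the second call onward.
        revealed = count >= 2 and bool(deep)
        if revealed:
            content = f"{content}\n\n{deep}"
        return {"content": content, "deep_dive": revealed, "call": count}


docs = AdaptiveDocs(
    base={"logging": "Retention policy: 90 days."},
    deep={"logging": "Sampled log gaps: 14 days missing in Q3."},
)
first = docs.fetch("logging")
second = docs.fetch("logging")
print(first["deep_dive"], second["deep_dive"])
```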
546
  def _tool_check_documentation(self) -> str:
547
  budget_err = self._use_query()
548
  if budget_err:
549
  return budget_err
550
  self._discovered_info["documentation"] = True
551
  self._observation_after_investigation += 1
552
+ self._advance_state("check_documentation")
553
+ return self._investigation_response("Technical Documentation Review", self._scenario.documentation_data, "documentation")
 
 
 
554
 
555
  def _tool_audit_training_data(self) -> str:
556
  budget_err = self._use_query()
 
558
  return budget_err
559
  self._discovered_info["training_data"] = True
560
  self._observation_after_investigation += 1
561
+ self._advance_state("audit_training_data")
562
+ return self._investigation_response("Training Data Audit Report", self._scenario.training_data_info, "training_data")
 
 
 
563
 
564
  def _tool_verify_human_oversight(self) -> str:
565
  budget_err = self._use_query()
 
567
  return budget_err
568
  self._discovered_info["oversight"] = True
569
  self._observation_after_investigation += 1
570
+ self._advance_state("verify_human_oversight")
571
+ return self._investigation_response("Human Oversight Assessment", self._scenario.oversight_info, "oversight")
 
 
 
572
 
573
  def _tool_check_transparency(self) -> str:
574
  budget_err = self._use_query()
 
576
  return budget_err
577
  self._discovered_info["transparency"] = True
578
  self._observation_after_investigation += 1
579
+ self._advance_state("check_transparency")
580
+ return self._investigation_response("Transparency & Disclosure Review", self._scenario.transparency_info, "transparency")
 
 
 
581
 
582
  def _tool_assess_risk_management(self) -> str:
583
  budget_err = self._use_query()
 
585
  return budget_err
586
  self._discovered_info["risk_management"] = True
587
  self._observation_after_investigation += 1
588
+ self._advance_state("assess_risk_management")
589
+ return self._investigation_response("Risk Management & Conformity Assessment", self._scenario.risk_assessment_info, "risk_management")
 
 
 
590
 
591
  def _tool_check_logging(self) -> str:
592
  budget_err = self._use_query()
 
594
  return budget_err
595
  self._discovered_info["logging"] = True
596
  self._observation_after_investigation += 1
597
+ self._advance_state("check_logging")
598
+ return self._investigation_response("Logging & Traceability Review", self._scenario.logging_info, "logging")
 
 
 
599
 
600
  def _tool_submit_finding(self, finding: str, severity: str = "high") -> str:
601
  budget_err = self._use_query()
 
603
  return budget_err
604
  self._findings_submitted.append(finding.lower().strip())
605
  outcome = self._advance_state("submit_finding")
606
+
607
+ # Evidence chain validation — check if agent investigated relevant areas
608
+ evidence_warnings = []
609
+ finding_lower = finding.lower()
610
+ EVIDENCE_MAP = {
611
+ "bias": "training_data",
612
+ "discrimination": "training_data",
613
+ "data_governance": "training_data",
614
+ "callback": "training_data",
615
+ "demographic": "training_data",
616
+ "oversight": "oversight",
617
+ "human_review": "oversight",
618
+ "human_oversight": "oversight",
619
+ "article_14": "oversight",
620
+ "documentation": "documentation",
621
+ "annex_iv": "documentation",
622
+ "technical_doc": "documentation",
623
+ "transparency": "transparency",
624
+ "disclosure": "transparency",
625
+ "article_50": "transparency",
626
+ "labeling": "transparency",
627
+ "watermark": "transparency",
628
+ "risk_management": "risk_management",
629
+ "conformity": "risk_management",
630
+ "article_9": "risk_management",
631
+ "logging": "logging",
632
+ "traceability": "logging",
633
+ "article_12": "logging",
634
+ "audit_trail": "logging",
635
+ }
636
+ relevant_areas = set()
637
+ for keyword, area in EVIDENCE_MAP.items():
638
+ if keyword in finding_lower:
639
+ relevant_areas.add(area)
640
+
641
+ uninvestigated = [a for a in relevant_areas if not self._discovered_info.get(a)]
642
+ if uninvestigated:
643
+ evidence_warnings.append(
644
+ f"Note: Finding references {', '.join(uninvestigated)} "
645
+ f"but you have not investigated {'this area' if len(uninvestigated) == 1 else 'these areas'} yet. "
646
+ f"Findings are stronger when supported by evidence from investigation tools."
647
+ )
648
+
649
+ result = {
650
  "finding_recorded": finding,
651
  "severity": severity,
652
  "total_findings": len(self._findings_submitted),
653
  "queries_remaining": QUERY_BUDGET - self._queries_used,
654
+ }
655
+ if evidence_warnings:
656
+ result["evidence_warnings"] = evidence_warnings
657
+ return json.dumps(result)
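The evidence-chain check above reduces to a keyword lookup followed by a set difference. The sketch below uses a four-entry subset of the keyword map; `uninvestigated_areas` and the sample findings are hypothetical and exist only for this illustration.

```python
from typing import Dict, List, Set

# Subset of the keyword-to-area map, kept small for illustration.
EVIDENCE_MAP: Dict[str, str] = {
    "bias": "training_data",
    "human_oversight": "oversight",
    "annex_iv": "documentation",
    "audit_trail": "logging",
}


def uninvestigated_areas(finding: str, investigated: Set[str]) -> List[str]:
    """Return areas a finding refers to that have not been investigated."""
    text = finding.lower()
    # Plain substring matching: a keyword counts if it appears anywhere.
    needed = {area for kw, area in EVIDENCE_MAP.items() if kw in text}
    return sorted(needed - investigated)


print(uninvestigated_areas("gender_bias and missing_audit_trail", {"training_data"}))
print(uninvestigated_areas("No human oversight of rejections", set()))
```

One consequence of substring matching on underscore-joined keys is visible in the second call: a finding phrased with spaces ("human oversight") does not contain `human_oversight`, so no warning would be produced for it.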
658
 
659
  def _tool_recommend_fix(self, finding: str, remediation: str, priority: int = 1) -> str:
660
  budget_err = self._use_query()
 
694
 
695
  self._reward = breakdown.total()
696
 
697
+ # Build detailed audit report showing what was found vs missed
698
+ ground_truth = self._scenario.ground_truth_findings
699
+ found_count = 0
700
+ missed = []
701
+ for gt in ground_truth:
702
+ gt_lower = gt.lower()
703
+ gt_tokens = set(gt_lower.replace("-", "_").split("_")) - {""}
704
+ matched = False
705
+ for sub in self._findings_submitted:
706
+ sub_tokens = set(sub.replace("-", "_").split("_")) - {""}
707
+ overlap = len(gt_tokens & sub_tokens)
708
+ if overlap >= 2 or (gt_tokens and overlap / len(gt_tokens) >= 0.4):
709
+ matched = True
710
+ break
711
+ if matched:
712
+ found_count += 1
713
+ else:
714
+ missed.append(gt)
715
+
716
+ # Classification feedback
717
+ correct_class = self._scenario.correct_classification.lower()
718
+ class_correct = self._classification_submitted == correct_class
719
+
720
+ audit_report = {
721
  "done": True,
722
  "reward": self._reward,
 
723
  "reward_breakdown": breakdown.to_dict(),
724
+ "audit_summary": {
725
+ "classification": {
726
+ "submitted": self._classification_submitted or "(none)",
727
+ "correct": correct_class,
728
+ "match": "exact" if class_correct else "partial" if breakdown.classification > 0 else "wrong",
729
+ },
730
+ "findings": {
731
+ "submitted": len(self._findings_submitted),
732
+ "ground_truth_total": len(ground_truth),
733
+ "matched": found_count,
734
+ "missed": missed,
735
+ },
736
+ "remediation_count": len(self._remediation_submitted),
737
+ "areas_investigated": [k for k, v in self._discovered_info.items() if v],
738
+ "tool_calls_used": self._queries_used,
739
+ "episode_duration_seconds": round(time.time() - self._start_time, 1),
740
+ },
741
+ }
742
+
743
+ return json.dumps(audit_report, indent=2)
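The found-versus-missed tally above relies on a token-overlap rule: a ground-truth label counts as matched when a submitted finding shares at least two tokens with it, or at least 40% of its tokens. A minimal standalone sketch of that rule follows; the helper names `_tokens` and `matches` and the sample labels are hypothetical.

```python
from typing import Iterable, Set


def _tokens(label: str) -> Set[str]:
    """Split a snake_case or kebab-case label into lowercase tokens."""
    return set(label.lower().replace("-", "_").split("_")) - {""}


def matches(ground_truth: str, submitted: Iterable[str]) -> bool:
    """True if any submitted finding overlaps enough with the ground truth."""
    gt = _tokens(ground_truth)
    for sub in submitted:
        overlap = len(gt & _tokens(sub))
        if overlap >= 2 or (gt and overlap / len(gt) >= 0.4):
            return True
    return False


print(matches("insufficient_human_oversight", ["human_oversight_missing"]))
print(matches("missing-logging", ["logging"]))
print(matches("no_bias_audit", ["logging_gap"]))
```

Two properties follow directly from the rule: for a ground-truth label of two tokens or fewer, a single shared token already clears the 40% threshold, and a finding written with spaces instead of underscores collapses into one token and therefore matches nothing.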
744
 
745
  def close(self) -> None:
746
  pass
server/gradio_landing.py CHANGED
@@ -270,8 +270,8 @@ def _audit_flow_html(scenario_id: str) -> str:
270
 
271
  def _hero_html() -> str:
272
  stats = [
273
- ("SCENARIOS", "8"), ("MCP TOOLS", "11"), ("REWARD COMPS", "6"),
274
- ("TIERS", "3"), ("QUERY BUDGET", "100"), ("EU DEADLINE", "Aug '26"),
275
  ]
276
  stat_boxes = "".join(
277
  f'<div class="stat"><div class="val">{v}</div><div class="label">{k}</div></div>'
@@ -281,9 +281,11 @@ def _hero_html() -> str:
281
  <div class="hero">
282
  <div><span class="accent-bar"></span><h1 style="display:inline;vertical-align:middle;">EU AI Act Compliance Auditor</h1></div>
283
  <p class="subtitle">
284
- An MCP-based environment where LLM agents audit AI systems for EU AI Act compliance.
285
- 8 scenarios from chatbot transparency to prohibited social scoring.
286
- Parameter randomization on every reset prevents memorization &mdash; agents must learn the <em>audit process</em>, not specific answers.
 
 
287
  </p>
288
  <div class="stats">{stat_boxes}</div>
289
  </div>"""
@@ -291,12 +293,12 @@ def _hero_html() -> str:
291
 
292
  def _design_cards_html() -> str:
293
  cards_data = [
294
- ("\u00A7", "Real Regulatory Scenarios", "Based on actual EU AI Act articles: prohibited social scoring (Art. 5), high-risk hiring (Annex III), deepfake transparency (Art. 50), medical device audits. Not toy problems."),
295
- ("\u2699", "Full Audit Toolkit", "11 MCP tools mirror a compliance auditor's workflow: system overview, risk classification, documentation review, bias audit, oversight verification, transparency check, risk assessment, logging verification."),
296
- ("\u25C8", "State-Graph Audit Process", "Each scenario is a directed graph with progress / no_effect / worsened transitions. Partial credit via BFS depth along the optimal path. Wrong audit steps waste your query budget."),
297
- ("\u25C9", "6-Component Reward", "Classification accuracy (20%), finding completeness (25%), finding precision (15%), remediation quality (15%), methodology adherence (15%), efficiency (10%). Anti-exploit design."),
298
- ("\u27F3", "Parameter Randomization", "Company names, deployment dates, regions, and system versions re-rolled on every reset. 65K+ unique instances per scenario. Agents must generalize."),
299
- ("\u23F1", "Enforcement: Aug 2026", "EU AI Act enforcement begins August 2, 2026. Fines up to EUR 35M or 7% of global revenue. Every company deploying AI in Europe needs compliance auditing."),
300
  ]
301
  cards = ""
302
  for icon, title, desc in cards_data:
@@ -427,6 +429,71 @@ def _leaderboard_html() -> str:
427
  return f'<table class="lb">{header}{rows}</table>'
428
 
429
 
430
  def _architecture_html() -> str:
431
  reward_items = [
432
  ("Classification Accuracy", "20%", "Correct risk category (prohibited / high_risk / limited_risk / minimal_risk)"),
@@ -500,6 +567,59 @@ def _architecture_html() -> str:
500
  </div>"""
501
 
502
 
503
  def _try_it_html() -> str:
504
  return f"""
505
  <div style="display:grid;grid-template-columns:1fr 1fr;gap:16px;">
@@ -550,18 +670,18 @@ def _pg_reset(difficulty: str) -> Tuple:
550
 
551
  def _pg_call(sid: str, tool_name: str, args_str: str) -> Tuple:
552
  if not sid:
553
- return "Click Reset first", {"error": "No session"}
554
  with _pg_lock:
555
  env = _pg_sessions.get(sid)
556
  if not env:
557
- return "Session expired", {"error": "Session not found"}
558
  fn = env._tool_fns.get(tool_name)
559
  if not fn:
560
- return f"Unknown tool: {tool_name}", {"error": "Unknown tool"}
561
  try:
562
  kwargs = json.loads(args_str) if args_str and args_str.strip() else {}
563
  except json.JSONDecodeError:
564
- return "Invalid JSON", {"error": "Bad JSON in arguments"}
565
  try:
566
  result = fn(**kwargs)
567
  parsed = json.loads(result) if isinstance(result, str) else result
@@ -571,9 +691,32 @@ def _pg_call(sid: str, tool_name: str, args_str: str) -> Tuple:
571
  status = f"Queries: {queries}/100 | Findings: {len(env._findings_submitted)} | Done: {done}"
572
  if done:
573
  status += f" | REWARD: {reward:.4f}"
574
- return status, parsed
575
  except Exception as e:
576
- return f"Error: {e}", {"error": str(e)}
577
 
578
 
579
  # ── Build the Gradio app ────────────────────────────────────────
@@ -620,13 +763,13 @@ def create_landing_app() -> gr.Blocks:
620
  # ── TAB 4: Playground ──
621
  with gr.Tab("Playground"):
622
  gr.HTML(f"<h2>Interactive Audit</h2>")
623
- gr.HTML(f'<p style="color:{MUTED};margin-bottom:12px;">Reset to start a session, then call tools in sequence. The environment tracks your audit state and scores your methodology.</p>')
624
 
625
  session_state = gr.State(value=None)
626
  pg_status = gr.Textbox(label="Status", interactive=False, value="Click Reset to begin")
627
 
628
  with gr.Row(elem_classes="pg-row"):
629
- pg_diff = gr.Dropdown(choices=["easy", "medium", "hard"], value="easy", label="Difficulty")
630
  pg_reset_btn = gr.Button("Reset", variant="primary", min_width=120)
631
 
632
  with gr.Row(elem_classes="pg-row"):
@@ -634,25 +777,37 @@ def create_landing_app() -> gr.Blocks:
634
  pg_args = gr.Textbox(label="Arguments (JSON)", placeholder='{"risk_category": "high_risk"}')
635
  pg_call_btn = gr.Button("Call Tool", variant="secondary", min_width=120)
636
 
637
- pg_result = gr.JSON(label="Result")
 
 
 
638
 
639
  def _on_reset(diff):
640
  sid, status, obs = _pg_reset(diff)
641
- return sid, status, obs
 
642
 
643
  def _on_call(sid, tool, args):
644
- status, result = _pg_call(sid, tool, args)
645
- return status, result
646
 
647
- pg_reset_btn.click(_on_reset, [pg_diff], [session_state, pg_status, pg_result])
648
- pg_call_btn.click(_on_call, [session_state, pg_tool, pg_args], [pg_status, pg_result])
649
 
650
  # ── TAB 5: Architecture ──
651
  with gr.Tab("Architecture"):
652
  gr.HTML(f"<h2>Environment Architecture</h2>")
 
 
653
  gr.HTML(_architecture_html())
654
 
655
- # ── TAB 6: Try It ──
 
 
 
 
 
 
656
  with gr.Tab("Try It"):
657
  gr.HTML(f"<h2>Run the baseline yourself</h2>")
658
  gr.HTML(_try_it_html())
 
270
 
271
  def _hero_html() -> str:
272
  stats = [
273
+ ("FIXED SCENARIOS", "9"), ("PROCEDURAL", "\u221E"), ("MCP TOOLS", "11"),
274
+ ("REWARD COMPS", "6"), ("TESTS", "74"), ("EU DEADLINE", "Aug '26"),
275
  ]
276
  stat_boxes = "".join(
277
  f'<div class="stat"><div class="val">{v}</div><div class="label">{k}</div></div>'
 
281
  <div class="hero">
282
  <div><span class="accent-bar"></span><h1 style="display:inline;vertical-align:middle;">EU AI Act Compliance Auditor</h1></div>
283
  <p class="subtitle">
284
+ An MCP environment where LLM agents audit AI systems for EU AI Act compliance.
285
+ Tools return investigation-grade regulatory documents &mdash; statistical tables, documentation inventories,
286
+ operational procedures &mdash; that require genuine analysis to identify violations.
287
+ No pre-digested verdicts. The agent must reason about evidence across 9 fixed scenarios spanning
288
+ prohibited social scoring, high-risk hiring bias, medical device compliance, and multi-system corporate audits.
289
  </p>
290
  <div class="stats">{stat_boxes}</div>
291
  </div>"""
 
293
 
294
  def _design_cards_html() -> str:
295
  cards_data = [
296
+ ("\u00A7", "Investigation-Grade Documents", "Tools return 30-70 line regulatory documents: Annex IV cross-reference tables, demographic callback rate matrices, operational procedure extracts. No labels like 'COMPLIANT' or 'FAILED' &mdash; the agent must analyze the evidence and reason about violations."),
297
+ ("\u2699", "Dynamic Audit State", "The environment responds to the agent's actions in real time. After findings are submitted, subsequent tool calls show audit progress. After classification, investigation tools reflect the current audit context. The environment feels alive, not static."),
298
+ ("\u25C8", "6 Unique Graph Topologies", "Each scenario has a distinct state graph. Prohibited systems have short detection paths (5 steps). Full high-risk audits require 11 steps across all investigation tools. Wrong tool order triggers worsened transitions. BFS-based partial credit."),
299
+ ("\u25C9", "12 Anti-Gaming Tests", "Adversarial test suite proves the reward can't be gamed: skip investigation, spam findings, red herring bait, hallucinated findings, wrong classification isolation, fewer-than-optimal rushing, and 6 more exploit strategies. All proven ineffective."),
300
+ ("\u27F3", "Cross-Document Reasoning", "Findings require correlating evidence across multiple tools. Hiring bias: training data shows 23% callback gap (audit_training_data) while only 5% of rejections reviewed (verify_human_oversight). Social scoring: 'wellness app' framing (overview) vs. public service access impact (check_transparency)."),
301
+ ("\u221E", "Procedural Scenario Generator", "Beyond the 9 fixed scenarios, a seed-based procedural generator combines 5 system types &times; 16 violation templates &times; 5 red herrings to produce <strong>infinite unique scenarios</strong>. Use <code>procedural_medium_42</code> as scenario ID &mdash; every seed creates a different audit. Impossible to memorize."),
302
  ]
303
  cards = ""
304
  for icon, title, desc in cards_data:
 
429
  return f'<table class="lb">{header}{rows}</table>'
430
 
431
 
432
+ def _investigation_depth_html() -> str:
433
+ """Show the before/after of investigation-grade tool responses."""
434
+ return f"""
435
+ <div class="arch-box" style="margin-bottom:16px;">
436
+ <h3 style="color:{GOLD};">Investigation-Grade Tool Responses</h3>
437
+ <p style="color:{MUTED};font-size:13px;margin-bottom:14px;">Tools return realistic regulatory documents requiring analysis — not pre-digested answers.</p>
438
+ <div style="display:grid;grid-template-columns:1fr 1fr;gap:16px;">
439
+ <div>
440
+ <h4 style="color:{ROSE};font-size:11px;letter-spacing:0.05em;margin-bottom:8px;">TYPICAL ENV (pre-digested)</h4>
441
+ <div class="code-block" style="font-size:11px;color:{MUTED};border-color:{ROSE}40;">{{"bias_assessment": "FAILED",
442
+ "callback_rate_gap": "23%",
443
+ "article_14_compliance": "NON-COMPLIANT",
444
+ "human_oversight": "INSUFFICIENT"}}</div>
445
+ </div>
446
+ <div>
447
+ <h4 style="color:{EMERALD};font-size:11px;letter-spacing:0.05em;margin-bottom:8px;">THIS ENV (investigation-grade)</h4>
448
+ <div class="code-block" style="font-size:11px;color:{EMERALD};border-color:{EMERALD}40;">CALLBACK RATES BY DEMOGRAPHIC:
449
+ Group Rate vs Baseline
450
+ Male applicants 34.2% (baseline)
451
+ Female applicants 26.3% -23.1%
452
+ Eastern EU 27.4% -19.9%
453
+
454
+ REVIEW STATISTICS (Q4 2025):
455
+ Auto-rejected: 208,375 (60.0%)
456
+ QA sample: 10,419 (5.0%)
457
+ QA overrides: 312 (3.0%)</div>
458
+ </div>
459
+ </div>
460
+ <p style="color:{MUTED};font-size:12px;margin-top:10px;">The agent must identify the 23% callback disparity from the table, recognize that 95% of rejections have no human review,
461
+ and correlate these across documents to form findings. No verdict is pre-computed.</p>
462
+ </div>"""
463
+
464
+
465
+ def _antigaming_html() -> str:
466
+ """Anti-gaming test showcase."""
467
+ tests = [
468
+ ("Skip Investigation", "Submit correct findings without reading documents", "methodology = 0.0"),
469
+ ("Spam Findings", "Flood 16 findings hoping to hit ground truth", "precision < 0.50"),
470
+ ("Red Herring Bait", "Submit red herrings as violations", "precision = 0.0, completeness = 0.0"),
471
+ ("Immediate Verify", "Call verify_compliance with empty inputs", "total < 0.05"),
472
+ ("Wrong Classification", "Everything correct except risk category", "loses &ge; 10% gap"),
473
+ ("Skip Remediation", "Find all violations but propose no fixes", "remediation = 0.0"),
474
+ ("Classify Before Overview", "Skip system understanding", "methodology < 0.50"),
475
+ ("Rush (Fewer Steps)", "Game efficiency by taking fewer steps", "efficiency penalized"),
476
+ ("Hallucinate Findings", "Submit plausible-sounding false findings", "completeness < 0.40"),
477
+ ("Wrong Class on Prohibited", "Call prohibited system high_risk", "classification = 0.40"),
478
+ ("Perfect Run Sanity", "Legitimate perfect audit", "total > 0.85"),
479
+ ("Bounds Check", "All scenarios x all inputs", "reward in (0.001, 0.999)"),
480
+ ]
481
+ rows = ""
482
+ for name, strategy, result in tests:
483
+ rows += f'<tr><td style="color:{TEXT};font-weight:500;">{name}</td><td style="color:{MUTED};font-size:12px;">{strategy}</td><td style="color:{ROSE};font-family:monospace;font-size:12px;">{result}</td></tr>'
484
+ return f"""
485
+ <div class="arch-box" style="margin-bottom:16px;">
486
+ <h3 style="color:{GOLD};">12 Anti-Gaming Tests</h3>
487
+ <p style="color:{MUTED};font-size:13px;margin-bottom:10px;">Adversarial test suite proving the reward function is robust against common exploits.</p>
488
+ <table style="width:100%;border-collapse:collapse;font-size:13px;">
489
+ <tr><th style="text-align:left;color:{MUTED};font-size:10px;padding:6px 8px;border-bottom:1px solid {BORDER};">EXPLOIT</th>
490
+ <th style="text-align:left;color:{MUTED};font-size:10px;padding:6px 8px;border-bottom:1px solid {BORDER};">STRATEGY</th>
491
+ <th style="text-align:left;color:{MUTED};font-size:10px;padding:6px 8px;border-bottom:1px solid {BORDER};">RESULT</th></tr>
492
+ {rows}
493
+ </table>
494
+ </div>"""
495
+
496
+
497
  def _architecture_html() -> str:
498
  reward_items = [
499
  ("Classification Accuracy", "20%", "Correct risk category (prohibited / high_risk / limited_risk / minimal_risk)"),
 
567
  </div>"""
568
 
569
 
570
+ def _compliance_map_html() -> str:
571
+ """EU AI Act article coverage matrix — unique to compliance audit domain."""
572
+ mappings = [
573
+ ("Article 5", "Prohibited Practices", "classify_system", ["hard_social_scoring"]),
574
+ ("Article 6 + Annex III", "High-Risk Classification", "classify_system, assess_risk_management", ["medium_hiring", "medium_credit", "medium_medical", "hard_multi_system"]),
575
+ ("Article 9", "Risk Management", "assess_risk_management", ["medium_hiring", "medium_credit", "medium_medical"]),
576
+ ("Article 10", "Data Governance", "audit_training_data", ["medium_hiring", "medium_credit", "medium_medical", "hard_multi_system"]),
577
+ ("Article 12", "Record-Keeping", "check_logging", ["medium_hiring", "medium_medical", "hard_deepfake", "hard_multi_system"]),
578
+ ("Article 13", "Transparency (Deployers)", "check_transparency, check_documentation", ["medium_hiring", "medium_credit", "medium_medical"]),
579
+ ("Article 14", "Human Oversight", "verify_human_oversight", ["medium_hiring", "medium_credit", "medium_medical", "hard_multi_system"]),
580
+ ("Article 50", "Transparency (Certain AI Systems)", "check_transparency", ["easy_chatbot", "hard_deepfake"]),
581
+ ("Annex IV", "Technical Documentation", "check_documentation", ["medium_hiring", "medium_credit", "medium_medical", "hard_deepfake"]),
582
+ ("MDR + AI Act", "Medical Device Dual-Regulation", "check_documentation, assess_risk_management", ["medium_medical"]),
583
+ ]
584
+
585
+ rows = ""
586
+ for article, title, tools_str, scenarios in mappings:
587
+ tool_badges = " ".join(
588
+ f'<span style="background:{AMBER}15;color:{AMBER};padding:2px 8px;border-radius:4px;font-size:10px;font-family:monospace;">{t.strip()}</span>'
589
+ for t in tools_str.split(",")
590
+ )
591
+ scenario_badges = " ".join(
592
+ f'<span style="background:{BLUE}15;color:{BLUE};padding:2px 6px;border-radius:4px;font-size:10px;">{s}</span>'
593
+ for s in scenarios
594
+ )
595
+ rows += f'''<tr>
596
+ <td style="padding:10px 8px;border-bottom:1px solid {BORDER}10;white-space:nowrap;">
597
+ <strong style="color:{GOLD};">{article}</strong><br/>
598
+ <span style="color:{MUTED};font-size:11px;">{title}</span>
599
+ </td>
600
+ <td style="padding:10px 8px;border-bottom:1px solid {BORDER}10;">{tool_badges}</td>
601
+ <td style="padding:10px 8px;border-bottom:1px solid {BORDER}10;">{scenario_badges}</td>
602
+ </tr>'''
603
+
604
+ return f"""<table style="width:100%;border-collapse:collapse;">
605
+ <tr>
606
+ <th style="text-align:left;color:{MUTED};font-size:10px;letter-spacing:0.06em;padding:8px;border-bottom:1px solid {BORDER};">ARTICLE</th>
607
+ <th style="text-align:left;color:{MUTED};font-size:10px;letter-spacing:0.06em;padding:8px;border-bottom:1px solid {BORDER};">INVESTIGATION TOOLS</th>
608
+ <th style="text-align:left;color:{MUTED};font-size:10px;letter-spacing:0.06em;padding:8px;border-bottom:1px solid {BORDER};">SCENARIOS</th>
609
+ </tr>
610
+ {rows}
611
+ </table>
612
+ <div style="margin-top:16px;padding:16px;background:{CARD};border:1px solid {BORDER};border-radius:10px;">
613
+ <h4 style="color:{GOLD};font-size:13px;margin-bottom:8px;">Cross-Document Reasoning Requirements</h4>
614
+ <div style="color:{MUTED};font-size:12px;line-height:1.8;">
615
+ <strong style="color:{TEXT};">Hiring Bias (5 findings):</strong> audit_training_data reveals 23% callback gap &rarr; verify_human_oversight shows only 5% review rate &rarr; check_documentation confirms missing FRIA &rarr; agent must connect all three<br/>
616
+ <strong style="color:{TEXT};">Social Scoring (5 findings):</strong> get_system_overview frames as "wellness app" &rarr; check_transparency reveals service access impact &rarr; verify_human_oversight shows municipal integration &rarr; agent must recognize Art. 5 violation<br/>
617
+ <strong style="color:{TEXT};">Multi-System (6 findings):</strong> audit_training_data reveals cross-system data flows &rarr; check_documentation shows missing combined DPIA &rarr; verify_human_oversight reveals no unified oversight &rarr; compound risk emerges across documents<br/>
618
+ <strong style="color:{TEXT};">Medical Triage (4 findings):</strong> audit_training_data shows age-bias in 75+ cohort &rarr; check_documentation confirms retrospective-only validation &rarr; check_logging reveals no real-time monitoring &rarr; safety gap pattern
619
+ </div>
620
+ </div>"""
621
+
622
+
623
  def _try_it_html() -> str:
624
  return f"""
625
  <div style="display:grid;grid-template-columns:1fr 1fr;gap:16px;">
 
670
 
671
  def _pg_call(sid: str, tool_name: str, args_str: str) -> Tuple:
672
  if not sid:
673
+ return "Click Reset first", "(no session)", {"error": "No session"}
674
  with _pg_lock:
675
  env = _pg_sessions.get(sid)
676
  if not env:
677
+ return "Session expired", "(expired)", {"error": "Session not found"}
678
  fn = env._tool_fns.get(tool_name)
679
  if not fn:
680
+ return f"Unknown tool: {tool_name}", "(error)", {"error": "Unknown tool"}
681
  try:
682
  kwargs = json.loads(args_str) if args_str and args_str.strip() else {}
683
  except json.JSONDecodeError:
684
+ return "Invalid JSON", "(error)", {"error": "Bad JSON in arguments"}
685
  try:
686
  result = fn(**kwargs)
687
  parsed = json.loads(result) if isinstance(result, str) else result
 
691
  status = f"Queries: {queries}/100 | Findings: {len(env._findings_submitted)} | Done: {done}"
692
  if done:
693
  status += f" | REWARD: {reward:.4f}"
694
+
695
+ # Extract document content for rich display
696
+ doc_content = parsed.get("content", "")
697
+ if not doc_content and "audit_summary" in parsed:
698
+ # Verify compliance result — format nicely
699
+ summary = parsed["audit_summary"]
700
+ lines = [f"AUDIT COMPLETE — Reward: {parsed.get('reward', 0):.4f}"]
701
+ lines.append(f"\nClassification: {summary['classification']['submitted']} "
702
+ f"({'correct' if summary['classification']['match'] == 'exact' else summary['classification']['match']})")
703
+ lines.append(f"Correct answer: {summary['classification']['correct']}")
704
+ lines.append(f"\nFindings: {summary['findings']['matched']}/{summary['findings']['ground_truth_total']} matched")
705
+ if summary["findings"]["missed"]:
706
+ lines.append("Missed:")
707
+ for m in summary["findings"]["missed"]:
708
+ lines.append(f" - {m}")
709
+ lines.append(f"\nAreas investigated: {', '.join(summary.get('areas_investigated', []))}")
710
+ lines.append("\nReward breakdown:")
711
+ for k, v in parsed.get("reward_breakdown", {}).items():
712
+ lines.append(f" {k}: {v}")
713
+ doc_content = "\n".join(lines)
714
+ elif not doc_content:
715
+ doc_content = json.dumps(parsed, indent=2)
716
+
717
+ return status, doc_content, parsed
718
  except Exception as e:
719
+ return f"Error: {e}", str(e), {"error": str(e)}
720
 
721
 
722
  # ── Build the Gradio app ────────────────────────────────────────
 
763
  # ── TAB 4: Playground ──
764
  with gr.Tab("Playground"):
765
  gr.HTML(f"<h2>Interactive Audit</h2>")
766
+ gr.HTML(f'<p style="color:{MUTED};margin-bottom:12px;">Reset to start a session, then call tools in sequence. The environment tracks your audit state and scores your methodology. Documents render below — this is what the agent sees.</p>')
767
 
768
  session_state = gr.State(value=None)
769
  pg_status = gr.Textbox(label="Status", interactive=False, value="Click Reset to begin")
770
 
771
  with gr.Row(elem_classes="pg-row"):
772
+ pg_diff = gr.Dropdown(choices=["easy", "medium", "hard"], value="medium", label="Difficulty")
773
  pg_reset_btn = gr.Button("Reset", variant="primary", min_width=120)
774
 
775
  with gr.Row(elem_classes="pg-row"):
 
777
  pg_args = gr.Textbox(label="Arguments (JSON)", placeholder='{"risk_category": "high_risk"}')
778
  pg_call_btn = gr.Button("Call Tool", variant="secondary", min_width=120)
779
 
780
+ pg_doc = gr.Textbox(label="Document Content (what the agent sees)", lines=20, interactive=False)
781
+
782
+ with gr.Accordion("Raw JSON Response", open=False):
783
+ pg_result = gr.JSON(label="Raw")
784
 
785
  def _on_reset(diff):
786
  sid, status, obs = _pg_reset(diff)
787
+ initial_doc = obs.get("message", "Session started. Call get_system_overview to begin.")
788
+ return sid, status, initial_doc, obs
789
 
790
  def _on_call(sid, tool, args):
791
+ status, doc_content, result = _pg_call(sid, tool, args)
792
+ return status, doc_content, result
793
 
794
+ pg_reset_btn.click(_on_reset, [pg_diff], [session_state, pg_status, pg_doc, pg_result])
795
+ pg_call_btn.click(_on_call, [session_state, pg_tool, pg_args], [pg_status, pg_doc, pg_result])
796
 
797
  # ── TAB 5: Architecture ──
798
  with gr.Tab("Architecture"):
799
  gr.HTML(f"<h2>Environment Architecture</h2>")
800
+ gr.HTML(_investigation_depth_html())
801
+ gr.HTML(_antigaming_html())
802
  gr.HTML(_architecture_html())
803
 
804
+ # ── TAB 6: Compliance Map ──
805
+ with gr.Tab("Compliance Map"):
806
+ gr.HTML(f"<h2>EU AI Act Article Coverage</h2>")
807
+ gr.HTML(f'<p style="color:{MUTED};margin-bottom:16px;">How each investigation tool maps to EU AI Act provisions, and which scenarios test each article.</p>')
808
+ gr.HTML(_compliance_map_html())
809
+
810
+ # ── TAB 7: Try It ──
811
  with gr.Tab("Try It"):
812
  gr.HTML(f"<h2>Run the baseline yourself</h2>")
813
  gr.HTML(_try_it_html())
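The sample table rendered by `_investigation_depth_html` leaves the disparity arithmetic to the agent. A minimal sketch of that calculation follows, using only the figures shown in the dashboard sample. The helper name `relative_gap` is hypothetical and is not part of the repository.

```python
def relative_gap(rate: float, baseline: float) -> float:
    """Relative difference of a group's rate versus the baseline, in percent."""
    return (rate - baseline) / baseline * 100.0


# Figures copied from the dashboard sample table (Q4 2025).
baseline = 34.2          # callback rate for male applicants, in percent
female = 26.3            # callback rate for female applicants, in percent
eastern_eu = 27.4        # callback rate for Eastern EU applicants, in percent
auto_rejected = 208_375  # applications rejected automatically
qa_sample = 10_419       # rejections that received human QA review

female_gap = round(relative_gap(female, baseline), 1)
eastern_gap = round(relative_gap(eastern_eu, baseline), 1)
review_share = round(qa_sample / auto_rejected * 100.0, 1)
unreviewed_share = round(100.0 - review_share, 1)

print(female_gap, eastern_gap, review_share, unreviewed_share)
# -23.1 -19.9 5.0 95.0
```

The two derived numbers, a 23% relative callback gap and 95% of rejections without review, are the ones the dashboard text says the agent has to recover on its own.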
tests/test_difficulty_calibration.py ADDED
@@ -0,0 +1,106 @@
1
+ """Difficulty calibration tests.
+
+ Checks that the difficulty tiers are calibrated: hard scenarios carry
+ more ground truth findings than easy ones, misclassifying the prohibited
+ scenario costs reward, and a naive agent (same strategy for every
+ scenario) scores differently across the medium scenarios.
+ """
7
+
8
+ import json
9
+ from server.environment import ComplianceAuditorEnvironment
10
+ from scenarios.registry import SCENARIO_LIST
11
+
12
+
13
+ def _naive_audit(scenario_id: str) -> float:
14
+ """Run a naive audit strategy — call all tools in order, submit generic findings."""
15
+ env = ComplianceAuditorEnvironment()
16
+ env.reset(seed=42, scenario_id=scenario_id)
17
+
18
+ # Naive strategy: call everything, classify as high_risk, submit generic findings
19
+ env._tool_fns["get_system_overview"]()
20
+ env._tool_fns["classify_system"](risk_category="high_risk")
21
+ env._tool_fns["check_documentation"]()
22
+ env._tool_fns["audit_training_data"]()
23
+ env._tool_fns["verify_human_oversight"]()
24
+ env._tool_fns["check_transparency"]()
25
+ env._tool_fns["assess_risk_management"]()
26
+ env._tool_fns["check_logging"]()
27
+ env._tool_fns["submit_finding"](finding="documentation_gaps", severity="high")
28
+ env._tool_fns["submit_finding"](finding="bias_concern", severity="high")
29
+ env._tool_fns["submit_finding"](finding="insufficient_oversight", severity="medium")
30
+ env._tool_fns["recommend_fix"](finding="gaps", remediation="improve_documentation")
31
+ env._tool_fns["recommend_fix"](finding="bias", remediation="conduct_bias_audit")
32
+
33
+ result = json.loads(env._tool_fns["verify_compliance"](
34
+ risk_classification="high_risk",
35
+ overall_assessment="Multiple compliance gaps identified",
36
+ key_findings_summary="Documentation, bias, and oversight issues"
37
+ ))
38
+ return result["reward"]
39
+
40
+
41
+ def test_hard_scenarios_have_more_findings_than_easy():
42
+ """Hard scenarios require identifying more ground truth findings.
43
+
44
+ This validates difficulty calibration — easy scenarios have 1-2 findings
45
+ while hard scenarios have 5-6, making them harder to get perfect on.
46
+ """
47
+ from scenarios.registry import get_scenario
48
+
49
+ easy_findings = []
50
+ hard_findings = []
51
+
52
+ for sc_info in SCENARIO_LIST:
53
+ sc = get_scenario(sc_info["id"], 42)
54
+ count = len(sc.ground_truth_findings)
55
+ if sc_info["difficulty"] == "easy":
56
+ easy_findings.append(count)
57
+ elif sc_info["difficulty"] == "hard":
58
+ hard_findings.append(count)
59
+
60
+ avg_easy = sum(easy_findings) / len(easy_findings)
61
+ avg_hard = sum(hard_findings) / len(hard_findings)
62
+
63
+ assert avg_hard > avg_easy * 2, \
64
+ f"Hard scenarios ({avg_hard:.1f} avg findings) should have more than 2x the findings of easy ({avg_easy:.1f})"
65
+
66
+
67
+ def test_prohibited_scenario_punishes_wrong_classification():
68
+ """Classifying a prohibited system as high_risk should lose significant points.
69
+
70
+ The prohibited scenario is the hardest because the agent must see through
71
+ the deployer's framing to correctly identify it as prohibited.
72
+ """
73
+ # Naive agent classifies as high_risk (wrong for prohibited)
74
+ prohibited_score = _naive_audit("hard_social_scoring_prohibited_001")
75
+
76
+ # Perfect classification on the same scenario
77
+ env = ComplianceAuditorEnvironment()
78
+ env.reset(seed=42, scenario_id="hard_social_scoring_prohibited_001")
79
+ env._tool_fns["get_system_overview"]()
80
+ env._tool_fns["classify_system"](risk_category="prohibited")
81
+ env._tool_fns["submit_finding"](finding="prohibited_social_scoring_system")
82
+ env._tool_fns["recommend_fix"](finding="prohibited", remediation="immediate_system_shutdown")
83
+ result = json.loads(env._tool_fns["verify_compliance"](
84
+ risk_classification="prohibited",
85
+ overall_assessment="Prohibited system",
86
+ key_findings_summary="Social scoring"
87
+ ))
88
+ correct_score = result["reward"]
89
+
90
+ assert correct_score > prohibited_score, \
91
+ f"Correct prohibited ({correct_score:.3f}) should beat naive high_risk ({prohibited_score:.3f})"
92
+
93
+
94
+ def test_medium_scenarios_spread_across_difficulty():
95
+ """Medium scenarios should produce different scores with the naive agent,
96
+ showing that they test different compliance challenges.
97
+ """
98
+ medium_scores = {}
99
+ for sc_info in SCENARIO_LIST:
100
+ if sc_info["difficulty"] == "medium":
101
+ medium_scores[sc_info["id"]] = _naive_audit(sc_info["id"])
102
+
103
+ scores = list(medium_scores.values())
104
+ spread = max(scores) - min(scores)
105
+ assert spread > 0.02, \
106
+ f"Medium scenarios should have score variance. Spread: {spread:.3f}, scores: {medium_scores}"
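The calibration check in `test_hard_scenarios_have_more_findings_than_easy` amounts to grouping finding counts by tier and comparing the averages. A self-contained sketch of that idea is below. The scenario records and the helper `mean_findings_by_tier` are illustrative assumptions, not the contents of `scenarios.registry`.

```python
from collections import defaultdict
from statistics import mean


def mean_findings_by_tier(scenarios):
    """Group finding counts by difficulty tier and average each group."""
    by_tier = defaultdict(list)
    for sc in scenarios:
        by_tier[sc["difficulty"]].append(sc["n_findings"])
    return {tier: mean(counts) for tier, counts in by_tier.items()}


# Illustrative records only; the real registry may differ.
sample = [
    {"id": "easy_a", "difficulty": "easy", "n_findings": 1},
    {"id": "easy_b", "difficulty": "easy", "n_findings": 2},
    {"id": "hard_a", "difficulty": "hard", "n_findings": 5},
    {"id": "hard_b", "difficulty": "hard", "n_findings": 6},
]

averages = mean_findings_by_tier(sample)
print(averages["easy"], averages["hard"])
```

Deriving the per-tier counts from the data, rather than hard-coding them, keeps the comparison valid when scenarios are added to or removed from a tier.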
tests/test_evidence_chain.py ADDED
@@ -0,0 +1,156 @@
1
+ """Evidence chain validation tests.
2
+
3
+ Proves that the environment validates whether findings are supported by
4
+ actual investigation. This is a unique feature — most environments just
5
+ accept whatever findings are submitted without checking if the agent
6
+ actually read the relevant documents.
7
+
8
+ Tests:
+ 1. Finding without investigation → warning
+ 2. Finding after investigation → no warning
+ 3. Keyword matching covers oversight and transparency findings
+ 4. Finding without recognized keywords → no warning
+ 5. verify_compliance reports missed findings
+ 6. verify_compliance reports classification accuracy
+ 7. verify_compliance reports areas investigated
14
+ """
15
+
16
+ import json
17
+ from server.environment import ComplianceAuditorEnvironment
18
+
19
+
20
+ def test_finding_without_investigation_warns():
21
+ """Submitting a bias finding without auditing training data triggers a warning."""
22
+ env = ComplianceAuditorEnvironment()
23
+ env.reset(seed=42, scenario_id="medium_hiring_bias_001")
24
+
25
+ result = json.loads(env._tool_fns["submit_finding"](
26
+ finding="gender_bias_in_training_data", severity="critical"
27
+ ))
28
+ assert "evidence_warnings" in result, "Should warn about missing investigation"
29
+ assert any("training_data" in w for w in result["evidence_warnings"]), \
30
+ "Warning should mention training_data area"
31
+
32
+
33
+ def test_finding_after_investigation_no_warning():
34
+ """Submitting a bias finding after auditing training data has no warning."""
35
+ env = ComplianceAuditorEnvironment()
36
+ env.reset(seed=42, scenario_id="medium_hiring_bias_001")
37
+
38
+ env._tool_fns["get_system_overview"]()
39
+ env._tool_fns["audit_training_data"]()
40
+
41
+ result = json.loads(env._tool_fns["submit_finding"](
42
+ finding="gender_bias_in_training_data", severity="critical"
43
+ ))
44
+ assert "evidence_warnings" not in result, \
45
+ "No warning when area was investigated"
46
+
47
+
48
+ def test_oversight_finding_warns_without_oversight_check():
49
+ """Submitting an oversight finding without verify_human_oversight warns."""
50
+ env = ComplianceAuditorEnvironment()
51
+ env.reset(seed=42, scenario_id="medium_hiring_bias_001")
52
+
53
+ result = json.loads(env._tool_fns["submit_finding"](
54
+ finding="insufficient_human_oversight", severity="high"
55
+ ))
56
+ assert "evidence_warnings" in result
57
+ assert any("oversight" in w for w in result["evidence_warnings"])
58
+
59
+
60
+ def test_transparency_finding_warns_without_transparency_check():
61
+ """Submitting a transparency finding without check_transparency warns."""
62
+ env = ComplianceAuditorEnvironment()
63
+ env.reset(seed=42, scenario_id="easy_chatbot_transparency_001")
64
+
65
+ result = json.loads(env._tool_fns["submit_finding"](
66
+ finding="missing_ai_disclosure_transparency", severity="high"
67
+ ))
68
+ assert "evidence_warnings" in result
69
+ assert any("transparency" in w for w in result["evidence_warnings"])
70
+
71
+
72
+ def test_generic_finding_no_keyword_match_no_warning():
73
+ """A finding with no recognizable keywords produces no evidence warning."""
74
+ env = ComplianceAuditorEnvironment()
75
+ env.reset(seed=42, scenario_id="medium_hiring_bias_001")
76
+
77
+ result = json.loads(env._tool_fns["submit_finding"](
78
+ finding="general_compliance_concern", severity="medium"
79
+ ))
80
+ assert "evidence_warnings" not in result, \
81
+ "Generic findings without keyword matches should not warn"
82
+
83
+
84
+ def test_verify_shows_missed_findings():
85
+ """Verify_compliance response shows which ground truth findings were missed."""
86
+ env = ComplianceAuditorEnvironment()
87
+ env.reset(seed=42, scenario_id="medium_hiring_bias_001")
88
+
89
+ env._tool_fns["get_system_overview"]()
90
+ env._tool_fns["classify_system"](risk_category="high_risk")
91
+ env._tool_fns["submit_finding"](finding="gender_bias_in_technical_screening")
92
+
93
+ result = json.loads(env._tool_fns["verify_compliance"](
94
+ risk_classification="high_risk",
95
+ overall_assessment="Bias found",
96
+ key_findings_summary="Gender bias"
97
+ ))
98
+
99
+ summary = result["audit_summary"]
100
+ assert summary["findings"]["matched"] == 1
101
+ assert summary["findings"]["ground_truth_total"] == 5
102
+ assert len(summary["findings"]["missed"]) == 4
103
+ assert "insufficient_human_oversight" in summary["findings"]["missed"]
104
+
105
+
106
+ def test_verify_shows_classification_accuracy():
107
+ """Verify_compliance response shows classification match status."""
108
+ env = ComplianceAuditorEnvironment()
109
+ env.reset(seed=42, scenario_id="hard_social_scoring_prohibited_001")
110
+
111
+ env._tool_fns["get_system_overview"]()
112
+
113
+ # Wrong classification
114
+ result = json.loads(env._tool_fns["verify_compliance"](
115
+ risk_classification="high_risk",
116
+ overall_assessment="High risk system",
117
+ key_findings_summary="Various issues"
118
+ ))
119
+ assert result["audit_summary"]["classification"]["correct"] == "prohibited"
120
+ assert result["audit_summary"]["classification"]["match"] == "partial"
121
+
122
+ # Correct classification in a new episode
123
+ env2 = ComplianceAuditorEnvironment()
124
+ env2.reset(seed=42, scenario_id="hard_social_scoring_prohibited_001")
125
+ env2._tool_fns["get_system_overview"]()
126
+ result2 = json.loads(env2._tool_fns["verify_compliance"](
127
+ risk_classification="prohibited",
128
+ overall_assessment="Prohibited system",
129
+ key_findings_summary="Social scoring"
130
+ ))
131
+ assert result2["audit_summary"]["classification"]["match"] == "exact"
132
+
133
+
134
+ def test_verify_shows_areas_investigated():
135
+ """Verify response shows which investigation areas were actually explored."""
136
+ env = ComplianceAuditorEnvironment()
137
+ env.reset(seed=42, scenario_id="medium_hiring_bias_001")
138
+
139
+ env._tool_fns["get_system_overview"]()
140
+ env._tool_fns["classify_system"](risk_category="high_risk")
141
+ env._tool_fns["check_documentation"]()
142
+ env._tool_fns["audit_training_data"]()
143
+
144
+ result = json.loads(env._tool_fns["verify_compliance"](
145
+ risk_classification="high_risk",
146
+ overall_assessment="Partial audit",
147
+ key_findings_summary="Documentation and data issues"
148
+ ))
149
+
150
+ areas = result["audit_summary"]["areas_investigated"]
151
+ assert "overview" in areas
152
+ assert "documentation" in areas
153
+ assert "training_data" in areas
154
+ # These were NOT investigated
155
+ assert "oversight" not in areas
156
+ assert "transparency" not in areas
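The tests above treat evidence-chain validation as a black box. One way such a check could work is a keyword-to-area lookup compared against the set of areas already investigated. The sketch below is an assumption about the mechanism, not the repository's implementation. The mapping `KEYWORD_TO_AREA` and the function `evidence_warnings` are hypothetical names.

```python
# Hypothetical mapping from finding keywords to the investigation area
# that should have been explored before the finding is submitted.
KEYWORD_TO_AREA = {
    "bias": "training_data",
    "training_data": "training_data",
    "oversight": "oversight",
    "transparency": "transparency",
    "disclosure": "transparency",
    "documentation": "documentation",
}


def evidence_warnings(finding: str, investigated: set) -> list:
    """Return one warning per required area that has not been investigated."""
    text = finding.lower()
    needed = {area for keyword, area in KEYWORD_TO_AREA.items() if keyword in text}
    return [
        f"Finding '{finding}' relates to '{area}', which has not been investigated"
        for area in sorted(needed - investigated)
    ]


# No investigation yet: the bias finding draws a training_data warning.
before = evidence_warnings("gender_bias_in_training_data", set())
# After auditing training data: no warning.
after = evidence_warnings("gender_bias_in_training_data", {"overview", "training_data"})
# No recognized keyword: nothing to validate, so no warning.
generic = evidence_warnings("general_compliance_concern", set())
print(before, after, generic)
```

Collecting required areas in a set means a finding that matches several keywords for the same area produces a single warning.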
tests/test_investigation_depth.py ADDED
@@ -0,0 +1,236 @@
1
+ """Investigation depth tests.
2
+
3
+ Verifies that tool responses contain investigation-grade content requiring
4
+ genuine analysis — not pre-digested verdicts.
5
+
6
+ Tests check:
+ 1. Documents contain no pre-digested verdicts
+ 2. Documents contain statistical evidence the agent must interpret
+ 3. Red herrings are embedded naturally in the evidence
+ 4. Document length scales with difficulty tier
+ 5. Randomization changes parameterized content but not violations
+ 6. Rendered documents contain randomized values, not placeholders
+ 7. Dynamic audit progress appears after findings
+ 8. Graph topologies vary across scenarios
+ 9. The prohibited scenario's overview conceals its classification
13
+ """
14
+
15
+ import json
16
+ from server.environment import ComplianceAuditorEnvironment
17
+ from scenarios.registry import get_scenario, SCENARIO_LIST
18
+
19
+
20
+ # ── Test 1: No pre-digested verdicts in tool responses ────────────
21
+
22
+ def test_no_predigested_verdicts_in_documents():
23
+ """Investigation documents must NOT contain explicit compliance verdicts.
24
+
25
+ Labels like 'NON-COMPLIANT', 'FAILED', 'VIOLATION' hand the answer
26
+ to the agent. Documents should contain evidence, not conclusions.
27
+ """
28
+ env = ComplianceAuditorEnvironment()
29
+ env.reset(seed=42, scenario_id="medium_hiring_bias_001")
30
+
31
+ predigested_labels = [
32
+ "NON-COMPLIANT", "NON_COMPLIANT", "FAILED", "VIOLATION FOUND",
33
+ "COMPLIANCE VIOLATION", "DOES NOT COMPLY",
34
+ ]
35
+
36
+ for tool_name in ["check_documentation", "audit_training_data",
37
+ "verify_human_oversight", "check_transparency",
38
+ "assess_risk_management", "check_logging"]:
39
+ result = json.loads(env._tool_fns[tool_name]())
40
+ content = result.get("content", "")
41
+ for label in predigested_labels:
42
+ assert label not in content, \
43
+ f"Pre-digested verdict '{label}' found in {tool_name} response"
44
+
45
+
46
+ # ── Test 2: Statistical evidence present in training data audit ───
47
+
48
+ def test_training_data_contains_statistical_tables():
49
+ """Training data audit must contain numerical evidence the agent
50
+ must interpret to identify bias — not just 'bias found' labels.
51
+ """
52
+ env = ComplianceAuditorEnvironment()
53
+ env.reset(seed=42, scenario_id="medium_hiring_bias_001")
54
+
55
+ result = json.loads(env._tool_fns["audit_training_data"]())
56
+ content = result["content"]
57
+
58
+ # Must contain actual numbers (callback rates, percentages)
59
+ # Note: exact values vary due to seed-based noise injection
60
+ import re
61
+ pct_matches = re.findall(r'\d{1,2}\.\d%', content)
62
+ assert len(pct_matches) >= 4, \
63
+ f"Training data should contain multiple percentage figures, found {len(pct_matches)}"
64
+
65
+ # Must contain demographic categories
66
+ assert "Male" in content or "Female" in content, \
67
+ "Training data should reference demographic groups"
68
+
69
+ # Must NOT contain pre-computed verdict
70
+ assert "FAILED" not in content, \
71
+ "Training data should not contain pre-digested 'FAILED' verdict"
72
+
73
+
74
+ # ── Test 3: Red herrings embedded naturally ───────────────────────
75
+
76
+ def test_red_herrings_in_evidence():
77
+ """Red herring content should appear naturally in investigation documents,
78
+ not as separate labeled items the agent can trivially filter.
79
+ """
80
+ env = ComplianceAuditorEnvironment()
81
+ env.reset(seed=42, scenario_id="medium_hiring_bias_001")
82
+
83
+ # The hiring scenario has red herrings: "prohibited_social_scoring" and "biometric_processing"
84
+ # The training data document should mention the separate fraud detection system
85
+ # (which is compliant and unrelated) as a natural red herring
86
+ result = json.loads(env._tool_fns["audit_training_data"]())
87
+ content = result["content"].lower()
88
+ assert "fraud" in content, \
89
+ "Red herring (compliant fraud system) should appear naturally in training data doc"
90
+
91
+
92
+ # ── Test 4: Document length scales with difficulty ────────────────
93
+
94
+ def test_document_length_scales_with_difficulty():
95
+ """Hard scenarios should have longer, more complex documents than easy ones."""
96
+ easy_total = 0
97
+ hard_total = 0
98
+
99
+ for sc_info in SCENARIO_LIST:
100
+ sc = get_scenario(sc_info["id"], 42)
101
+ total_len = sum(len(getattr(sc, field, "")) for field in [
102
+ "documentation_data", "training_data_info", "oversight_info",
103
+ "transparency_info", "risk_assessment_info", "logging_info",
104
+ ])
105
+ if sc_info["difficulty"] == "easy":
106
+ easy_total += total_len
107
+ elif sc_info["difficulty"] == "hard":
108
+ hard_total += total_len
109
+
110
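+ n_easy = sum(1 for s in SCENARIO_LIST if s["difficulty"] == "easy")
+ n_hard = sum(1 for s in SCENARIO_LIST if s["difficulty"] == "hard")
+ avg_easy = easy_total / n_easy
+ avg_hard = hard_total / n_hard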
112
+
113
+ assert avg_hard > avg_easy * 1.3, \
114
+ f"Hard scenarios ({avg_hard:.0f} chars avg) should be significantly larger than easy ({avg_easy:.0f} chars avg)"
115
+
116
+
117
+ # ── Test 5: Randomization changes params but not violations ───────
118
+
119
+ def test_randomization_preserves_violations():
120
+ """Different seeds should change surface parameters (company, date) but
121
+ the same ground truth findings should remain discoverable.
122
+ """
123
+ sc1 = get_scenario("medium_hiring_bias_001", seed=42)
124
+ sc2 = get_scenario("medium_hiring_bias_001", seed=12345)
125
+
126
+ # Ground truth findings must be identical
127
+ assert sc1.ground_truth_findings == sc2.ground_truth_findings
128
+ assert sc1.correct_classification == sc2.correct_classification
129
+
130
+ # At least some parameters must differ across seeds
131
+ params_differ = (
132
+ sc1.get_param("company") != sc2.get_param("company")
133
+ or sc1.get_param("version") != sc2.get_param("version")
134
+ or sc1.get_param("date") != sc2.get_param("date")
135
+ or sc1.get_param("usercount") != sc2.get_param("usercount")
136
+ )
137
+ assert params_differ, "Randomized parameters should differ across seeds"
138
+
139
+
140
+ # ── Test 6: Randomization appears in rendered documents ───────────
141
+
142
+ def test_randomization_in_rendered_documents():
143
+ """Rendered documents should contain randomized parameters, not placeholders."""
144
+ env = ComplianceAuditorEnvironment()
145
+ env.reset(seed=42, scenario_id="medium_hiring_bias_001")
146
+
147
+ result = json.loads(env._tool_fns["get_system_overview"]())
148
+ content = result["content"]
149
+
150
+ # Should NOT contain raw placeholders
151
+ assert "__COMPANY__" not in content, "Placeholder __COMPANY__ should be replaced"
152
+ assert "__VERSION__" not in content, "Placeholder __VERSION__ should be replaced"
153
+
154
+ # Should contain actual randomized values
155
+ assert "v" in content.lower(), "Should contain version number"
156
+
157
+
+ # ── Test 7: Dynamic audit progress appears after findings ─────────
+
+ def test_dynamic_audit_progress():
+     """After findings are submitted, subsequent tool calls should include
+     an audit progress section showing what has been found.
+     """
+     env = ComplianceAuditorEnvironment()
+     env.reset(seed=42, scenario_id="medium_hiring_bias_001")
+
+     # Before any findings — no progress section
+     r1 = json.loads(env._tool_fns["get_system_overview"]())
+     assert "AUDIT PROGRESS" not in r1["content"], \
+         "No progress section before any actions"
+
+     # After classification and a finding
+     env._tool_fns["classify_system"](risk_category="high_risk")
+     env._tool_fns["check_documentation"]()
+     env._tool_fns["submit_finding"](finding="test_finding", severity="high")
+
+     r2 = json.loads(env._tool_fns["audit_training_data"]())
+     assert "AUDIT PROGRESS" in r2["content"], \
+         "Progress section should appear after findings submitted"
+     assert "test_finding" in r2["content"], \
+         "Progress should show submitted findings"
+     assert "High Risk" in r2["content"], \
+         "Progress should show submitted classification"
+
+
+ # ── Test 8: Scenarios use several distinct graph topologies ───────
+
+ def test_graph_diversity():
+     """At least 4 distinct graph topologies across the registered scenarios."""
+     sigs = set()
+     for sc_info in SCENARIO_LIST:
+         sc = get_scenario(sc_info["id"], 42)
+         sig = tuple(sorted(
+             (t.from_state, t.to_state, t.tool_name, t.outcome)
+             for t in sc.graph.transitions
+         ))
+         sigs.add(sig)
+     assert len(sigs) >= 4, f"Only {len(sigs)} unique graph topologies — need at least 4"
+
+
+ # ── Test 9: Prohibited scenario does not reveal classification ────
+
+ def test_prohibited_scenario_concealment():
+     """The prohibited system's overview should NOT reveal it's prohibited.
+     The deployer frames it as a wellness tool — agent must discover the truth.
+     """
+     env = ComplianceAuditorEnvironment()
+     env.reset(seed=42, scenario_id="hard_social_scoring_prohibited_001")
+
+     result = json.loads(env._tool_fns["get_system_overview"]())
+     content = result["content"].lower()
+
+     assert "prohibited" not in content, \
+         "Overview should not reveal the system is prohibited — that's what the agent must discover"
+     assert "wellness" in content or "civic" in content or "engagement" in content, \
+         "Overview should use the deployer's framing (wellness/civic engagement)"
+
+
+ # ── Test 10: All scenarios produce valid tool responses ───────────
+
+ def test_all_scenarios_produce_rich_responses():
+     """Every scenario's investigation tools must return non-trivial content."""
+     for sc_info in SCENARIO_LIST:
+         env = ComplianceAuditorEnvironment()
+         env.reset(seed=42, scenario_id=sc_info["id"])
+
+         env._tool_fns["get_system_overview"]()
+         env._tool_fns["classify_system"](risk_category="high_risk")
+
+         for tool in ["check_documentation", "audit_training_data",
+                      "verify_human_oversight", "check_transparency",
+                      "assess_risk_management", "check_logging"]:
+             result = json.loads(env._tool_fns[tool]())
+             content = result.get("content", "")
+             assert len(content) > 100, \
+                 f"{sc_info['id']}/{tool}: document too short ({len(content)} chars)"
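The placeholder checks in Test 6 assume that documents are rendered from templates whose `__COMPANY__` and `__VERSION__` markers are filled from a seeded random source. The real renderer is not part of this diff, so the following is only a minimal sketch of that idea; `render_document`, `COMPANIES`, and the version format are hypothetical names and values chosen for illustration.

```python
import random

# Hypothetical value pool; the project's real parameter lists are not shown in this diff.
COMPANIES = ["Acme Talent", "Nordwind HR", "Lumen Hiring"]


def render_document(template: str, seed: int) -> str:
    """Replace __COMPANY__ and __VERSION__ placeholders with seed-derived values."""
    rng = random.Random(seed)  # local RNG: the same seed always renders the same text
    params = {
        "__COMPANY__": rng.choice(COMPANIES),
        "__VERSION__": f"v{rng.randint(1, 9)}.{rng.randint(0, 9)}",
    }
    for placeholder, value in params.items():
        template = template.replace(placeholder, value)
    return template


TEMPLATE = "__COMPANY__ deploys screening model __VERSION__ for candidate ranking."
rendered = render_document(TEMPLATE, seed=42)
```

Using a local `random.Random(seed)` instead of the module-level functions keeps rendering reproducible per seed and independent of any other randomness in the process.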
tests/test_procedural.py ADDED
@@ -0,0 +1,125 @@
+ """Procedural scenario generator tests.
+
+ Proves the generator produces valid, diverse, and unique scenarios
+ from any seed. This is the feature that makes the environment INFINITE.
+ """
+
+ import json
+ import pytest
+ from scenarios.procedural import generate_procedural_scenario, SYSTEM_TEMPLATES, VIOLATION_TEMPLATES
+ from scenarios.registry import get_scenario
+ from server.environment import ComplianceAuditorEnvironment
+
+
+ def test_generates_valid_scenario():
+     """Basic generation produces a well-formed AuditScenario."""
+     sc = generate_procedural_scenario(seed=42, difficulty="medium")
+     assert sc.scenario_id.startswith("procedural_")
+     assert sc.correct_classification in ("prohibited", "high_risk", "limited_risk", "minimal_risk")
+     assert len(sc.ground_truth_findings) >= 1
+     assert len(sc.graph.nodes) >= 6
+     assert sc.graph.optimal_path_length() >= 3
+
+
+ def test_difficulty_controls_violation_count():
+     """Easy scenarios average fewer violations than hard ones."""
+     easy_counts = []
+     hard_counts = []
+     for seed in range(20):
+         easy = generate_procedural_scenario(seed, "easy")
+         hard = generate_procedural_scenario(seed, "hard")
+         easy_counts.append(len(easy.ground_truth_findings))
+         hard_counts.append(len(hard.ground_truth_findings))
+
+     assert sum(easy_counts) / len(easy_counts) < sum(hard_counts) / len(hard_counts), \
+         "Hard scenarios should have more violations on average"
+
+
+ def test_different_seeds_produce_different_scenarios():
+     """50 seeds should yield at least 10 distinct scenarios."""
+     scenarios = {}
+     for seed in range(50):
+         sc = generate_procedural_scenario(seed, "medium")
+         key = (sc.system_name, tuple(sc.ground_truth_findings))
+         scenarios[seed] = key
+
+     unique = len(set(scenarios.values()))
+     assert unique >= 10, f"Only {unique} unique scenarios from 50 seeds — too little diversity"
+
+
+ def test_prohibited_systems_in_hard_mode():
+     """Hard difficulty should sometimes generate prohibited systems."""
+     has_prohibited = False
+     for seed in range(100):
+         sc = generate_procedural_scenario(seed, "hard")
+         if sc.correct_classification == "prohibited":
+             has_prohibited = True
+             break
+     assert has_prohibited, "Hard mode should occasionally generate prohibited systems"
+
+
+ def test_procedural_works_in_environment():
+     """Procedural scenarios work end-to-end through the environment."""
+     for seed in [1, 42, 100]:
+         env = ComplianceAuditorEnvironment()
+         obs = env.reset(seed=seed, scenario_id=f"procedural_medium_{seed}")
+
+         assert not env._done
+         assert env._scenario is not None
+
+         # Run basic audit
+         r = json.loads(env._tool_fns["get_system_overview"]())
+         assert "content" in r
+         assert len(r["content"]) > 100
+
+         env._tool_fns["classify_system"](risk_category="high_risk")
+         env._tool_fns["check_documentation"]()
+
+         result = json.loads(env._tool_fns["verify_compliance"](
+             risk_classification="high_risk",
+             overall_assessment="test",
+             key_findings_summary="test"
+         ))
+         assert 0.0 < result["reward"] < 1.0
+
+
+ def test_procedural_via_get_scenario():
+     """Procedural IDs work through the standard get_scenario interface."""
+     sc = get_scenario("procedural_easy_42")
+     assert sc.scenario_id.startswith("procedural_")
+     assert sc.difficulty == "easy"
+
+     sc2 = get_scenario("procedural_hard_999")
+     assert sc2.difficulty == "hard"
+     assert len(sc2.ground_truth_findings) >= len(sc.ground_truth_findings)
+
+
+ def test_all_system_types_reachable():
+     """Every system type template should be reachable from some seed."""
+     seen_systems = set()
+     for seed in range(200):
+         for diff in ["easy", "medium", "hard"]:
+             sc = generate_procedural_scenario(seed, diff)
+             # Key each system type on the last two words of its name
+             seen_systems.add(sc.system_name.split(" ")[-2] + " " + sc.system_name.split(" ")[-1])
+
+     assert len(seen_systems) >= len(SYSTEM_TEMPLATES), \
+         f"Only {len(seen_systems)} system types reached from 200 seeds — some unreachable"
+
+
+ def test_reward_bounds_procedural():
+     """All procedural scenarios produce rewards strictly inside (0, 1)."""
+     for seed in range(30):
+         for diff in ["easy", "medium", "hard"]:
+             env = ComplianceAuditorEnvironment()
+             env.reset(seed=seed, scenario_id=f"procedural_{diff}_{seed}")
+
+             env._tool_fns["get_system_overview"]()
+             env._tool_fns["classify_system"](risk_category="high_risk")
+
+             result = json.loads(env._tool_fns["verify_compliance"](
+                 risk_classification="high_risk",
+                 overall_assessment="test",
+                 key_findings_summary="test"
+             ))
+             assert 0.0 < result["reward"] < 1.0, \
+                 f"Reward {result['reward']} out of bounds @ seed={seed} diff={diff}"
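The procedural tests rely on two properties: the same seed always yields the same scenario, and harder difficulties carry more violations on average. The generator itself is not included in this diff, so the sketch below only illustrates one way those properties could hold together. `sketch_generate`, `VIOLATIONS`, and `COUNT_RANGE` are hypothetical and are not the project's `generate_procedural_scenario`.

```python
import random

# Hypothetical violation pool and per-difficulty count ranges, for illustration only.
VIOLATIONS = [
    "missing_logging", "no_human_oversight", "incomplete_documentation",
    "biased_training_data", "no_risk_management", "opaque_outputs",
]
COUNT_RANGE = {"easy": (1, 2), "medium": (2, 4), "hard": (4, 6)}


def sketch_generate(seed: int, difficulty: str) -> list:
    """Pick a deterministic set of violations for a (seed, difficulty) pair."""
    # Seeding with a string ties the RNG stream to both the seed and the difficulty.
    rng = random.Random(f"{difficulty}:{seed}")
    low, high = COUNT_RANGE[difficulty]
    return sorted(rng.sample(VIOLATIONS, rng.randint(low, high)))


# Same averaging check as test_difficulty_controls_violation_count, on the sketch.
easy_avg = sum(len(sketch_generate(s, "easy")) for s in range(20)) / 20
hard_avg = sum(len(sketch_generate(s, "hard")) for s in range(20)) / 20
```

Because the count ranges for easy and hard do not overlap in this sketch, the averaged comparison holds for any seed set; a real generator with overlapping ranges would only satisfy it statistically.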
tests/test_reward_hacking.py ADDED
@@ -0,0 +1,339 @@
+ """Adversarial reward-hacking tests.
+
+ Verifies the 6-component reward function cannot be gamed by common exploits:
+
+ 1. Skip investigation — submit findings without reading documents
+ 2. Spam findings — flood with every possible finding string
+ 3. Red herring bait — submit red herrings as findings
+ 4. Skip to verify — call verify_compliance immediately
+ 5. Wrong classification — still score well on other components
+ 6. Skip remediation — submit findings but no fixes
+ 7. Classify without overview — skip get_system_overview
+ 8. Fewer steps than optimal — game efficiency by rushing
+ 9. Hallucinate findings — submit plausible-sounding false findings
+ 10. Perfect findings but wrong classification — test component isolation
+
+ If any exploit scores above the expected cap, the reward function has a hole.
+ """
+
+ from server.engine import compute_reward, safe_reward, RewardBreakdown
+ from scenarios.registry import get_scenario
+
+
+ # ── Test helpers ──────────────────────────────────────────────────
+
+ def _score_scenario(
+     scenario_id: str = "medium_hiring_bias_001",
+     seed: int = 42,
+     classification: str = "",
+     findings: list = None,
+     remediation: list = None,
+     tool_sequence: list = None,
+     steps: int = 10,
+ ) -> RewardBreakdown:
+     sc = get_scenario(scenario_id, seed)
+     return compute_reward(
+         scenario=sc,
+         classification_submitted=classification,
+         findings_submitted=findings or [],
+         remediation_submitted=remediation or [],
+         tool_sequence=tool_sequence or [],
+         steps_taken=steps,
+     )
+
+
+ def _total(scenario_id="medium_hiring_bias_001", **kwargs):
+     return _score_scenario(scenario_id, **kwargs).total()
+
+
+ # ── Exploit 1: Skip investigation, submit correct findings ────────
+
+ def test_skip_investigation_penalizes_methodology():
+     """Agent submits perfect findings without calling any investigation tools.
+     Must lose methodology score (15%) — no investigation tools in sequence.
+     """
+     r = _score_scenario(
+         classification="high_risk",
+         findings=[
+             "gender_bias_in_technical_screening",
+             "insufficient_human_oversight",
+             "incomplete_technical_documentation",
+             "missing_fundamental_rights_impact_assessment",
+             "inadequate_data_governance",
+         ],
+         remediation=[
+             "conduct_bias_audit_and_mitigation",
+             "implement_human_review_for_all_rejections",
+             "complete_annex_iv_technical_documentation",
+             "perform_fundamental_rights_impact_assessment",
+             "establish_data_governance_framework",
+         ],
+         tool_sequence=["submit_finding"] * 5 + ["recommend_fix"] * 5 + ["verify_compliance"],
+         steps=11,
+     )
+     assert r.methodology == 0.0, f"methodology should be 0 without investigation tools: {r.methodology}"
+     # Even with perfect findings, losing methodology (15%) is significant
+     assert r.total() < 0.86, f"skip-investigation still scores too high: {r.total()}"
+
+
+ # ── Exploit 2: Spam every possible finding string ─────────────────
+
+ def test_spam_findings_penalizes_precision():
+     """Agent submits 16 findings hoping to hit all ground truth by volume.
+     False positives should destroy finding_precision (15%).
+     """
+     r = _score_scenario(
+         classification="high_risk",
+         findings=[
+             "gender_bias_in_technical_screening",  # real
+             "insufficient_human_oversight",  # real
+             "incomplete_technical_documentation",  # real
+             "missing_fundamental_rights_impact_assessment",  # real
+             "inadequate_data_governance",  # real
+             # False positives below
+             "prohibited_social_scoring",  # red herring
+             "biometric_processing",  # red herring
+             "gdpr_violation", "data_breach", "missing_ce_marking",
+             "environmental_impact", "accessibility_violation",
+             "patent_infringement", "labor_law_violation",
+             "antitrust_concern", "tax_evasion",
+         ],
+         tool_sequence=["get_system_overview", "classify_system"] + ["submit_finding"] * 16,
+         steps=18,
+     )
+     # 5 true positives out of 16 = low precision
+     assert r.finding_precision < 0.5, f"spam precision too high: {r.finding_precision}"
+
+
+ # ── Exploit 3: Submit red herrings as findings ────────────────────
+
+ def test_red_herring_submission_penalized():
+     """Agent falls for red herrings and submits them as findings.
+     Must be penalized via finding_precision component.
+     """
+     r = _score_scenario(
+         classification="high_risk",
+         findings=[
+             "prohibited_social_scoring",  # red herring
+             "biometric_processing",  # red herring
+         ],
+         tool_sequence=["get_system_overview", "classify_system", "submit_finding", "submit_finding"],
+         steps=4,
+     )
+     # Two false positives, zero true positives
+     assert r.finding_precision == 0.0, f"red herrings should give 0 precision: {r.finding_precision}"
+     assert r.finding_completeness == 0.0, f"no real findings means 0 completeness: {r.finding_completeness}"
+     assert r.total() < 0.35, f"red-herring-only agent scores too high: {r.total()}"
+
+
+ # ── Exploit 4: Skip straight to verify_compliance ─────────────────
+
+ def test_immediate_verify_scores_near_floor():
+     """Agent calls verify_compliance with empty inputs on step 1.
+     Should score near the minimum (only safe_reward floor).
+     """
+     r = _score_scenario(
+         classification="",
+         findings=[],
+         remediation=[],
+         tool_sequence=["verify_compliance"],
+         steps=1,
+     )
+     assert r.classification == 0.0
+     assert r.finding_completeness == 0.0
+     assert r.finding_precision == 0.0
+     assert r.remediation == 0.0
+     assert r.methodology == 0.0
+     assert r.total() < 0.05, f"empty verify should be near floor: {r.total()}"
+
+
+ # ── Exploit 5: Wrong classification but perfect everything else ────
+
+ def test_wrong_classification_costs_20_percent():
+     """Agent gets everything right except classification.
+     Classification carries a 20% weight; the total must drop by at least 0.10.
+     """
+     r_correct = _score_scenario(
+         classification="high_risk",
+         findings=["gender_bias_in_technical_screening", "insufficient_human_oversight",
+                   "incomplete_technical_documentation", "missing_fundamental_rights_impact_assessment",
+                   "inadequate_data_governance"],
+         remediation=["conduct_bias_audit_and_mitigation", "implement_human_review_for_all_rejections",
+                      "complete_annex_iv_technical_documentation", "perform_fundamental_rights_impact_assessment",
+                      "establish_data_governance_framework"],
+         tool_sequence=["get_system_overview", "classify_system", "check_documentation",
+                        "audit_training_data", "verify_human_oversight", "check_transparency",
+                        "assess_risk_management", "check_logging", "submit_finding", "submit_finding",
+                        "submit_finding", "submit_finding", "submit_finding",
+                        "recommend_fix", "recommend_fix", "recommend_fix",
+                        "recommend_fix", "recommend_fix", "verify_compliance"],
+         steps=19,
+     )
+     r_wrong = _score_scenario(
+         classification="minimal_risk",  # WRONG — should be high_risk
+         findings=["gender_bias_in_technical_screening", "insufficient_human_oversight",
+                   "incomplete_technical_documentation", "missing_fundamental_rights_impact_assessment",
+                   "inadequate_data_governance"],
+         remediation=["conduct_bias_audit_and_mitigation", "implement_human_review_for_all_rejections",
+                      "complete_annex_iv_technical_documentation", "perform_fundamental_rights_impact_assessment",
+                      "establish_data_governance_framework"],
+         tool_sequence=["get_system_overview", "classify_system", "check_documentation",
+                        "audit_training_data", "verify_human_oversight", "check_transparency",
+                        "assess_risk_management", "check_logging", "submit_finding", "submit_finding",
+                        "submit_finding", "submit_finding", "submit_finding",
+                        "recommend_fix", "recommend_fix", "recommend_fix",
+                        "recommend_fix", "recommend_fix", "verify_compliance"],
+         steps=19,
+     )
+     gap = r_correct.total() - r_wrong.total()
+     assert gap >= 0.10, f"wrong classification gap too small: {gap:.4f} (correct={r_correct.total():.4f}, wrong={r_wrong.total():.4f})"
+
+
+ # ── Exploit 6: Perfect findings but zero remediation ──────────────
+
+ def test_no_remediation_loses_15_percent():
+     """Agent identifies all findings but proposes no remediation.
+     Must lose the full 15% remediation weight.
+     """
+     r = _score_scenario(
+         classification="high_risk",
+         findings=["gender_bias_in_technical_screening", "insufficient_human_oversight",
+                   "incomplete_technical_documentation", "missing_fundamental_rights_impact_assessment",
+                   "inadequate_data_governance"],
+         remediation=[],  # no remediation!
+         tool_sequence=["get_system_overview", "classify_system", "check_documentation",
+                        "audit_training_data", "verify_human_oversight", "check_transparency",
+                        "assess_risk_management", "check_logging",
+                        "submit_finding", "submit_finding", "submit_finding",
+                        "submit_finding", "submit_finding", "verify_compliance"],
+         steps=14,
+     )
+     assert r.remediation == 0.0, f"no remediation should give 0: {r.remediation}"
+
+
+ # ── Exploit 7: Classify without overview ──────────────────────────
+
+ def test_classify_before_overview_penalizes_methodology():
+     """Agent classifies before gathering system overview.
+     Investigation order should be penalized in methodology.
+     """
+     r = _score_scenario(
+         classification="high_risk",
+         findings=["gender_bias_in_technical_screening"],
+         tool_sequence=["classify_system", "get_system_overview", "submit_finding"],
+         steps=3,
+     )
+     # classify_system before get_system_overview is an order violation
+     assert r.methodology < 0.5, f"wrong order methodology too high: {r.methodology}"
+
+
+ # ── Exploit 8: Fewer steps than optimal games efficiency ──────────
+
+ def test_fewer_steps_than_optimal_penalized():
+     """Agent takes fewer steps than the optimal path.
+     This means skipping investigation — efficiency should be penalized.
+     """
+     r_rushed = _score_scenario(
+         classification="high_risk",
+         findings=["gender_bias_in_technical_screening"],
+         tool_sequence=["verify_compliance"],
+         steps=2,  # way fewer than optimal
+     )
+     r_proper = _score_scenario(
+         classification="high_risk",
+         findings=["gender_bias_in_technical_screening"],
+         tool_sequence=["get_system_overview", "classify_system", "check_documentation",
+                        "audit_training_data", "verify_human_oversight",
+                        "check_transparency", "assess_risk_management", "check_logging",
+                        "submit_finding", "verify_compliance"],
+         steps=12,
+     )
+     assert r_proper.efficiency > r_rushed.efficiency, \
+         f"rushed ({r_rushed.efficiency}) should not beat proper ({r_proper.efficiency})"
+
+
+ # ── Exploit 9: Hallucinate plausible-sounding findings ────────────
+
+ def test_hallucinated_findings_low_precision():
+     """Agent submits plausible-sounding but wrong findings.
+     Token-based matching should not match these.
+     """
+     r = _score_scenario(
+         classification="high_risk",
+         findings=[
+             "ai_model_lacks_interpretability",
+             "no_audit_trail_for_decisions",
+             "potential_discrimination_in_outputs",
+             "insufficient_testing_methodology",
+         ],
+         tool_sequence=["get_system_overview", "classify_system", "check_documentation",
+                        "submit_finding", "submit_finding", "submit_finding", "submit_finding"],
+         steps=7,
+     )
+     # These don't token-match the ground truth findings
+     assert r.finding_completeness < 0.4, f"hallucinated findings match too well: {r.finding_completeness}"
+
+
+ # ── Exploit 10: Perfect on prohibited scenario with wrong class ───
+
+ def test_prohibited_classified_as_high_risk():
+     """Agent correctly finds violations but classifies prohibited as high_risk.
+     Partial classification match should give 40% credit, not full.
+     """
+     r = _score_scenario(
+         scenario_id="hard_social_scoring_prohibited_001",
+         classification="high_risk",  # wrong — should be prohibited
+         findings=["prohibited_social_scoring_system", "disguised_as_voluntary_wellness",
+                   "affects_access_to_public_services"],
+         tool_sequence=["get_system_overview", "classify_system", "submit_finding",
+                        "submit_finding", "submit_finding", "verify_compliance"],
+         steps=6,
+     )
+     assert r.classification == 0.4, f"adjacent classification should be 0.4: {r.classification}"
+
+
+ # ── Sanity: perfect run on medium hiring ──────────────────────────
+
+ def test_perfect_run_scores_high():
+     """A perfect audit should score above 0.85."""
+     r = _score_scenario(
+         classification="high_risk",
+         findings=["gender_bias_in_technical_screening", "insufficient_human_oversight",
+                   "incomplete_technical_documentation", "missing_fundamental_rights_impact_assessment",
+                   "inadequate_data_governance"],
+         remediation=["conduct_bias_audit_and_mitigation", "implement_human_review_for_all_rejections",
+                      "complete_annex_iv_technical_documentation",
+                      "perform_fundamental_rights_impact_assessment",
+                      "establish_data_governance_framework"],
+         tool_sequence=["get_system_overview", "classify_system", "check_documentation",
+                        "audit_training_data", "verify_human_oversight", "check_transparency",
+                        "assess_risk_management", "check_logging",
+                        "submit_finding", "submit_finding", "submit_finding",
+                        "submit_finding", "submit_finding",
+                        "recommend_fix", "recommend_fix", "recommend_fix",
+                        "recommend_fix", "recommend_fix",
+                        "verify_compliance"],
+         steps=19,
+     )
+     assert r.total() > 0.85, f"perfect run too low: {r.total()}"
+     assert r.classification == 1.0
+     assert r.methodology > 0.8
+
+
+ # ── Bounds: all rewards strictly in (0, 1) ────────────────────────
+
+ def test_reward_bounds_all_scenarios():
+     """Every scenario × various inputs must produce a reward strictly inside (0, 1)."""
+     from scenarios.registry import SCENARIO_LIST
+     for sc_info in SCENARIO_LIST:
+         for cls in ["", "prohibited", "high_risk", "limited_risk", "minimal_risk", "garbage"]:
+             for findings in [[], ["some_finding"], ["a", "b", "c", "d", "e", "f"]]:
+                 r = _total(
+                     scenario_id=sc_info["id"],
+                     classification=cls,
+                     findings=findings,
+                     tool_sequence=["verify_compliance"],
+                     steps=1,
+                 )
+                 assert 0.0 < r < 1.0, \
+                     f"out of range: {r} @ {sc_info['id']} cls={cls} findings={len(findings)}"
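The spam-findings exploit is easiest to follow with the arithmetic written out: 5 true positives among 16 submissions give a precision of 5/16, while completeness stays at 1.0. The sketch below assumes exact string matching, which is a simplification, since the tests above describe the real scorer as token-based. `precision_and_completeness` and `clamp_reward` are hypothetical helpers, not the project's `compute_reward` or `safe_reward`; the 0.001 and 0.999 bounds are taken from the test docstrings.

```python
def precision_and_completeness(submitted, ground_truth):
    """Return (precision, completeness) using exact string matching (a simplification)."""
    submitted_set, truth_set = set(submitted), set(ground_truth)
    true_positives = len(submitted_set & truth_set)
    precision = true_positives / len(submitted_set) if submitted_set else 0.0
    completeness = true_positives / len(truth_set) if truth_set else 0.0
    return precision, completeness


def clamp_reward(value, low=0.001, high=0.999):
    """Keep a reward strictly inside (0, 1); bounds assumed from the docstrings."""
    return max(low, min(high, value))


# Five real findings plus eleven false positives, mirroring the spam exploit.
truth = [f"real_{i}" for i in range(5)]
spam = truth + [f"noise_{i}" for i in range(11)]
precision, completeness = precision_and_completeness(spam, truth)
```

Scoring precision separately from completeness is what makes flooding unprofitable: every extra guess can only lower precision once all true findings are already covered.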
tests/test_stress.py ADDED
@@ -0,0 +1,94 @@
+ """Stress tests — prove robustness across many random seeds and scenarios.
+
+ Runs 50 seeds × 9 scenarios = 450 episodes to verify:
+ 1. Every scenario instantiates without error
+ 2. Every tool returns valid JSON with content
+ 3. Every reward is strictly inside (0, 1)
+ 4. Overviews vary across seeds (randomization works)
+ 5. State graphs are consistent across seeds
+ 6. Adaptive depth works on the flagship scenario
+ """
+
+ import json
+ import pytest
+ from server.environment import ComplianceAuditorEnvironment
+ from scenarios.registry import get_scenario, SCENARIO_LIST
+
+
+ SEEDS = list(range(1, 51))  # 50 seeds
+
+
+ @pytest.mark.parametrize("scenario_info", SCENARIO_LIST, ids=lambda s: s["id"])
+ def test_all_seeds_produce_valid_episodes(scenario_info):
+     """Every seed × scenario produces valid tool responses and bounded reward."""
+     sid = scenario_info["id"]
+     seen_overviews = set()
+
+     for seed in SEEDS:
+         env = ComplianceAuditorEnvironment()
+         env.reset(seed=seed, scenario_id=sid)
+
+         # Call overview
+         overview = json.loads(env._tool_fns["get_system_overview"]())
+         assert "content" in overview, f"seed={seed}: overview missing content"
+         assert len(overview["content"]) > 50, f"seed={seed}: overview too short"
+         # Use a section that contains randomized params (company, version appear in middle)
+         seen_overviews.add(overview["content"][50:200])
+
+         # Classify
+         env._tool_fns["classify_system"](risk_category="high_risk")
+
+         # Call one investigation tool
+         doc = json.loads(env._tool_fns["check_documentation"]())
+         assert "content" in doc, f"seed={seed}: doc missing content"
+         assert len(doc["content"]) > 100, f"seed={seed}: doc too short"
+
+         # Submit finding + verify
+         env._tool_fns["submit_finding"](finding="test_finding")
+         result = json.loads(env._tool_fns["verify_compliance"](
+             risk_classification="high_risk",
+             overall_assessment="test",
+             key_findings_summary="test"
+         ))
+
+         reward = result["reward"]
+         assert 0.0 < reward < 1.0, f"seed={seed}: reward {reward} out of bounds"
+
+     # Randomization: across 50 seeds, we should see at least 3 unique overviews
+     assert len(seen_overviews) >= 3, \
+         f"Only {len(seen_overviews)} unique overviews across 50 seeds — randomization may be broken"
+
+
+ @pytest.mark.parametrize("scenario_info", SCENARIO_LIST, ids=lambda s: s["id"])
+ def test_graph_consistency_across_seeds(scenario_info):
+     """State graph topology must be identical regardless of seed."""
+     sid = scenario_info["id"]
+     base_graph = None
+     for seed in [1, 42, 100, 999]:
+         sc = get_scenario(sid, seed)
+         sig = tuple(sorted(
+             (t.from_state, t.to_state, t.tool_name, t.outcome)
+             for t in sc.graph.transitions
+         ))
+         if base_graph is None:
+             base_graph = sig
+         else:
+             assert sig == base_graph, f"Graph differs at seed={seed}"
+
+
+ def test_adaptive_depth_on_medium_hiring():
+     """Repeat calls reveal deeper content on the flagship scenario."""
+     env = ComplianceAuditorEnvironment()
+     env.reset(seed=42, scenario_id="medium_hiring_bias_001")
+     env._tool_fns["get_system_overview"]()
+     env._tool_fns["classify_system"](risk_category="high_risk")
+
+     r1 = json.loads(env._tool_fns["audit_training_data"]())
+     r2 = json.loads(env._tool_fns["audit_training_data"]())
+
+     assert len(r2["content"]) > len(r1["content"]), \
+         "Second call should reveal deeper content"
+     assert "DEEP DIVE" in r2["content"], \
+         "Second call should contain forensic deep dive"
+     assert "note" in r2, \
+         "Second call should have a note about deep dive"
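Both the graph-diversity and graph-consistency tests compare scenarios through a sorted tuple of their transitions, so two graphs count as the same topology whenever they contain the same edges, regardless of storage order. A minimal standalone sketch of that signature follows; the `Transition` dataclass here is a hypothetical stand-in that only mirrors the four attributes the tests read (`from_state`, `to_state`, `tool_name`, `outcome`), and the state names are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    # Hypothetical stand-in with the four fields the tests access.
    from_state: str
    to_state: str
    tool_name: str
    outcome: str


def topology_signature(transitions):
    """Order-independent fingerprint of a state graph's edges."""
    return tuple(sorted(
        (t.from_state, t.to_state, t.tool_name, t.outcome) for t in transitions
    ))


edge_a = Transition("start", "overview", "get_system_overview", "ok")
edge_b = Transition("overview", "classified", "classify_system", "ok")
edge_c = Transition("overview", "classified", "classify_system", "wrong")

same_1 = topology_signature([edge_a, edge_b])
same_2 = topology_signature([edge_b, edge_a])  # same edges, different order
different = topology_signature([edge_a, edge_c])
```

Sorting before converting to a tuple yields a hashable value, which is why the tests can collect signatures in a `set` and count distinct topologies directly.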