An MCP-based environment where LLM agents audit AI systems for EU AI Act compliance, from risk classification through violation identification to remediation planning. Scenarios are based on real regulatory articles. Parameter randomization on every reset prevents memorization: agents must learn the audit process, not specific answers.
Why This Environment
The EU AI Act's major enforcement deadline is August 2, 2026, less than 4 months away. Companies deploying AI in Europe face fines of up to EUR 35 million or 7% of global annual turnover for the most serious violations. Yet no automated compliance auditing benchmark exists. This environment fills that gap with 8 realistic scenarios across the full spectrum of EU AI Act risk categories.
Stats
| Metric | Value |
| --- | --- |
| Scenarios | 8 |
| MCP Tools | 11 |
| Reward Components | 6 |
| Difficulty Tiers | 3 (easy / medium / hard) |
| State Graph Nodes | 12 per scenario |
| Parameter Randomization | Company, region, version, dates per reset |
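To make the per-reset randomization concrete, here is a minimal sketch of how surface details could be resampled on each reset. The value pools, the `ScenarioParams` fields, and the `sample_params` helper are all assumptions for illustration; the environment's actual pools and field names are not given here.

```python
import random
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical value pools; the environment's real pools are not documented here.
COMPANIES = ["Acme Analytics", "Northwind AI", "Globex Systems"]
REGIONS = ["Germany", "France", "Spain", "Netherlands"]


@dataclass(frozen=True)
class ScenarioParams:
    """Surface details that change per reset (names are assumptions)."""
    company: str
    region: str
    version: str
    deployment_date: date


def sample_params(rng: random.Random) -> ScenarioParams:
    """Draw one set of surface details for a scenario reset."""
    version = f"{rng.randint(1, 4)}.{rng.randint(0, 9)}.{rng.randint(0, 9)}"
    deployed = date(2024, 1, 1) + timedelta(days=rng.randrange(0, 730))
    return ScenarioParams(
        company=rng.choice(COMPANIES),
        region=rng.choice(REGIONS),
        version=version,
        deployment_date=deployed,
    )


# The same seed reproduces the same draw, which keeps evaluation runs repeatable.
a = sample_params(random.Random(0))
b = sample_params(random.Random(0))
```

Only the surface details vary in this sketch. The underlying violations of a scenario would stay fixed, so an agent cannot score well by memorizing a company name or version string.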
Tools (MCP Interface)
Investigation
| Tool | Description |
| --- | --- |
| get_system_overview | Gather system description, deployer info, deployment context |
| | Review Annex IV technical documentation completeness |
| audit_training_data | Check bias, representativeness, data governance (Article 10) |
| verify_human_oversight | Verify Article 14 human-in-the-loop mechanisms |
| check_transparency | Check Article 50 transparency obligations |
| assess_risk_management | Review risk management system (Article 9) |
| check_logging | Verify automatic logging and traceability (Article 12) |
Resolution
| Tool | Description |
| --- | --- |
| submit_finding | Report a compliance violation (call per finding) |
| recommend_fix | Propose remediation with priority |
| verify_compliance | Final determination; triggers terminal reward |
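The two tool groups suggest an audit order: investigate first, then report each finding with a fix, then close with the final determination. The sketch below illustrates that order against a stand-in `call_tool` function. The argument names passed to each tool (`article`, `description`, `fix`, `priority`, `compliant`) are hypothetical, since the tool schemas are not given here, and only tools named above are called.

```python
from typing import Any, Callable

# Investigation tools named in this README, in listed order.
INVESTIGATION_TOOLS = [
    "get_system_overview",
    "audit_training_data",
    "verify_human_oversight",
    "check_transparency",
    "assess_risk_management",
    "check_logging",
]


def run_audit(call_tool: Callable[[str, dict], Any], findings: list[dict]) -> list[str]:
    """Investigate, report each finding with a fix, then finish the episode.

    Returns the ordered list of tool names that were called.
    """
    trace = []
    for name in INVESTIGATION_TOOLS:
        call_tool(name, {})  # argument schema unknown; an empty dict is assumed
        trace.append(name)
    for f in findings:
        # One submit_finding call per violation, as the tool description requires.
        call_tool("submit_finding", {"article": f["article"], "description": f["description"]})
        trace.append("submit_finding")
        call_tool("recommend_fix", {"fix": f["fix"], "priority": f["priority"]})
        trace.append("recommend_fix")
    # verify_compliance ends the episode, so it is called last.
    call_tool("verify_compliance", {"compliant": not findings})
    trace.append("verify_compliance")
    return trace


# Stand-in for an MCP client: records calls instead of contacting a server.
calls = []

def fake_call_tool(name: str, args: dict) -> dict:
    calls.append((name, args))
    return {"ok": True}


trace = run_audit(
    fake_call_tool,
    [{
        "article": "Article 50",
        "description": "No AI disclosure shown to users",
        "fix": "Add a clear AI disclosure at the start of each conversation",
        "priority": "high",
    }],
)
```

In a real agent the findings would be derived from the investigation tool outputs rather than passed in up front; they are supplied here only to keep the sketch short.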
Scenarios
Easy
Customer Service Chatbot: Limited-risk system missing AI disclosure (Article 50)
Music Recommendation Engine: Minimal-risk system needing voluntary code of conduct
Medium
AI Resume Screener: High-risk hiring AI (Annex III) with gender bias, missing oversight, incomplete documentation
Credit Scoring Model: High-risk fintech system with opaque features and no right to human review
Emergency Triage AI: Medical device with age bias and no prospective clinical validation
Hard
Citizen Wellness App: Prohibited social scoring system disguised as a voluntary wellness tool. The agent must identify it as prohibited under Article 5(1)(c).
AI Content Studio: Deepfake generation platform missing all Article 50 transparency obligations
Corporate AI Portfolio: Multi-system audit with 4 interconnected AI systems sharing a data lake. The agent must identify compound risks and cross-system data flow issues.
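As a compact view of the list above, the scenarios can be held in a small catalog keyed by difficulty tier. This is a sketch only: the `Scenario` dataclass, its field names, and the `by_tier` helper are assumptions, and the `reference` field is filled only where the list above cites an article or annex.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    tier: str       # "easy", "medium", or "hard"
    reference: str  # article or annex cited in the list above, "" if none is cited


SCENARIOS = [
    Scenario("Customer Service Chatbot", "easy", "Article 50"),
    Scenario("Music Recommendation Engine", "easy", ""),
    Scenario("AI Resume Screener", "medium", "Annex III"),
    Scenario("Credit Scoring Model", "medium", ""),
    Scenario("Emergency Triage AI", "medium", ""),
    Scenario("Citizen Wellness App", "hard", "Article 5(1)(c)"),
    Scenario("AI Content Studio", "hard", "Article 50"),
    Scenario("Corporate AI Portfolio", "hard", ""),
]


def by_tier(tier: str) -> list[Scenario]:
    """Return all scenarios in one difficulty tier, in catalog order."""
    return [s for s in SCENARIOS if s.tier == tier]


# Count scenarios per tier as a quick consistency check against the stats.
tier_counts = Counter(s.tier for s in SCENARIOS)
```

A catalog like this makes it easy to check that the tier split adds up to the 8 scenarios reported in the stats.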
6-Component Reward
| Component | Weight | Description |
| --- | --- | --- |
| Classification | 20% | Correct risk category identification |
| Finding Completeness | 25% | Recall of ground-truth violations |
| Finding Precision | 15% | Penalty for false positives / red herring findings |