---
title: EU AI Act Compliance Auditor
emoji: "🏛"
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
tags:
  - openenv
---

# EU AI Act Compliance Auditor

An MCP environment where LLM agents audit AI systems for EU AI Act compliance. Tools return **investigation-grade regulatory documents** — statistical tables, documentation inventories, operational procedures — that require genuine analysis to identify violations. No pre-digested verdicts. The agent must reason about evidence across documents to find compliance gaps.

## What Makes This Different

Most compliance environments hand the agent pre-labeled answers: `"bias_assessment": "FAILED"`. This environment returns the **raw evidence**:

```
CALLBACK RATES BY DEMOGRAPHIC (Technical Roles Only):
  Group               Rate     vs Baseline
  Male applicants     34.2%    (baseline)
  Female applicants   26.3%    -23.1%
  Eastern EU          27.4%    -19.9%
```

The agent must identify the 23% callback disparity from the table, recognize it as gender bias, cross-reference with the oversight document showing only 5% of rejections are reviewed, and connect these into actionable findings.

## Stats

| Metric | Value |
|--------|-------|
| Fixed Scenarios | 9 across 3 difficulty tiers |
| Procedural Scenarios | Infinite (seed-based generation) |
| MCP Tools | 11 (8 investigation + 3 resolution) |
| Reward Components | 6 (weighted, anti-gaming) |
| Graph Topologies | 6 unique per-scenario |
| Document Depth | 500-3,275 chars per tool response |
| Total Document Content | 77K+ chars across all scenarios |
| Anti-Gaming Tests | 12 adversarial exploits proven ineffective |
| Test Suite | 74 tests across 8 files |
| Adaptive Depth | Repeat tool calls reveal forensic deep-dive |
| Dynamic State | Environment reacts to findings and remediations |
| Parameter Randomization | Company, region, version, dates per reset |

## Scenarios

### Easy (2) — Clear-cut systems, focused investigation
- **Customer Service Chatbot** — Limited-risk. Missing AI disclosure under Article 50. Agent checks transparency and oversight.
- **Music Recommendation Engine** — Minimal-risk. Voluntary code of conduct recommended. Short investigation path.

### Medium (3) — Statistical evidence, red herrings, multi-article violations
- **AI Resume Screener** — High-risk hiring AI (Annex III). 5 findings: gender bias (23% callback gap), insufficient oversight (5% review rate), missing FRIA, incomplete Annex IV docs, data governance gaps.
- **Credit Scoring Model** — High-risk fintech. Opaque alternative data features (social media, device metadata), no right to human review, missing conformity assessment.
- **Emergency Triage AI** — Medical device dual-regulation (MDR + AI Act). Age bias in 75+ cohort (76.3% sensitivity), retrospective-only validation, no real-time monitoring.
- **Workplace Emotion Recognition** — **PROHIBITED** under Article 5(1)(f). Webcam-based "engagement analytics" that's actually emotion recognition. Deployer frames it as productivity tool — agent must recognize it processes biometric data (facial action units, micro-expressions) without medical/safety exception.

### Hard (3) — Disguised systems, compound risks, multi-system dependencies
- **Citizen Wellness App** — **PROHIBITED** social scoring disguised as voluntary wellness tool. Deployer frames it as gamification, but investigation reveals it controls access to public services based on social behavior scores. Agent must see through the framing.
- **AI Content Studio** — Deepfake generation platform. Missing all Article 50 content labeling, no C2PA watermarking, no content provenance. Political content generated without disclosure.
- **Corporate AI Portfolio** — 4 interconnected AI systems sharing a data lake. Agent must identify cross-system data flows amplifying risks, recognize employee sentiment analysis as high-risk, and spot biometric categorization in safety monitoring.

## Procedural Scenario Generator

Beyond the 9 hand-crafted scenarios, a seed-based procedural generator produces **infinite unique scenarios** by combining:

- **5 system types**: Drone delivery (critical infrastructure), exam proctoring (education), insurance adjudication (essential services), legal research (limited risk), predictive policing (prohibited)
- **16 violation templates**: Gender bias, age discrimination, data governance gaps, missing conformity, logging inadequacies, and more
- **5 red herring templates**: GDPR confusion, compliant sibling systems, ISO certifications, voluntary ethics boards

```python
# Any seed produces a unique, coherent scenario
env.reset(scenario_id="procedural_medium_42")   # Seed 42, medium difficulty
env.reset(scenario_id="procedural_hard_12345")  # Seed 12345, hard difficulty
```

Each generated scenario has proper ground truth findings, matching state graph, violation-specific documents, and is fully compatible with the 6-component reward function.

## Action & Observation Spaces

### Action (ComplianceAction)
```python
class ComplianceAction(Action):
    tool_name: str    # Name of the audit tool to call
    arguments: dict   # Tool arguments as JSON (e.g. {"risk_category": "high_risk"})
```

### Observation (ComplianceObservation)
```python
class ComplianceObservation(Observation):
    done: bool          # Whether the episode is complete
    reward: float       # Current step reward (terminal on verify_compliance)
    metadata: dict      # Tool response content, audit context
    queries_remaining: int
```

### State (ComplianceState)
```python
class ComplianceState(BaseModel):
    episode_id: str
    step_count: int
    scenario_id: str
    difficulty: str           # easy / medium / hard
    queries_used: int
    findings_count: int
    compliance_verified: bool
    current_reward: float
```

## Tools

### Investigation
| Tool | Returns |
|------|---------|
| `get_system_overview` | Formal audit assignment brief with system description and deployment context |
| `classify_system` | Records risk classification (prohibited / high_risk / limited_risk / minimal_risk) |
| `check_documentation` | Annex IV cross-reference table with per-section compliance status |
| `audit_training_data` | Demographic statistics tables, data governance assessment, bias indicators |
| `verify_human_oversight` | Operational procedures extract with review statistics and override capabilities |
| `check_transparency` | User-facing UI/ToS text analysis with Article 50 compliance indicators |
| `assess_risk_management` | Risk register, conformity assessment tracker, Annex III classification analysis |
| `check_logging` | Audit log schema, Article 12 requirements gap analysis |

### Resolution
| Tool | Purpose |
|------|---------|
| `submit_finding` | Report a compliance violation (call once per finding) |
| `recommend_fix` | Propose remediation with priority |
| `verify_compliance` | Final determination — triggers terminal 6-component reward |

## 6-Component Reward

| Component | Weight | Anti-Gaming |
|-----------|--------|-------------|
| Classification | 20% | Adjacent-category partial credit (40%). Wrong by 2+ categories = 0. |
| Finding Completeness | 25% | Token-based fuzzy matching (Jaccard 40%, min 2 tokens). Prevents keyword stuffing. |
| Finding Precision | 15% | Red herring submissions penalized 15% each. False positives reduce score. |
| Remediation Quality | 15% | Presence (70%) + priority ordering (30%). Missing remediation = 0. |
| Methodology | 15% | Order violations penalized. Skipping investigation tools = 0. |
| Efficiency | 10% | Fewer steps than optimal = penalty (skipping investigation). More steps = diminishing returns. |

All rewards clamped to (0.001, 0.999). 12 adversarial tests prove robustness.

## Architecture

```
compliance_env/
  server/
    environment.py      # MCP environment, 11 tools, dynamic audit state
    engine.py           # State graph + 6-component reward computation
    app.py              # FastAPI + HTTP session API + Gradio UI
    gradio_landing.py   # 7-tab dashboard with investigation depth showcase
  scenarios/
    registry.py         # 8 scenarios with 77K+ chars of investigation documents
  tests/
    test_environment.py       # 14 environment + API tests
    test_reward_hacking.py    # 12 adversarial anti-gaming tests
    test_investigation_depth.py # 10 investigation quality tests
  inference.py          # OpenAI function-calling baseline agent
  client.py             # Zero-dependency HTTP client
  models.py             # Pydantic observation/state models
  Dockerfile            # python:3.11-slim, port 7860
  openenv.yaml          # OpenEnv manifest with tasks
```

## Quick Start

```bash
# Install
pip install "openenv-core[core]" fastmcp gradio httpx openai

# Run locally
uvicorn server.app:app --host 0.0.0.0 --port 7860

# Run inference (NVIDIA NIM)
export API_BASE_URL="https://integrate.api.nvidia.com/v1"
export MODEL_NAME="stepfun-ai/step-3.5-flash"
export HF_TOKEN="nvapi-..."
python inference.py --space https://Itachi1824-compliance-auditor-env.hf.space

# Docker
docker build -t compliance-env . && docker run -p 7860:7860 compliance-env

# Tests
pytest tests/ -v
```

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/reset` | POST | Create session, returns tools + initial observation |
| `/api/call_tool` | POST | Call an audit tool in an active session |
| `/api/close` | POST | End session and cleanup |
| `/tasks` | GET | List available scenarios |
| `/grader` | POST | Grade a completed episode |
| `/health` | GET | Health check |