# ๐Ÿข GridMind-RL โ€” Energy Management Reinforcement Learning Environment **A real-world RL environment for intelligent building energy optimization.** Control HVAC systems, thermal storage, batch job scheduling, and demand-response under stochastic electricity prices and grid stress events. Built on the [OpenEnv](https://github.com/meta-pytorch/OpenEnv) specification. Containerized. Ready for Hugging Face Spaces deployment. --- ## ๐Ÿ“– Overview & Motivation Building energy management is a **real-world optimization problem** facing utilities, facility operators, and industrial sites globally. Traditional rule-based controls waste billions in energy costs and miss opportunities for grid participation. **GridMind-RL** simulates decisions that facility operators must make daily: - **Cost Optimization** โ€” Buy electricity when prices are low, avoid peak surcharges - **Comfort & Safety** โ€” Maintain indoor temperature within acceptable ranges while managing thermal inertia - **Grid Participation** โ€” Respond to demand-response signals and grid stress events - **Batch Scheduling** โ€” Coordinate industrial process timings to meet deadlines and minimize energy cost - **Carbon Minimization** โ€” Shift consumption to periods when grid carbon intensity is low **Why this matters:** An RL agent trained in this environment can learn strategies that would be difficult or impossible for humans to hand-craft. The combination of continuous control (HVAC power, thermal storage), discrete decisions (batch scheduling), and multiple simultaneous objectives (cost, comfort, grid, deadlines, carbon) creates a realistic, challenging benchmark. **Episode Length:** 96 steps = 24 hours at 15-minute resolution. A complete episode requires strategic decision-making across a full day-night cycle. --- ## ๏ฟฝ Observation Space At each timestep, the environment provides the following observations. **Episode length: 96 steps** (15-minute intervals = 24 hours). | Field | Data Type | Range / Values | Description | |-------|-----------|-----------------|-------------| | `indoor_temperature` | float | 10โ€“40 ยฐC | Current building interior temperature | | `thermal_storage_level` | float | 0.0โ€“1.0 | Thermal tank charge state (0 = empty, 1 = full) | | `process_demand` | float | โ‰ฅ 0 kW | Current industrial batch process power draw | | `current_price` | float | > 0 $/kWh | Real-time spot electricity price | | `grid_stress_signal` | float | 0.0โ€“1.0 | Utility demand-response urgency (0.7+ = critical) | | `carbon_intensity` | float | โ‰ฅ 0 gCOโ‚‚/kWh | Current grid carbon intensity | | `hour_of_day` | int | 0โ€“23 | Time-of-day context | | `batch_queue` | int array | โ€” | Pending batch jobs with deadline slots | | `cumulative_cost` | float | โ‰ฅ 0 $ | Energy cost accumulated in current episode so far | | `step` | int | 0โ€“95 | Current timestep (96 total = 24 hours) | | `building_id` | int | 0+ | Building identifier (for multi-building scenarios) | **Observation Properties:** - Observations are **deterministic** given the seed โ€” same seed produces identical sequences - All fields are **normalized or bounded** for stable learning - Prices follow realistic time-of-use patterns; carbon intensity varies with grid mix - Batch queue starts empty; jobs appear stochastically based on the task/seed --- ## ๐ŸŽฎ Action Space At each step, the agent sends an action controlling four independent subsystems: | Field | Data Type | Range | Description | |-------|-----------|-------|-------------| | `hvac_power_level` | float | 0.0โ€“1.0 | HVAC system power (0 = off, 1 = full) | | `thermal_charge_rate` | float | -1.0โ€“1.0 | Thermal storage control (+charge, -discharge) | | `batch_job_slot` | int | 0โ€“4 | Schedule next batch job: 0=immediate, 1โ€“4=defer | | `load_shed_fraction` | float | 0.0โ€“0.5 | Non-critical load reduction (0โ€“50%) for demand-response | | `building_id` | int | 0+ | Building identifier (routing) | **Action Space Properties:** - **Continuous** (HVAC, thermal charging, load shedding) + **discrete** (batch scheduling) โ†’ hybrid control - Actions are applied every 15-minute step - Load shedding is capped at 50% to ensure safety/habitability - Batch scheduling decisions affect energy cost and deadline compliance --- ## ๐Ÿ’ก Reward Function The environment provides **dense rewards every step** (not sparse, not binary). Each step returns: - A scalar reward (sum of components) - A dictionary of 7 weighted sub-components for transparency | Component | Purpose | Possible Values | |-----------|---------|-----------------| | **cost_savings** | Minimize energy bill | Negative (cost increases) to positive (savings vs baseline) | | **temp_constraint** | Maintain comfort | Gaussian bonus near 21ยฐC, penalty outside 19โ€“23ยฐC bounds | | **grid_response** | Shift load during stress | Bonus proportional to shed fraction when grid signal > 0.7 | | **efficiency_bonus** | Exploit thermal storage | Reward charge/discharge timing and thermal arbitrage | | **stability_penalty** | Smooth control | Small penalty for rapid oscillations in HVAC/storage | | **deadline_penalty** | Meet job deadlines | Large penalty if batch job finishes after deadline | | **carbon_reward** | Low-carbon consumption | Bonus for consuming during low-carbon grid periods | **Example Reward Calculation:** If an agent takes a well-timed action during high-price, high-stress period: - Large positive `cost_savings` (avoided expensive hour) - Positive `grid_response` (shed load successfully) - Possible positive `carbon_reward` (if grid is clean) - **Total step reward** = weighted sum of all components This multi-objective reward structure encourages **learning tradeoffs** between cost, comfort, grid support, and carbon efficiency. --- --- ## ๐Ÿ“‹ Tasks & Difficulty Levels Three independent tasks with **deterministic programmatic graders**. Scores range **0.0โ€“1.0**; higher is better. ### Task 1 โ€” Cost Minimization (๐ŸŸข Easy) **Objective:** Minimize total energy cost in 24 hours with no other constraints. **Difficulty Rationale:** Only one objective (cost) to optimize; temperature and grid constraints are relaxed. **Grader Metrics:** - **Cost score (100%)** โ€” Compares total episode energy cost to a deterministic baseline. Higher savings โ†’ higher score. **Baseline Score:** **0.7063** --- ### Task 2 โ€” Constrained Temperature Control (๐ŸŸก Medium) **Objective:** Minimize cost while maintaining indoor temperature between **19โ€“23ยฐC** throughout the episode. **Difficulty Rationale:** Introduces a hard constraint (temperature bounds). Agent must use thermal storage strategically to meet both cost and comfort goals. **Grader Metrics:** - **Cost score (60%)** โ€” Total energy cost vs baseline - **Temperature score (40%)** โ€” Fraction of steps within bounds (hard penalty for violations) **Notes:** A naive agent might achieve low cost by disabling HVAC, but then temperatures drift out of bounds (0 score). Trade-off learning is required. **Baseline Score:** **0.6333** --- ### Task 3 โ€” Full Demand Response (๐Ÿ”ด Hard) **Objective:** Minimize cost, maintain temperature, respond to grid events, complete batch jobs on time, and minimize carbon emissions. This is a **multi-objective constraint satisfaction** problem. **Difficulty Rationale:** Most realistic. Agent must balance five competing objectives simultaneously; any single failure is costly. **Grader Metrics:** - **Cost score (28%)** โ€” Energy cost - **Temperature score (20%)** โ€” Time within comfort bounds - **Grid response score (20%)** โ€” Load shed during demand-response events (signal > 0.7) - **Batch deadline score (12%)** โ€” Fraction of jobs completed before deadline - **Carbon reward score (20%)** โ€” Shift load to low-carbon periods **Baseline Breakdown:** - Cost: 0.670, Temperature: 0.573, Grid: 0.214, Batch: 1.000, Carbon: 0.657 - **Overall: 0.5966** **Challenge:** Grid response score (~0.21) shows that the baseline heuristic rarely sheds load opportunistically. Learning agents should discover that quick load shedding during high-price, high-stress periods yields significant cost savings. **Grader Determinism:** Same seed always produces identical evaluations. Episodes are seeded internally; reproducible batches of evaluations can be generated for benchmark comparisons. --- ## ๐Ÿš€ Setup & Usage ### Prerequisites - **Docker** โ€” [Download Docker Desktop](https://www.docker.com/products/docker-desktop/) - **Python 3.10+** โ€” [Download Python](https://www.python.org/downloads/) - **Git** โ€” [Download Git](https://git-scm.com/downloads) ### Quick Start (5 minutes) #### 1. Clone the Repository ```bash git clone https://github.com/LO-Kyu/gridmind-rl.git cd gridmind-rl ``` #### 2. Build and Start the Environment Server ```bash docker build -t gridmind-rl . docker run --rm -d -p 7860:7860 -p 7861:7861 --name gridmind gridmind-rl ``` Verify the server is running: ```bash # Check health endpoint curl http://localhost:7860/health # Expected: {"status":"ok","version":"1.0.0"} ``` #### 3. Install Python Dependencies Open a **new terminal** and install: ```bash pip install -r python/requirements.txt ``` #### 4. Run Inference (No LLM โ€” Fast) Run a fast, deterministic baseline using heuristic policy: ```bash python inference.py --fast-mode --episodes 1 ``` Expected output (sample): ``` [START] task=Cost_Minimization env=gridmind model=heuristic [STEP1] step=1 action={...} reward=10.5 done=false [STEP2] step=2 action={...} reward=12.3 done=false ... [STEP96] step=96 action={...} reward=8.9 done=true [END] success=true steps=96 rewards=[10.5, 12.3, ..., 8.9] ``` Results saved to: `baseline_scores.json` #### 5. (Optional) Run with LLM To use an LLM agent for decision-making: 1. Get a **free API key** from [openrouter.ai/keys](https://openrouter.ai/keys) (no credit card needed) 2. Create `.env` file (copy from `.env.example`): ```bash cp .env.example .env ``` 3. Edit `.env` and add your API key: ```env HF_TOKEN=sk-or-v1-your-key-here # or OPENAI_API_KEY=sk-or-v1-your-key-here ``` 4. Run with LLM: ```bash python inference.py --episodes 1 ``` #### 6. Stop the Server (When Done) ```bash docker stop gridmind ``` --- ### Inference Script Reference The `inference.py` script (project root) is the **hackathon submission entrypoint**. **Environment Variables:** | Variable | Default | Description | |----------|---------|-------------| | `HF_TOKEN` | (required for submission) | API key for LLM provider or HF Spaces | | `OPENAI_API_KEY` | (optional fallback) | Alternative OpenAI-compatible key | | `API_BASE_URL` | `https://openrouter.ai/api/v1` | LLM endpoint URL | | `MODEL_NAME` | `meta-llama/llama-3.3-70b-instruct:free` | Model identifier | | `ENV_URL` | `http://localhost:7860` | Environment server address | **Command-Line Flags:** | Flag | Default | Description | |------|---------|-------------| | `--episodes N` | 1 | Episodes per task (runs tasks 1, 2, 3 in sequence) | | `--fast-mode` | off | Don't call LLM; use heuristic policy only (reproducible, no API calls) | | `--llm-every N` | 4 | Reuse each LLM decision for N steps (reduces API calls) | | `--max-steps N` | 96 | Stop episode early after N steps | | `--env-url URL` | from env var | Override environment server URL | | `--output FILE` | `baseline_scores.json` | Output results filename | | `--verbose` | off | Print detailed logs for each step | **Examples:** ```bash # Run all 3 tasks with LLM (1 episode each) python inference.py --episodes 1 # Reproduce baseline fast (no LLM) python inference.py --fast-mode --episodes 1 # Only Task 2, heuristic, verbose output python inference.py --fast-mode --episodes 1 --verbose # Run 5 episodes per task with custom environment python inference.py --episodes 5 --env-url http://my-server:7860 ``` --- ### HTTP API Reference **Base URL:** `http://localhost:7860` | Endpoint | Method | Purpose | Example Body | |----------|--------|---------|---------------| | `/health` | GET | Liveness check | โ€” | | `/ping` | GET | Lightweight ping | โ€” | | `/reset` | POST | Reset episode for a task | `{"task_id": 1, "seed": 42}` | | `/step` | POST | Apply action, get next observation | `{"hvac_power_level": 0.5, "thermal_charge_rate": 0.1, ...}` | | `/state` | GET | Current full state snapshot | โ€” | | `/grade` | GET | Episode score (0.0โ€“1.0) with sub-scores | โ€” | | `/replay` | GET | Full step-by-step trajectory | โ€” | | `/tasks` | GET | Task definitions and grader weights | โ€” | | `/metrics` | GET | Prometheus-format metrics | โ€” | **Example Workflow:** ```bash # 1. Reset to Task 1 with seed 42 curl -X POST http://localhost:7860/reset \ -H "Content-Type: application/json" \ -d '{"task_id": 1, "seed": 42}' # 2. Get initial observation curl http://localhost:7860/state # 3. Take an action curl -X POST http://localhost:7860/step \ -H "Content-Type: application/json" \ -d '{ "hvac_power_level": 0.5, "thermal_charge_rate": 0.1, "batch_job_slot": 1, "load_shed_fraction": 0.0 }' # 4. Check final score after episode completes curl http://localhost:7860/grade ``` --- ## ๐Ÿ“Š Baseline Performance Scores The baseline is a **heuristic policy** (rule-based, no LLM) representing a reasonable but non-optimized control strategy. Your RL agent should aim to exceed these scores. **Baseline Run:** `python inference.py --fast-mode --episodes 1` ### Summary Scores | Task | Difficulty | Score | Model | |------|:----------:|:-----:|-------| | Task 1 โ€” Cost Minimization | ๐ŸŸข Easy | **0.7063** | Heuristic | | Task 2 โ€” Temperature Control | ๐ŸŸก Medium | **0.6333** | Heuristic | | Task 3 โ€” Full Demand Response | ๐Ÿ”ด Hard | **0.5966** | Heuristic | | **Overall Average** | โ€” | **0.6454** | Heuristic | ### Detailed Breakdown #### Task 1 Results - **Task:** Cost minimization (96 hours ร— 15 min = 24 hours) - **Score:** 0.7063 - **Sub-score:** Cost = 0.706 - **Interpretation:** Heuristic achieves ~70% of optimal cost reduction vs baseline #### Task 2 Results - **Task:** Minimize cost while maintaining temperature 19โ€“23ยฐC - **Score:** 0.6333 - **Sub-scores:** - Cost: 0.701 - Temperature constraint: 0.531 (agent violated comfort bounds ~47% of the time) - **Interpretation:** Temperature management is challenging for the heuristic. Tighter thermal control could improve this score significantly. #### Task 3 Results (Most Interesting) - **Task:** Multi-objective: cost, temperature, grid response, batch deadlines, carbon - **Score:** 0.5966 - **Sub-scores:** - Cost: 0.670 - Temperature: 0.573 (similar temperature control challenge as Task 2) - **Grid response: 0.214** โ† Heuristic rarely participates in demand-response - Batch deadline: 1.000 (heuristic always completes jobs on time) - Carbon: 0.657 **Key Insight:** The heuristic's low grid response score (0.21) suggests that learned agents have significant room for improvement by: 1. Recognizing high-price + high-stress periods 2. Proactively shedding load to reduce cost 3. Using thermal storage to recover comfort afterward This multi-objective setting is where RL agents typically exceed heuristic baselines. ### Reproducibility & Evaluation - **Deterministic:** Baseline scores are **deterministic** โ€” same seed always produces identical actions and rewards - **Seeding:** Each task uses a fixed base seed (1100, 1200, 1300) for reproducible evaluation - **Your Submissions:** Your agent will be evaluated on the same seed distribution; compare your scores directly to baseline --- ## ๐Ÿ—๏ธ Architecture ``` โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ inference.py (LLM Agent or Heuristic) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ HTTP: POST /reset, /step ยท GET /grade, /state โ”‚ โ”‚ โ–ผ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ Docker Container โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ Go Environment โ”‚ โ”‚ Python Dashboard โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ Server (:7860) โ”‚ โ”‚ FastAPI + UI (:7861) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ€ข Physics engine โ”‚ โ”‚ โ€ข Proxies /api โ†’ :7860 โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ€ข Reward function โ”‚โ—„โ”€โ”€โ”‚ โ€ข Real-time charts โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ€ข Task graders โ”‚ โ”‚ โ€ข State visualization โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ Isolated ยท Reproducible ยท Non-root user โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ``` ### Project Structure ``` gridmind/ โ”œโ”€โ”€ inference.py โ† Hackathon entrypoint (root) โ”œโ”€โ”€ openenv.yaml โ† OpenEnv spec manifest โ”œโ”€โ”€ Dockerfile โ† Multi-stage build (Go + Python) โ”œโ”€โ”€ .env โ† API credentials (git-ignored) โ”œโ”€โ”€ baseline_scores.json โ† Produced by inference.py โ”‚ โ”œโ”€โ”€ main.go โ† HTTP server (routes, middleware, metrics) โ”œโ”€โ”€ env/ โ† Core environment logic (Go) โ”‚ โ”œโ”€โ”€ environment.go โ† Simulation: physics, thermal dynamics โ”‚ โ”œโ”€โ”€ models.go โ† All data types (Observation, Action, etc.) โ”‚ โ”œโ”€โ”€ rewards.go โ† 7-component dense reward function โ”‚ โ””โ”€โ”€ tasks.go โ† 3 task definitions + deterministic graders โ”‚ โ”œโ”€โ”€ python/ โ† Python support layer โ”‚ โ”œโ”€โ”€ inference.py โ† Full LLM agent + heuristic fallback โ”‚ โ”œโ”€โ”€ models.py โ† Typed Pydantic models (mirrors Go structs) โ”‚ โ”œโ”€โ”€ validate.py โ† OpenEnv spec validation suite โ”‚ โ””โ”€โ”€ requirements.txt โ† Python dependencies โ”‚ โ”œโ”€โ”€ tests/ โ† Automated tests โ”‚ โ”œโ”€โ”€ environment_test.go โ† Go unit tests (determinism, bounds, etc.) โ”‚ โ””โ”€โ”€ test_graders.py โ† Python grader tests (pytest) โ”‚ โ””โ”€โ”€ dashboard/ โ† Optional web dashboard โ”œโ”€โ”€ server.py โ† FastAPI server โ””โ”€โ”€ static/ โ† Frontend assets ``` --- ## ๐Ÿณ Docker | Action | Command | |--------|---------| | **Build** | `docker build -t gridmind-rl .` | | **Run (foreground)** | `docker run --rm -p 7860:7860 -p 7861:7861 --name gridmind gridmind-rl` | | **Run (background)** | `docker run --rm -d -p 7860:7860 -p 7861:7861 --name gridmind gridmind-rl` | | **Stop** | `docker stop gridmind` | | **Run inference inside container** | `docker exec -it gridmind python /app/inference.py --fast-mode` | The Dockerfile uses a **multi-stage build**: 1. **Stage 1** โ€” Go 1.21 Alpine: compiles the environment server binary 2. **Stage 2** โ€” Python 3.11 slim: runs the Go binary + Python dashboard via Supervisor --- ## โ˜๏ธ Hugging Face Space Deployment ### 1. Create a New Space Go to [huggingface.co/new-space](https://huggingface.co/new-space): - **SDK:** Docker - **Hardware:** CPU Basic (2 vCPU, 16 GB โ€” free tier) ### 2. Push to HF ```bash git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/gridmind-rl git push hf main ``` ### 3. Verify ```bash curl https://YOUR_USERNAME-gridmind-rl.hf.space/health # โ†’ {"status":"ok","version":"1.0.0"} curl -X POST https://YOUR_USERNAME-gridmind-rl.hf.space/reset \ -H "Content-Type: application/json" \ -d '{"task_id":1,"seed":42}' ``` > **Note:** HF Spaces exposes port **7860** publicly. The dashboard (7861) is for local development only. --- ## ๐Ÿงช Testing ### Run Go Unit Tests ```bash cd gridmind go test ./tests/ -v ``` ### Run Python Grader Tests (requires server running) ```bash pytest tests/test_graders.py -v ``` ### Run Full OpenEnv Validation ```bash python python/validate.py --env-url http://localhost:7860 ``` --- ## ๐Ÿ“ Inference Script Reference The `inference.py` script at the project root is the **hackathon entrypoint**. ### Environment Variables | Variable | Default | Description | |----------|---------|-------------| | `API_BASE_URL` | `https://openrouter.ai/api/v1` | LLM API endpoint | | `MODEL_NAME` | `meta-llama/llama-3.1-8b-instruct:free` | Model to use | | `OPENAI_API_KEY` | โ€” | API key (any OpenAI-compatible provider) | | `ENV_URL` | `http://localhost:7860` | Environment server URL | ### Command-Line Flags | Flag | Default | Description | |------|---------|-------------| | `--episodes N` | 1 | Episodes per task (tasks 1โ€“3 run in sequence) | | `--fast-mode` | off | Use heuristic policy only (no LLM, fully reproducible) | | `--llm-every N` | 4 | Reuse each LLM action for N steps (reduces API calls) | | `--max-steps N` | 96 | Stop early after N steps | | `--env-url URL` | from env | Override environment URL | | `--output FILE` | `baseline_scores.json` | Output results file | | `--verbose` | off | Print detailed step logs | ### Stdout Log Format Each episode emits structured markers for automated evaluation: ``` [START] [STEP1] [STEP2] ... [STEP96] [END] ``` --- ## โœ… OpenEnv Specification Compliance GridMind-RL fully implements the OpenEnv specification for standardized RL environments. All components are present and tested: | Requirement | Status | Notes | |-------------|:------:|-------| | Manifest (`openenv.yaml`) | โœ… | All metadata, schema definitions, and version info | | Observation Schema | โœ… | 11-field object: temperature, storage, price, grid signal, carbon, hour, batch queue, cost, step, building_id | | Action Schema | โœ… | 5-field object: HVAC, thermal rate, batch slot, load shed, building_id | | HTTP Endpoints | โœ… | `/reset`, `/step`, `/state`, `/grade`, `/replay`, `/tasks`, `/health`, `/metrics` | | Determinism | โœ… | Seeded episode generation; identical seeds produce identical trajectories | | Typed Models | โœ… | Pydantic models (Python) mirror Go structs exactly | | Dense Rewards | โœ… | 7-component reward breakdown every step | | Graders | โœ… | 3 tasks with programmatic, deterministic graders (0.0โ€“1.0 range) | | Exploit Detection | โœ… | Built into grading pipeline to flag unrealistic scores | --- ## โ“ FAQ **Q: Can I use a different model?** A: Yes. Set `MODEL_NAME` environment variable to any OpenAI-compatible model. The default (`meta-llama/llama-3.3-70b-instruct:free`) is free on OpenRouter with no credit card. **Q: How do I avoid rate limiting?** A: (1) Use `--fast-mode` for local testing (no API calls), (2) Set `--llm-every 4` to reuse decisions, (3) Use a paid API tier for submission, or (4) Train & submit an offline policy. **Q: Will my API key be exposed in submissions?** A: No. Store your API key in `.env` (git-ignored). On HF Spaces, set secrets via the Space settings UI; keys are never committed to the repo. **Q: What's the difference between `HF_TOKEN` and `OPENAI_API_KEY`?** A: `HF_TOKEN` is used in HF Space deployments and external evaluations. `OPENAI_API_KEY` is a fallback for local development. The code tries `HF_TOKEN` first, then `OPENAI_API_KEY`. At least one must be set. **Q: Can I submit an offline/trained policy?** A: Yes. Modify `python/inference.py` to use your trained agent instead of LLM calls. Ensure you still output the required `[START]`, `[STEP]`, `[END]` format. **Q: What if my submission times out?** A: Each episode is 96 steps. The environment runs 3 episodes (one per task). Optimize for latency: reduce LLM calls (use `--llm-every`), use a faster model, or submit a heuristic/trained offline policy. --- ## ๐ŸŽฏ Submission Checklist Before submitting, verify: - [ ] Clone repo, build Docker, run `docker run -p 7860:7860 -p 7861:7861 gridmind-rl` - [ ] Run `python inference.py --fast-mode --episodes 1` locally โ€” should produce `baseline_scores.json` - [ ] Check `[START]`, `[STEP]`, `[END]` markers in stdout - [ ] Set `HF_TOKEN` or `OPENAI_API_KEY` in `.env` for LLM runs - [ ] Test with LLM: `python inference.py --episodes 1` - [ ] Verify Dockerfile builds without errors: `docker build -t gridmind-rl .` - [ ] Create HF Space (Docker SDK, CPU Basic) - [ ] Push repo to HF Space: `git push hf main` - [ ] Set secrets in HF Space UI: `HF_TOKEN`, `API_BASE_URL` (optional), `MODEL_NAME` (optional) - [ ] Verify Space is running: `curl https://YOUR_USERNAME-gridmind-rl.hf.space/health` - [ ] Submit Space URL to hackathon organizers --- ## ๐Ÿ“š Additional Resources - **OpenEnv Spec:** https://github.com/meta-pytorch/OpenEnv - **OpenRouter Free Models:** https://openrouter.ai/keys - **HF Spaces Docs:** https://huggingface.co/docs/hub/spaces - **GridMind Repository:** https://github.com/LO-Kyu/gridmind-rl --- ## ๐Ÿ“„ License See `LICENSE` in the repository.