# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What this is

`OrbitalThrusterEnv` — an OpenEnv-compliant FastAPI environment for **OpenEnv Hackathon Theme #2 (Long-Horizon Planning & Instruction Following)**, plus a TRL+Unsloth training pipeline. The agent (an LLM) controls a spacecraft through a 5-phase mission (`detumble → coast → retarget → anomaly recovery → precision hold`) using 13 discrete thruster actions and a required `control_mode` declaration. Submission is the HF Space `pixxel-phantom/orbital-thruster-env`.

## Run

```bash
# Local server
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 7860

# Validate (requires server running at :7860 first — 22+ contract checks)
python validate.py

# Tests (no live server needed — conftest.py installs openenv stubs)
pytest tests/ -q
pytest tests/test_mission_ops_env.py::test_flagship_task_has_long_horizon_directives_and_anomalies -q

# Baseline rollout (random / deterministic-PD / tuned-PD)
python training/evaluate_baselines.py

# Generate seed trajectories (tuned-PD expert → training/data/seed_trajectories.jsonl)
python training/generate_seed_trajectories.py

# Full training (laptop or cloud)
python training/qwen3_smoke_sft.py      # SFT 80 steps QLoRA (Unsloth)
python training/qwen3_grpo_train.py     # GRPO 300 steps with 5 reward funcs
python training/eval_trained_model.py   # trained vs baselines

# Cloud training (HF Jobs, requires token with jobs.write)
hf jobs uv run --flavor l4x1 --timeout 2h --secrets HF_TOKEN \
  -e ORBITAL_BASE_MODEL=Qwen/Qwen2.5-1.5B-Instruct \
  -e ORBITAL_SFT_STEPS=40 -e ORBITAL_GRPO_STEPS=80 -e ORBITAL_NUM_GEN=4 \
  -e ORBITAL_SKIP_SFT_WARMUP=1 \
  -e OUTPUT_REPO=pixxel-phantom/orbital-thruster-grpo-fast \
  -d training/hf_job_train.py

# Push to HF Space (uses huggingface_hub, ignores venvs/caches)
python -c "from huggingface_hub import HfApi; HfApi().upload_folder(folder_path='.', repo_id='pixxel-phantom/orbital-thruster-env', repo_type='space', ignore_patterns=['venv*/**','**/__pycache__/**','.git/**','trainer_output/**','*.pyc'])"
```

## Architecture

### Contracts (read before touching anything)
- `models.py` — Pydantic schemas: `OrbitalThrusterAction` (13 `ActionType` values + 7 `ControlMode` values + optional `reason`), `OrbitalThrusterObservation`, `EnvState`. **Schema is the OpenEnv contract** — judges validate via `validate.py`.
- `openenv.yaml` — manifest loaded by `validate.py`; must stay consistent with server routes and schema.

### Server (OpenEnv-compliant FastAPI)
- `server/app.py` — wires `OrbitalThrusterEnvironment` into `openenv.core.env_server.create_app`. Strips the auto-generated `GET /state` route and re-adds a custom one returning `EnvState.model_dump()`. Exposes `POST /reset_hard` to bump generation and rebuild the env (workaround for single-session env).
- `server/orbital_thruster_environment.py` — `Environment[Action, Observation, EnvState]` impl. Holds entire mission state in instance fields. `step()` flow: propagate dynamics → compute errors → detect overshoot → update on-target streak → check directive milestone → score reward → assemble observation. **`SUPPORTS_CONCURRENT_SESSIONS = False`** — one episode at a time; built with `max_concurrent_envs=1`.
- `server/dynamics.py` — pure-Python rigid-body dynamics with seeded sinusoidal disturbances. `propagate()` is the physics step; `signed_angle_error()` and `vector_magnitude()` reused everywhere.
- `server/reward.py` — `RewardScorer` returns scalar reward **and** 6-key rubric (`physical_tracking`, `fuel_discipline`, `milestone_completion`, `control_mode`, `anomaly_recovery`, `anti_stall_penalty`). `score_episode()` and `is_success()` are separate, difficulty-weighted formulas. **Do not modify reward logic** unless fixing a real bug.
- `server/tasks/base.py` — base classes and dataclasses for task definition: `MissionTask` (ABC), `TimedTarget` (per-phase directive with `attitude_deg`, `deadline_step`, `milestone`, `recommended_modes`, `fuel_reserve_target`), `MissionAnomaly`, `TaskConfig`, `ControlProfile`, `DisturbanceProfile`. This is the extension point when adding new tasks.
- `server/tasks/` — one `MissionTask` per file (`task_easy`, `task_medium`, `task_hard`, `task_flagship`). `task_flagship.py` is the headline 360-step `mission_ops_long_horizon` task with 5 timed directives, fuel reserve targets per phase, and a `gyro_bias_spike` anomaly. Tasks register in `server/tasks/__init__.py::TASK_REGISTRY`.

### Inference / agent contract
- `inference.py` — single source of truth for valid actions/control modes (`VALID_ACTIONS`, `VALID_CONTROL_MODES`), the system prompt, and two non-LLM controllers (`deterministic_controller`, `tuned_mission_controller` — PD baselines and LLM fallback). `choose_action()` posts to an OpenAI-compatible LLM (HF Router by default), falls back to tuned PD on failure or invalid JSON.
- `client.py` — typed `OrbitalThrusterEnv(EnvClient)` for driving a running server programmatically.

### Training pipeline
- `training/common.py` — shared utilities. `collect_seed_records()` runs the tuned-PD expert and writes `training/data/seed_trajectories.jsonl`; `build_prompt()` is the prompt format for SFT and GRPO; `parse_action_json()` validates LLM JSON output. Adds `ROOT` (repo root) to `sys.path` on import.
- `training/rl_utils.py` — **5 independent reward functions** (`reward_format`, `reward_env_step`, `reward_mode_match`, `reward_anti_spam`, `reward_fuel_discipline`) consumed by `GRPOTrainer.reward_funcs=[...]`. `reward_env_step` replays `history_actions` into a fresh env to score the candidate action — this is the verifier signal. Also exports `make_lora_controller()`, `RewardCSVLogger`, `plot_training_curves()`.
- `training/qwen3_smoke_sft.py` — Unsloth `FastLanguageModel` + TRL `SFTTrainer`, QLoRA r=16, JSON-format priming on tuned-PD traces. Reads `ORBITAL_SFT_STEPS` env var (default 80). **`import unsloth` MUST precede transformers** or patches don't apply.
- `training/qwen3_grpo_train.py` — Unsloth + TRL `GRPOTrainer`. Loads SFT adapter via `safetensors.load_file` + `model.load_state_dict(strict=False)` with dtype-cast (standard `model.load_adapter()` is broken with Unsloth). Set `ORBITAL_SKIP_SFT_WARMUP=1` to skip overlay on dtype/key mismatches. Reads `ORBITAL_GRPO_STEPS` (default 300) and `ORBITAL_NUM_GEN` (default 6).
- `training/hf_job_train.py` — UV-script entrypoint for `hf jobs uv run`. PEP 723 deps block at top. Downloads HF Space repo via `snapshot_download`, runs SFT → GRPO → eval, uploads to `OUTPUT_REPO`.
- `training/eval_trained_model.py` — rolls trained adapter across all 4 tasks, compares vs random/deterministic/tuned-PD, writes `outputs/eval_trained/trained_vs_baseline.{csv,png}`.
- `training/local_train.py` — vanilla TRL+peft+bnb fallback (no Unsloth) for Windows. Use only if Unsloth import fails.

### Tests
- `tests/conftest.py` — installs openenv stubs so tests run without the actual `openenv-core` package and without a live server.
- `tests/test_mission_ops_env.py` — unit tests for env contract, milestone sequencing, anomaly injection, flagship task structure.
- `tests/test_baseline_regression.py` — full rollouts via `deterministic_controller` and `tuned_mission_controller`; asserts tuned-PD beats deterministic and random across easy/flagship tasks.

### Output conventions
- `trainer_output/qwen_sft/` — SFT LoRA adapter
- `trainer_output/qwen_grpo/` — final GRPO LoRA adapter
- `outputs/baseline_eval/baseline_summary.{csv,png}` — random/det/tuned-PD baselines
- `outputs/training/grpo_metrics.{csv,png}` — per-component reward + loss curves
- `outputs/eval_trained/trained_vs_baseline.{csv,png}` — final comparison

## Hard rules

- **Don't modify the reward function or task definitions** unless fixing a real bug. The rubric design is a judging signal; tampering invalidates the multi-component anti-reward-hacking story.
- **Don't reinvent the env API.** It must remain OpenEnv-compliant (`reset`, `step`, `state`, action/observation Pydantic models). `validate.py` enforces 22+ contract checks — if it fails, the submission is invalid.
- **Save LoRA adapters only**. Never naively merge a 4-bit model to 16-bit and re-save — it damages weights.
- **Curriculum + 5-reward design is the anti-hacking story.** Don't collapse to a single reward signal — judges score 20% on improvement *without* reward hacking.
- **The flagship task is `mission_ops_long_horizon`** — all demo plots and the README pitch are framed around it.

## Submission state

- HF Space: `pixxel-phantom/orbital-thruster-env` (live, Docker SDK)
- Trained adapter: `pixxel-phantom/orbital-thruster-grpo-fast` (1.5B, deadline run) and `pixxel-phantom/orbital-thruster-grpo` (4B, post-deadline)
- README has placeholder slots for blog/video URLs at the top — fill before submitting
- Submission requirements (themes.md): OpenEnv ✓, training script ✓, loss+reward plots from real run (auto-generated post-training), <2 min video OR HF blog (manual), Space URL ✓, README that motivates + shows results