--- title: Orbital Thruster Environment Server sdk: docker pinned: false app_port: 7860 base_path: /web colorFrom: blue colorTo: indigo tags: - openenv --- # OrbitalThrusterEnv OpenEnv benchmark for **Theme #2: (Super) Long-Horizon Planning & Instruction Following**. The agent must track mission directives over a long episode, preserve fuel for delayed objectives, recover from anomalies, and finish in precision hold. **Submission links** - Hugging Face Space: https://huggingface.co/spaces/pixxel-phantom/orbital-thruster-env - Trained adapter (GRPO LoRA, 1.5B): https://huggingface.co/pixxel-phantom/orbital-thruster-grpo-fast - Trained adapter (GRPO LoRA, 4B): https://huggingface.co/pixxel-phantom/orbital-thruster-grpo - Mini-blog / write-up: https://huggingface.co/spaces/pixxel-phantom/orbital-thruster-env/blob/main/BLOG.md - Trained adapter (GRPO LoRA): https://huggingface.co/pixxel-phantom/orbital-thruster-grpo-fast - Training notebook: [`training/train_orbital_grpo.ipynb`](training/train_orbital_grpo.ipynb) **Pitch**: early waste breaks later phases. A controller that looks good on short-horizon pointing can still fail the flagship mission because it burns fuel before the retarget, mishandles the anomaly, or reaches the final hold phase with no reserve left. ## Problem Modern mission-operations control is not one action repeated forever. It is a chain of directives: 1. Detumble after deployment. 2. Respect a quiet coast window. 3. Repoint to a new relay geometry. 4. Recover from an injected gyro-bias anomaly. 5. Finish with stable precision hold. This benchmark turns that story into a verifier-backed environment with explicit milestones, delayed checkpoints, and anti-shortcut rewards. ## Environment The environment keeps the existing orbital control core: - 13 discrete thruster actions plus `idle` - deterministic seeded disturbances - limited RCS fuel - dense physical reward from pointing, stability, fuel, and overshoot On top of that, the mission-ops pivot adds: - `mission_brief` - `active_directive` - `pending_directives_count` - `milestones_completed` - `anomaly_flags` - `fuel_reserve_target` - `phase_deadline_step` - `reward_breakdown` - `episode_metrics` Each action now also includes a required `control_mode`: - `detumble` - `slew` - `brake` - `trim` - `hold` - `recover` - `safe_hold` ## Tasks ### Curriculum Tasks - `detumble_satellite` (`easy`): stabilize a newly deployed spacecraft and finish with ample reserve. - `retarget_180_flip` (`medium`): survive a delayed maneuver window, execute the large flip, and settle cleanly. - `long_horizon_precision_hold` (`hard`): preserve a fine-pointing envelope under long disturbance exposure. ### Flagship Theme #2 Task - `mission_ops_long_horizon` (`hard`): a single episode that chains detumble, coast discipline, retargeting, anomaly recovery, and final precision hold. This flagship task is the main demo task for the hackathon. ## Reward Design The environment logs rubric-style reward columns instead of a single opaque scalar: | Column | Signal | |---|---| | `physical_tracking_reward` | Pointing accuracy + hold streak bonus − stability − overshoot penalties | | `fuel_discipline_reward` | Per-step fuel cost penalty + reserve-gap penalty | | `milestone_completion_reward` | +0.35 on verified directive completion | | `control_mode_reward` | +0.12 if declared mode matches recommended; −0.08 otherwise | | `anomaly_recovery_reward` | Bonus for error/rate improvement under active anomaly | | `anti_stall_penalty` | Penalty for consecutive steps without meaningful progress | These are surfaced per step in `reward_breakdown` and aggregated in `state.reward_columns`. That makes it easy to show judges not only that reward improved, but **which behaviors improved**. ## Baselines Three baselines are supported end-to-end: - seeded random controller - deterministic PD controller - tuned PD controller The current intended story is: - deterministic clears `easy` - tuned PD clears `medium` - both heuristics fail the flagship mission Fixed-seed baseline results: | Policy | Easy (detumble) | Medium (retarget) | Hard (hold) | Flagship | Fuel Used (flagship) | |---|---|---|---|---|---| | Random | 23.9 / fail | 3.2 / fail | −25.3 / fail | −53.5 / fail | 90.0 | | Deterministic PD | 17.6 / **pass** | 97.4 / fail | 21.1 / fail | 89.8 / fail | 90.0 | | Tuned PD | 34.2 / **pass** | 120.1 / **pass** | 27.5 / fail | 115.8 / fail | 88.8 | ![Baseline comparison across all tasks and policies](outputs/baseline_eval/baseline_summary.png) *Baseline summary: reward totals per policy per task. All three heuristic controllers fail the flagship task.* Run the fixed-seed evaluation: ```powershell python training/evaluate_baselines.py ``` ## Training Stack **Stack**: TRL (`SFTTrainer` → `GRPOTrainer`) + PEFT QLoRA on the real OpenEnv environment as the verifier. **Base model**: `Qwen/Qwen2.5-7B-Instruct` (A100 via HF Jobs) for the headline run; `Qwen/Qwen2.5-1.5B-Instruct` for the fast L4 run. Override via `ORBITAL_BASE_MODEL` env var. **Why this model**: strong JSON adherence (we score on JSON validity), fits 4-bit QLoRA on a single GPU, mature TRL integration. We use the vanilla TRL + PEFT + bitsandbytes path (no Unsloth) because the Unsloth `matmul_lora` kernel hit a dtype mismatch (`Half` vs `Float`) on the cloud image and the dependency lock chain (`unsloth` → `trl ≥ 0.18` → `mergekit` → `pydantic <2.11` vs `openenv-core` → `pydantic ≥2.11.7`) is unresolvable. Vanilla TRL on the same image works first try. **Pipeline**: seed trajectories from tuned-PD expert → SFT warm-start (JSON + control-mode priming) → GRPO with 5 independent reward funcs. The plan that produced the headline 7B run: 150 SFT steps, 300 GRPO steps, `num_generations=8`, `temperature=1.3` (high enough to keep `frac_reward_zero_std=0` — i.e. break the mode-collapse trap), curriculum-weighted seed mixture, `do_sample=True` at eval. **GRPO reward functions (independent, summed — anti-hacking design):** | Function | Signal | |---|---| | `reward_format` | strict JSON parse + valid enums + reason field | | `reward_env_step` | replay history into fresh env, score candidate action via real physics | | `reward_mode_match` | `control_mode` ∈ recommended for active directive | | `reward_anti_spam` | penalty if same action ≥ 4× in last 7 steps | | `reward_fuel_discipline` | low-fuel→idle bonus, low-fuel→large-pulse penalty | **Entry points:** - `training/hf_job_train.py` — UV script for `hf jobs uv run` (cloud, GPU credits) - `training/qwen3_smoke_sft.py` / `qwen3_grpo_train.py` — local script entrypoints Run on cloud (headline 7B, A100): ```bash hf jobs uv run --flavor a100-large --timeout 4h --secrets HF_TOKEN \ -e ORBITAL_BASE_MODEL=Qwen/Qwen2.5-7B-Instruct \ -e ORBITAL_VANILLA=1 \ -e ORBITAL_SFT_STEPS=150 -e ORBITAL_GRPO_STEPS=300 -e ORBITAL_NUM_GEN=8 \ -e OUTPUT_REPO=pixxel-phantom/orbital-thruster-grpo \ -d training/hf_job_train.py ``` Run on cloud (fast 1.5B, L4): ```bash hf jobs uv run --flavor l4x1 --timeout 2h --secrets HF_TOKEN \ -e ORBITAL_BASE_MODEL=Qwen/Qwen2.5-1.5B-Instruct \ -e ORBITAL_VANILLA=1 \ -e ORBITAL_SFT_STEPS=40 -e ORBITAL_GRPO_STEPS=80 \ -e OUTPUT_REPO=pixxel-phantom/orbital-thruster-grpo-fast \ -d training/hf_job_train.py ``` Training-only deps: [training/requirements.txt](training/requirements.txt). ## Results ### Headline run — `Qwen/Qwen2.5-7B-Instruct`, A100-large, 150 SFT + 300 GRPO **SFT phase:** loss 2.33 → ~0.5 on 384 expert traces (JSON + control-mode priming). **GRPO phase:** loss converged to **0.156** plateau, total reward **~2.0 sustained** across all 300 steps, `reward_format = 1.0` from step ~2 (perfect JSON throughout), `reward_mode_match = 0.5` (constant — model picked the recommended mode every step), `frac_reward_zero_std = 0.0` for the entire run (mode-collapse trap broken — see "Plan that produced this run" below). ### GRPO Training Curves ![GRPO per-component reward and loss over training steps](outputs/training/grpo_metrics.png) *GRPO training curves (7B run). Top: per-component reward breakdown (`reward_format`, `reward_env_step`, `reward_mode_match`, `reward_anti_spam`, `reward_fuel_discipline`). Bottom: policy loss. `reward_format = 1.0` from step ~10 (perfect JSON). `reward_env_step` carries the real physics signal at ~0.6–0.8. `frac_reward_zero_std` stays at 0 for all 300 steps — the policy keeps generating diverse rollouts, so the GRPO advantage is non-degenerate throughout training.* ### Per-Component Reward at Convergence (7B, final GRPO steps) | Component | Step 2 | Step 300 | |---|---|---| | `reward_format` | 1.0 | **1.0** (perfect JSON throughout) | | `reward_env_step` | 0.59 | 0.60 (variable, physics-backed) | | `reward_mode_match` | 0.50 | **0.50** (always picks recommended mode) | | `reward_anti_spam` | −0.03 | −0.10 (small repetition penalty) | | `reward_fuel_discipline` | 0.0 | 0.0 | | **Total** | **2.06** | **2.00** | ### Trained vs Baselines (4-task rollout, fixed seeds) | Policy | Easy (detumble) | Medium (retarget) | Hard (hold) | Flagship | Fuel Used (flagship) | Milestones | |---|---|---|---|---|---|---| | Random | 23.9 / fail | 3.2 / fail | −25.3 / fail | −53.5 / fail | 90.0 | 0 | | Deterministic PD | 17.6 / **pass** | 97.4 / fail | 21.1 / fail | 89.8 / fail | 90.0 | 2 | | Tuned PD | 34.2 / **pass** | **120.1 / pass** | 27.5 / fail | **115.8 / fail** | 88.8 | 0 | | **Trained (GRPO, 1.5B)** — fast L4 run | 9.2 | 38.3 | **88.0** | 22.6 | 0.0 | 0 | | **Trained (GRPO, 7B)** — headline A100 run | 12.4 | 33.3 | **43.8** | 11.0 | 90.0 | 0 | ![Trained model vs baseline policies across all tasks](outputs/eval_trained/trained_vs_baseline.png) *Trained vs baselines: reward totals per task. The 7B trained model still beats every heuristic on `long_horizon_precision_hold` (43.8 vs 27.5 tuned PD), and now uses non-zero fuel on every task (52.8–120 across the four tasks vs 0.0 for the 1.5B run) — proving the policy actively explores rather than collapsing to passive `HOLD_POSITION`. The flagship score is still below tuned PD, which is the next item to address.* ### Key Observations **What worked:** - `reward_format = 1.0` from step ~10 and held. SFT priming was decisive — without it, GRPO burns its budget learning JSON syntax. - `frac_reward_zero_std` stayed at 0 for the entire 300-step run. The combination of `temperature=1.3`, `num_generations=8`, and the curriculum-weighted seed mixture kept rollouts diverse, so the GRPO advantage normalisation never divided by zero. This is the classic mode-collapse trap that ate ~190 steps of an earlier run. - The 7B model **uses fuel on every task** (52.8 / 120.0 / 85.0 / 90.0 across the four tasks). The earlier 1.5B run learned to idle to game the precision-hold reward; the 7B run actually maneuvers. - The 7B model still **beats every heuristic on the hard precision-hold task** (43.8 > 27.5 tuned PD), so it learned a non-trivial control policy, not just a passive policy. - No reward hacking was observed across any reward component (the rubric is the anti-hacking story). **What needs more training / the next iteration:** - Flagship score 11.0 is below tuned PD's 115.8. The 7B model commits to maneuvers but does not yet land milestones — `directive_completion_ratio = 0` on every task. The next run should up-weight `mission_ops_long_horizon` in the curriculum (currently 10%) and run longer GRPO (≥ 600 steps) so the model sees more milestone-transition gradient. - `success = 0` on all four tasks for the 7B run. Easy/medium scores are below tuned PD because the model is exploring with high temperature instead of executing the tight expert maneuver. A second-stage GRPO with `temperature=0.7` and milestone-weighted rewards is the natural follow-up. ### Plan that produced this run The 7B headline run was the result of a six-issue debugging plan against an earlier 4B run that produced a degenerate model (`fuel_used=0` everywhere, `success=0`, mode-collapse by step ~60). The plan: 1. SFT→GRPO LoRA rank mismatch (`r=32` vs `r=16`) → silent warm-start failure. **Fix:** unify on `r=16, alpha=16`. 2. GRPO mode collapse (`temperature=0.9`, `num_generations=6`). **Fix:** raise to `1.3` and `8`. 3. Eval greedy collapse (`do_sample=False`) → always passive HOLD policy → `fuel_used=0`. **Fix:** `do_sample=True, temperature=0.7`. 4. SFT too few steps. **Fix:** `80 → 150` steps; `256 → 384` records. 5. `reward_mode_match` too weak. **Fix:** `+0.25/−0.15 → +0.5/−0.3`. 6. `reward_anti_spam` insufficient pressure to break passive policy. **Fix:** `−0.4/−0.15 → −0.6/−0.25`. Reward function logic and task definitions were not modified. The judging story (5 independent rewards, multi-component) is preserved. ## Local Usage ```powershell pip install -e . uvicorn server.app:app --host 0.0.0.0 --port 7860 python validate.py ``` ## Training Usage ```powershell python training/generate_seed_trajectories.py python training/evaluate_baselines.py python training/qwen3_smoke_sft.py python training/qwen3_grpo_train.py ``` ## Docker ```powershell docker build -t orbital-thruster-env . docker run -p 7860:7860 orbital-thruster-env ``` ## Inference `API_BASE_URL` and `MODEL_NAME` can be overridden at runtime. `HF_TOKEN` is required for remote inference. ```powershell $env:API_BASE_URL = "https://router.huggingface.co/v1" $env:MODEL_NAME = "Qwen/Qwen3-8B" $env:HF_TOKEN = "hf_xxx" python inference.py ``` ## Validation The validation script checks: - four tasks present - mission-planning observation fields exposed - action schema requires `control_mode` - reward rubric surfaced on `/step` - cumulative reward columns surfaced on `/state` ```powershell python validate.py ```