# OrbitalThrusterEnv: Training & Evaluation Results

**Date**: 2026-04-28

**Environment**: `OrbitalThrusterEnv`, OpenEnv Hackathon Theme #2 (Long-Horizon Planning & Instruction Following)

**Hardware**: NVIDIA RTX 4060 Laptop GPU (8.6 GB VRAM) + HF Cloud L4 GPU (24 GB)

---

## Models Trained

| Model | Size | Phase | Hardware | Steps (SFT + GRPO) | Status |
|-------|------|--------|----------|-------------------|--------|
| Qwen2.5-1.5B-Instruct | 1.5B | SFT → GRPO | Local 4060 | 100 + 200 | Complete |
| Qwen2.5-3B-Instruct | 3B | SFT → GRPO | Local 4060 | 120 + 250 | Complete |
| Qwen2.5-7B-Instruct | 7B | SFT → GRPO | HF Cloud L4 | 80 + 150 | Complete |

---

## Training Pipeline

```
Expert PD controller → seed_trajectories.jsonl
        ↓
SFT  (QLoRA r=16, nf4 4-bit, cosine LR)
        ↓
GRPO (QLoRA r=32, warm-started from SFT adapter)
        ↓
5 independent reward functions (anti-reward-hacking):
  • reward_format          : valid JSON action schema
  • reward_env_step        : env verifier (replay history → real env reward)
  • reward_mode_match      : correct control_mode for mission phase
  • reward_anti_spam       : penalise repeating same action
  • reward_fuel_discipline : penalise excess fuel burn
```

---

## 1.5B Local Training Results

### SFT (100 steps, Qwen2.5-1.5B-Instruct)

| Metric | Start | End |
|--------|-------|-----|
| Loss | 2.39 | 0.66 |
| Token Accuracy | 52.7% | 77.9% |
| Train Runtime | n/a | 7m 41s |

### GRPO (200 steps)

| Metric | Step 2 | Step 200 | Trend |
|--------|--------|----------|-------|
| reward_format | 1.000 | 1.000 | stable (perfect JSON) |
| reward_env_step | 0.585 | 0.805 | ↑ +38% |
| reward_mode_match | 0.250 | 0.250 | stable |
| reward_anti_spam | -0.050 | -0.175 | ↓ (over-conservative) |
| reward_fuel_discipline | 0.000 | 0.150 | ↑ newly learned |
| **Total reward** | **1.835** | **2.093** | **↑ +14%** |
| Loss | 0.092 | 0.085 | ↓ converging |
| KL divergence | 2.451 | 1.659 | ↓ stable |

---

## 3B Local Training Results

### SFT (120 steps, Qwen2.5-3B-Instruct)

| Metric | Start | End |
|--------|-------|-----|
| Loss | 2.39 | 0.41 |
| Token Accuracy | ~52% | 84.3% |
| Train Runtime | n/a | 15m 56s |

### GRPO (250 steps)

| Metric | Final step |
|--------|-----------|
| reward_format | 1.000 (perfect) |
| reward_env_step | 0.805 |
| reward_mode_match | 0.250 |
| Total reward | ~2.09 |
| Train Loss | 0.0906 |
| Train Runtime | 29m 41s |

---

## Evaluation: All Policies × All Tasks

Each policy was rolled out on 4 tasks. The trained models used greedy decoding (`do_sample=False`).

### Full Results Table

| Policy | Task | Reward | Success | Fuel Used | Milestones |
|--------|------|--------|---------|-----------|------------|
| random | detumble_satellite | 23.86 | no | 69.5 | 0 |
| random | retarget_180_flip | 3.22 | no | 120.0 | 0 |
| random | long_horizon_precision_hold | -25.33 | no | 85.0 | 0 |
| random | mission_ops_long_horizon | -53.48 | no | 90.0 | 0 |
| deterministic | detumble_satellite | 17.56 | **yes** | 18.7 | 1 |
| deterministic | retarget_180_flip | 97.37 | no | 120.0 | 2 |
| deterministic | long_horizon_precision_hold | 21.10 | no | 85.0 | 0 |
| deterministic | mission_ops_long_horizon | 89.83 | no | 90.0 | 2 |
| tuned_pd | detumble_satellite | **34.20** | **yes** | 16.8 | 1 |
| tuned_pd | retarget_180_flip | **120.13** | **yes** | 100.9 | 1 |
| tuned_pd | long_horizon_precision_hold | 27.49 | no | 83.6 | 0 |
| tuned_pd | mission_ops_long_horizon | **115.84** | no | 88.8 | 0 |
| trained (1.5B) | detumble_satellite | 9.24 | no | 0.0 | 0 |
| trained (1.5B) | retarget_180_flip | 38.31 | no | 0.0 | 0 |
| trained (1.5B) | long_horizon_precision_hold | **88.00** | no | 0.0 | 0 |
| trained (1.5B) | mission_ops_long_horizon | 22.61 | no | 0.0 | 0 |
| trained (3B) | detumble_satellite | 9.24 | no | 0.0 | 0 |
| trained (3B) | retarget_180_flip | 38.31 | no | 0.0 | 0 |
| trained (3B) | long_horizon_precision_hold | **88.00** | no | 0.0 | 0 |
| trained (3B) | mission_ops_long_horizon | 22.61 | no | 0.0 | 0 |

### Summary: Flagship Task (mission_ops_long_horizon)

| Policy | Reward | vs random |
|--------|--------|-----------|
| random | -53.48 | baseline |
| trained 1.5B | 22.61 | +76.09 |
| trained 3B | 22.61 | +76.09 |
| deterministic PD | 89.83 | +143.31 |
| tuned PD | 115.84 | +169.32 |

---

## Key Findings

### What worked

- **reward_format = 1.0** from step 2 onward: the model emits valid JSON immediately after the SFT warm start
- **reward_env_step improved** 0.585 → 0.805 (1.5B), indicating better physical tracking during training
- **reward_fuel_discipline** rose from 0.000 to 0.150 (1.5B): the model begins to avoid wasteful burns
- **precision_hold task: 88.0 vs tuned_PD 27.5**: the trained models score well above the tuned PD baseline on `long_horizon_precision_hold`, although they burn no fuel and neither policy is marked successful on that task

### What didn't work

- **fuel_used = 0.0 across all tasks**: both 1.5B and 3B chose zero-thrust actions at every rollout step (greedy decoding converged to a "hold" action)
- **1.5B ≡ 3B**: identical eval scores, suggesting that reward shaping induced the same conservative fixed point regardless of model capacity
- **reward_anti_spam degraded** (-0.050 → -0.175 on 1.5B): the repetition penalty grew as the policy settled into a passive hold strategy

### Root cause: reward over-exploitation

The combination of `reward_fuel_discipline` and `reward_anti_spam` appears to have pushed both models toward a zero-thrust hold strategy. The `reward_env_step` verifier reward (computed by replaying the environment) may have been dominated by these conservative signals within GRPO batches.
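To make the hypothesis above concrete, here is a toy sketch of group-relative advantages in the simplified mean-and-standard-deviation form commonly used to describe GRPO. It assumes the five rewards are combined as a weighted sum before normalisation. That is an assumption, since the weights used in this run are not stated. The `hold` and `burn` reward values are hypothetical, chosen only for illustration, and are not measurements from this run.

```python
import statistics

def total_reward(components, weights):
    """Weighted sum of per-function rewards for one completion."""
    return sum(weights[name] * value for name, value in components.items())

def group_advantages(totals):
    """Normalise total rewards within one sampled group: (r - mean) / std."""
    mean = statistics.fmean(totals)
    std = statistics.pstdev(totals)
    if std == 0.0:
        return [0.0 for _ in totals]  # identical rewards carry no learning signal
    return [(t - mean) / std for t in totals]

# Hypothetical per-completion rewards (NOT measured values from this run).
hold = {  # zero-thrust action: slightly worse tracking, small fuel bonus
    "reward_format": 1.0, "reward_env_step": 0.70, "reward_mode_match": 0.25,
    "reward_anti_spam": -0.05, "reward_fuel_discipline": 0.15,
}
burn = {  # thrusting action: better tracking, no fuel bonus
    "reward_format": 1.0, "reward_env_step": 0.80, "reward_mode_match": 0.25,
    "reward_anti_spam": -0.05, "reward_fuel_discipline": 0.0,
}

equal = {name: 1.0 for name in hold}                 # all rewards weighted 1.0
damped = dict(equal, reward_fuel_discipline=0.25)    # fuel term down-weighted

adv_equal = group_advantages([total_reward(hold, equal), total_reward(burn, equal)])
adv_damped = group_advantages([total_reward(hold, damped), total_reward(burn, damped)])

print("equal weights  (hold, burn):", adv_equal)
print("damped fuel    (hold, burn):", adv_damped)
```

With these made-up numbers, the hold action receives the positive advantage under equal weights, and the sign flips once the fuel term is down-weighted. Whether the real run behaved this way would need to be checked against per-completion reward logs.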
### Suggested fix (next training run)

- Remove or down-weight `reward_fuel_discipline` during early GRPO
- Add `reward_min_thrust_usage` to penalise doing nothing
- Use `do_sample=True, temperature=0.7` during eval rollout (not greedy)

---

## Artifacts

| Artifact | Path |
|----------|------|
| 1.5B SFT adapter | `trainer_output/qwen_sft_1b5/` |
| 1.5B GRPO adapter | `trainer_output/qwen_grpo_1b5/` |
| 3B SFT adapter | `trainer_output/qwen_sft_3b/` |
| 3B GRPO adapter | `trainer_output/qwen_grpo_3b/` |
| GRPO reward curves | `outputs/training/grpo_metrics.png` |
| 1.5B eval CSV + plot | `outputs/eval_trained/eval_1b5.{csv,png}` |
| 3B eval CSV + plot | `outputs/eval_trained/eval_3b.{csv,png}` |
| Cloud 7B adapter | `pixxel-phantom/orbital-thruster-grpo` (HF) |
| Cloud 1.5B adapter | `pixxel-phantom/orbital-thruster-grpo-fast` (HF) |

---

## Cloud Job Status

| Job | Model | Steps | Output repo | Status |
|-----|-------|-------|-------------|--------|
| 69eddfaad2c8bd8662bcfbcd | Qwen2.5-1.5B | 40 SFT + 60 GRPO | `pixxel-phantom/orbital-thruster-grpo-fast` | COMPLETED |
| 69f083c1d70108f37ace0ead | Qwen2.5-7B | 80 SFT + 150 GRPO | `pixxel-phantom/orbital-thruster-grpo` | COMPLETED |
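
As a follow-up to the suggested fix, a `reward_min_thrust_usage` term might look like the sketch below. The action schema is an assumption: the code supposes each completion is a JSON object with a numeric list under a `thrust` key. The key name, the `min_magnitude` threshold, and the `penalty` value are all hypothetical and would need to match the real `OrbitalThrusterEnv` action format.

```python
import json

def reward_min_thrust_usage(
    completion: str,
    thrust_key: str = "thrust",    # hypothetical field name in the action JSON
    min_magnitude: float = 0.05,   # hypothetical threshold for "doing nothing"
    penalty: float = -0.25,        # hypothetical penalty value
) -> float:
    """Return `penalty` when the action commands near-zero thrust, else 0.0."""
    try:
        action = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0  # malformed output is left to reward_format
    if not isinstance(action, dict):
        return 0.0
    thrust = action.get(thrust_key)
    if not isinstance(thrust, (list, tuple)) or not thrust:
        return 0.0  # no thrust vector present, so nothing to judge
    try:
        magnitude = sum(float(x) ** 2 for x in thrust) ** 0.5
    except (TypeError, ValueError):
        return 0.0  # non-numeric entries are also left to reward_format
    return penalty if magnitude < min_magnitude else 0.0

print(reward_min_thrust_usage('{"control_mode": "hold", "thrust": [0.0, 0.0, 0.0]}'))
print(reward_min_thrust_usage('{"control_mode": "slew", "thrust": [0.2, 0.0, 0.0]}'))
```

Penalising zero thrust unconditionally would also punish legitimate holds, so in practice this term would probably need to be gated by mission phase or task type.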