{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# OrbitalThrusterEnv — RL Training Notebook (SFT → GRPO)\n", "\n", "Trains a small open LLM to act as a long-horizon mission-ops controller inside `OrbitalThrusterEnv`.\n", "\n", "**Stack**: Unsloth + TRL (`SFTTrainer` → `GRPOTrainer`) + OpenEnv verifier rewards.\n", "\n", "**Default model**: `Qwen/Qwen2.5-3B-Instruct` (fits 4060 laptop, 8 GB VRAM, 4-bit). Override via `ORBITAL_BASE_MODEL` env var to `Qwen/Qwen3-4B-Instruct-2507` when running on Hugging Face credits.\n", "\n", "**Pipeline**: seed trajectories → SFT (format priming) → GRPO with 5 independent reward funcs → eval vs baselines → plots.\n", "\n", "**Anti-reward-hacking**: format check, env-step verifier, recommended-mode check, action-spam penalty, fuel-discipline penalty (5 independent signals)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Environment setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os, sys\n", "from pathlib import Path\n", "ROOT = Path.cwd().parent if Path.cwd().name == 'training' else Path.cwd()\n", "if str(ROOT) not in sys.path: sys.path.insert(0, str(ROOT))\n", "if str(ROOT / 'training') not in sys.path: sys.path.insert(0, str(ROOT / 'training'))\n", "os.environ.setdefault('ORBITAL_BASE_MODEL', 'Qwen/Qwen2.5-3B-Instruct')\n", "print('ROOT:', ROOT)\n", "print('MODEL:', os.environ['ORBITAL_BASE_MODEL'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# One-time install (uncomment when running fresh)\n", "# !pip install -q -r training/requirements.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 
Generate / load seed trajectories from the tuned-PD expert" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from common import collect_seed_records\n", "seed_path = ROOT / 'training' / 'data' / 'seed_trajectories.jsonl'\n", "if not seed_path.exists():\n", " collect_seed_records(seed_path, episodes_per_task=3, max_records_per_task=128)\n", "print('seed file size:', seed_path.stat().st_size, 'bytes')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Baseline rollout (random / deterministic / tuned-PD)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import subprocess\n", "subprocess.run([sys.executable, str(ROOT / 'training' / 'evaluate_baselines.py')], check=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. SFT — short JSON-format priming (~80 steps QLoRA)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "from qwen3_smoke_sft import main as run_sft\nrun_sft()" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. GRPO — verifier-driven RL with multi-component rewards\n", "\n", "Reward components (each independent, summed by GRPO):\n", "- `reward_format` — JSON parses + valid enums + has reason\n", "- `reward_env_step` — replays history, scores candidate via real env\n", "- `reward_mode_match` — control_mode ∈ recommended for current directive\n", "- `reward_anti_spam` — penalty if action repeated ≥ 4× in last 6 steps\n", "- `reward_fuel_discipline` — bonus low-fuel→idle, penalty low-fuel→large pulse" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "from qwen3_grpo_train import main as run_grpo\nrun_grpo()" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. 
Reward + loss curves" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from rl_utils import TRAIN_LOG_DIR, plot_training_curves\n", "plot_training_curves(TRAIN_LOG_DIR / 'grpo_metrics.csv', TRAIN_LOG_DIR / 'grpo_metrics.png')\n", "from IPython.display import Image, display\n", "png = TRAIN_LOG_DIR / 'grpo_metrics.png'\n", "if png.exists(): display(Image(str(png)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Eval — baseline vs trained on all 4 tasks" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "subprocess.run([sys.executable, str(ROOT / 'training' / 'eval_trained_model.py')], check=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import Image, display\n", "import pandas as pd\n", "out_dir = ROOT / 'outputs' / 'eval_trained'\n", "df = pd.read_csv(out_dir / 'trained_vs_baseline.csv')\n", "display(df)\n", "display(Image(str(out_dir / 'trained_vs_baseline.png')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. 
Anti-reward-hacking sanity checks" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "from collections import Counter\n", "from common import rollout_task, TASK_IDS\n", "from rl_utils import GRPO_OUTPUT_DIR, DEFAULT_MODEL, make_lora_controller\n", "ctrl = make_lora_controller(GRPO_OUTPUT_DIR, base_model=DEFAULT_MODEL)\n", "task = 'mission_ops_long_horizon'\n", "trace = rollout_task(task, ctrl, record_history=True)\n", "actions = [t['expert_action']['action_type'] for t in trace['trace']]\n", "# Fail early with a clear message; an empty trace would otherwise raise IndexError below\n", "assert actions, 'rollout produced no actions'\n", "counts = Counter(actions)\n", "idle_frac = counts.get('idle', 0) / len(actions)\n", "top_action, top_count = counts.most_common(1)[0]\n", "print(f'task={task} success={trace[\"success\"]} reward_total={trace[\"reward_total\"]:.3f}')\n", "print(f'fuel_used={trace[\"fuel_used\"]:.2f} milestones={trace[\"milestones_completed_count\"]}')\n", "print(f'action_diversity_top={top_action}({top_count}) idle_fraction={idle_frac:.2%}')\n", "# Flag degenerate policies: mostly idling, or one action dominating the rollout\n", "assert idle_frac < 0.85, 'all-idle exploit detected'\n", "assert top_count / len(actions) < 0.85, 'single-action exploit detected'\n", "out = ROOT / 'outputs' / 'training' / 'sample_rollout_flagship.json'\n", "# Create the output directory if it does not exist yet, so write_text cannot fail on a fresh checkout\n", "out.parent.mkdir(parents=True, exist_ok=True)\n", "out.write_text(json.dumps({k: trace[k] for k in ['task_id','success','reward_total','fuel_used','milestones_completed_count','reward_columns']}, indent=2))\n", "print('saved:', out)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 9. Save adapter (LoRA only — never naive 4-bit→16-bit merge)\n", "\n", "The GRPO trainer has already saved the adapter to `trainer_output/qwen_grpo/`. 
To push to HF Hub:\n", "\n", "```python\n", "from huggingface_hub import HfApi\n", "HfApi().upload_folder(folder_path='trainer_output/qwen_grpo', repo_id='/orbital-thruster-grpo', repo_type='model')\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.10" } }, "nbformat": 4, "nbformat_minor": 5 }