nraptisss/tmf921-intent-training

@@ -17,6 +17,8 @@ Current primary model: **stage-1 Qwen3-8B QLoRA adapter**.
 Stage 2 status: **diagnostic / not promoted**.
 Best stage-1 normalized metrics:
 | Split | JSON parse | Normalized field F1 | Normalized key F1 |
@@ -347,26 +349,6 @@ Current publication-ready assets:
 ---
-## Current open research questions
-1. Should O1 NRM be evaluated with a layer-specific semantic evaluator rather than flat field F1?
-2. Are monitoring/report rows deterministic enough for exact field comparison, or do they require tolerance/semantic scoring?
-3. Should Gen4 add canonical scenario-level fields to support official validators and cross-layer tuple generation?
-4. Can official or derived validators be added for TMF921/CAMARA/A1/O1?
-## Next recommended step
-Write the first manuscript draft using:
-- `paper/outline.md`,
-- `paper/tables.md`,
-- `PROJECT_JOURNAL.md`,
-- `results/stage1_vs_stage2_comparison.md`,
-- `results/baselines/zero_shot_vs_finetuned.md`,
-- `analysis/stage1_examples/failure_examples.md`.
----
 ## 2026-05-07 — O1/A1 semantic evaluator results added
 ### Goal
@@ -453,3 +435,116 @@ Artifacts added:
 - `results/semantic/o1_a1_stage1_vs_stage2.md`
 - `results/semantic/o1_a1_stage1_vs_stage2_summary.json`

 Stage 2 status: **diagnostic / not promoted**.
+Stage 3 (GRPO) status: **failed — negative result documented**.
+---
+## 2026-05-12 — GRPO post-SFT experiments (v1, v2, v3) — negative result
+### Goal
+Apply Group Relative Policy Optimization (GRPO) as a post-SFT stage to improve value fidelity on the weak layers (O1 NRM, A1 policy). The motivation was the RL-Struct paper (arxiv:2512.00319), which reports that GRPO with a multi-component reward improves structured-output quality by 15-20% over SFT alone.
+### Approach
+Three iterations were attempted, each fixing issues found in the previous one:
+**v1** (`scripts/train_grpo.py`):
+- Config: G=2, max_completion=768, temp=0.7, beta=0.01, lr=1e-5
+- 3 separate reward functions (validity, key F1, value F1) weighted 0.1/0.2/0.7
+- Result: Model catastrophically forgot JSON generation. JSON parse rate dropped from 100% to 38%. Field F1 dropped from 68.7% to 0.4%.
+- Root cause: temp=0.7 produced garbage completions. beta=0.01 was too low to prevent drift.
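A minimal sketch of how the three v1 reward components could be combined with the 0.1/0.2/0.7 weights. The flattening scheme, the F1 definitions, and the names `flatten`, `set_f1`, and `combined_reward` are assumptions for illustration; `scripts/train_grpo.py` used three separate reward functions and may define them differently.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested JSON into a {dotted.path: leaf_value} dict."""
    leaves = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            leaves.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            leaves.update(flatten(value, f"{prefix}{index}."))
    else:
        leaves[prefix.rstrip(".")] = obj
    return leaves


def set_f1(predicted, reference):
    """F1 overlap between two sets; 0.0 when nothing matches."""
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def combined_reward(completion, reference, weights=(0.1, 0.2, 0.7)):
    """Weighted sum of validity, key F1, and value F1 (hypothetical definitions)."""
    try:
        predicted = flatten(json.loads(completion))
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return 0.0
    expected = flatten(reference)
    key_f1 = set_f1(set(predicted), set(expected))
    # Compare (path, serialized value) pairs so that True and 1 stay distinct.
    value_f1 = set_f1(
        {(k, json.dumps(v)) for k, v in predicted.items()},
        {(k, json.dumps(v)) for k, v in expected.items()},
    )
    w_valid, w_key, w_value = weights
    return w_valid * 1.0 + w_key * key_f1 + w_value * value_f1
```

Under this sketch, a completion with the right keys but one wrong value out of two keeps the full validity and key credit and loses only part of the value term.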
+**v2** (`scripts/train_grpo_v2.py`):
+- Fixes: temp=0.3, beta=0.1, single dense reward with partial credit, all rewards ≥ 0, lr=5e-6
+- Config: G=2, max_completion=512, dense reward function
+- Result: Training was stable (no divergence), but model still regressed. JSON parse rate 45%, field F1 ≈ 0%.
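+- Root cause: `completions/clipped_ratio` was 0.5-1.0 throughout training, meaning that between half and all of the completions in each batch hit the 512-token limit. Dataset configs are 700-1300 tokens long, so a complete target could not fit and most completions were cut off before they formed valid JSON.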
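The length mismatch behind the v2 failure can be checked before training. The sketch below is illustrative only: `clipped_ratio` and `fits_budget` are hypothetical helpers that work on token counts, not functions from the training scripts or from any library.

```python
def clipped_ratio(completion_lengths, max_completion_length):
    """Fraction of completions that reached the generation limit."""
    if not completion_lengths:
        return 0.0
    clipped = sum(1 for n in completion_lengths if n >= max_completion_length)
    return clipped / len(completion_lengths)


def fits_budget(reference_lengths, max_completion_length):
    """True only if the longest reference target fits within the limit."""
    return max(reference_lengths) <= max_completion_length
```

With reference targets between 700 and 1300 tokens, `fits_budget` is false for a 512-token limit and true for 1536, which matches the change made in v3.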
+**v3** (`scripts/train_grpo_v3.py`):
+- Fix: max_completion=1536 (full length), G=4, temp=0.4
+- GPU memory was only 6.1GB/48GB in v2, so we had massive headroom
+- Result: Same regression. JSON parse rate 40%, field F1 ≈ 0%.
+- Root cause: `frac_reward_zero_std` remained 0.8-1.0 throughout all 300 steps. The model's output entropy is extremely low (0.03-0.06), meaning even with G=4 and temp=0.4, all completions are nearly identical → same reward → zero GRPO gradient signal.
+### Key metrics from training logs
+| Metric | v1 | v2 | v3 | Interpretation |
+|---|---|---|---|---|
+| GPU memory | 6.1 GB | 6.1 GB | 6.1 GB | VRAM was never the bottleneck |
+| `frac_reward_zero_std` | 0.7-1.0 | 0.8-1.0 | 0.8-1.0 | No variance → no learning |
+| `clipped_ratio` | 0.3-0.5 | 0.5-1.0 | 0.3-0.7 | v2 was worst (512 too short) |
+| `entropy` | 0.08-0.17 | 0.03-0.06 | 0.03-0.06 | Model is extremely deterministic |
+| `dense_reward/mean` | N/A | 0.10-0.25 | 0.10-0.25 | Completions get partial credit but no variance |
+### Evaluation results (all three versions regressed)
+| Model | JSON parse | Field F1 (raw) | Status |
+|---|---:|---:|---|
+| **SFT Stage 1** | **100%** | **68.7%** | ✅ Best |
+| GRPO v1 | 38% | 0.4% | ❌ Catastrophic |
+| GRPO v2 | 45% | 0.03% | ❌ Catastrophic |
+| GRPO v3 | 40% | 0.08% | ❌ Catastrophic |
+### Root cause analysis
+The fundamental problem is that **GRPO requires reward variance between the completions within each group** to compute advantages. The SFT model has extremely low output entropy (~0.04 nats) because it learned a near-deterministic mapping from prompts to JSON configs. Even at temperature 0.4 with G=4, the 4 completions are nearly identical, so `frac_reward_zero_std` stayed at 0.8-1.0: in most groups every completion received the same reward.
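+With zero reward variance in a group, the GRPO advantage is zero for every token in that group, so:
+- The policy-gradient signal comes only from the minority of groups with non-zero variance, where dividing by a very small standard deviation can inflate negligible reward differences into full-size advantages; the LoRA adapters effectively train on noise
+- Over 200-300 steps, this noise accumulates into drift
+- The model loses its SFT-learned JSON generation ability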
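The zero-variance argument can be made concrete with a small sketch of group-relative advantages, `(reward - group mean) / (group std + eps)`. The use of the population standard deviation, the `eps` value, and the function names are assumptions; trainer implementations differ in these details.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of G completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def frac_zero_std(groups, tol=1e-8):
    """Fraction of groups whose rewards are all (numerically) identical."""
    return sum(1 for g in groups if pstdev(g) <= tol) / len(groups)


identical = group_advantages([0.20, 0.20, 0.20, 0.20])    # every advantage is 0.0
near_identical = group_advantages([0.20, 0.20, 0.20, 0.21])
```

In the first group no completion is preferred, so the group contributes no policy-gradient signal. In the second, a reward gap of 0.01 is normalized into an advantage well above 1 for the last completion, so a negligible difference drives a full-size update.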
+Additionally, the GRPO adapter is a freshly initialized LoRA adapter trained on top of the SFT-merged base. During evaluation, loading `base + GRPO_adapter` (without the SFT merge) produces a model that never learned JSON generation in the first place, so the GRPO adapter is only meaningful when applied to the same SFT-merged weights it was trained on.
+### Why GRPO works for math/reasoning but not for this task
+GRPO succeeds on math/reasoning (DeepSeekMath, R1) because:
+1. Math problems have **high output entropy** — many valid reasoning paths
+2. The reward is **binary** (correct/incorrect) — easy to get variance
+3. Temperature can be high (0.7-1.0) without generating garbage
+This structured JSON task is the opposite:
+1. **Low output entropy** — there's essentially one correct JSON config per prompt
+2. The reward is **continuous** (F1 score) — small differences, hard to distinguish
+3. Sampling at any temperature > 0 can only move the model away from its greedy SFT output, which already parses as JSON 100% of the time
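The interaction between a peaked token distribution and sampling temperature can be illustrated numerically. This is a toy sketch over a four-token vocabulary with made-up logits, not a measurement from the model; the function names are illustrative.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the sampling temperature."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_nats(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


peaked_logits = [10.0, 0.0, 0.0, 0.0]   # one token dominates, as after SFT
flat_logits = [0.0, 0.0, 0.0, 0.0]      # many equally plausible continuations

h_peaked_t10 = entropy_nats(softmax_with_temperature(peaked_logits, 1.0))
h_peaked_t04 = entropy_nats(softmax_with_temperature(peaked_logits, 0.4))
h_flat_t10 = entropy_nats(softmax_with_temperature(flat_logits, 1.0))
```

For the peaked logits the entropy is already close to zero at temperature 1.0 and shrinks further at 0.4, while the flat logits keep the maximum entropy of ln(4). A model in the first regime produces almost the same sample every time, which is the situation described for the SFT model.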
+### Decision
+**GRPO is not suitable for this task in its current form.** Stage 1 SFT remains the primary model.
+### Alternatives and changes that might work here (future work)
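+1. **Best-of-N rejection sampling** instead of GRPO: generate N completions per prompt at a non-zero temperature (e.g. temp=0.3; at temp=0 decoding is greedy and all N would be identical), score them, and fine-tune on the best ones. Groups with identical samples add no new information but also no noisy gradient
+2. **DPO with synthetic preferences**: generate pairs at temp=0 vs temp=0.3, and have a human or LLM judge select the better one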
+3. **Reward model + PPO** — train a separate reward model, use PPO which doesn't require within-group variance
+4. **Higher G (16-32)** with temperature 0.8+ — requires multi-GPU (A100x4+) to get enough diversity
+5. **Constrained decoding** (Outlines/XGrammar) — guarantee JSON validity, then use GRPO only for value selection within valid completions
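A minimal sketch of the best-of-N selection step from item 1. `generate` and `score` are hypothetical callables supplied by the caller (for example, a sampling wrapper around the SFT model and the dense reward function); neither is an API from the repository or from a library.

```python
def best_of_n(prompt, generate, score, n=8):
    """Sample n completions and return the highest-scoring one with its score."""
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(score(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def build_sft_examples(prompts, generate, score, n=8, min_score=0.0):
    """Keep one (prompt, completion) pair per prompt whose best score exceeds min_score."""
    examples = []
    for prompt in prompts:
        completion, value = best_of_n(prompt, generate, score, n=n)
        if value > min_score:
            examples.append({"prompt": prompt, "completion": completion})
    return examples
```

The `min_score` filter is an assumption added here so that prompts where every sample fails (for example, no valid JSON) do not enter the fine-tuning set.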
+### Artifacts
+Scripts (preserved for reproducibility):
+- `scripts/train_grpo.py` (v1)
+- `scripts/train_grpo_v2.py` (v2)
+- `scripts/train_grpo_v3.py` (v3)
+Hub models (negative results, not for use):
+- https://huggingface.co/nraptisss/Qwen3-8B-TMF921-Intent-GRPO
+- https://huggingface.co/nraptisss/Qwen3-8B-TMF921-Intent-GRPO-v2
+- https://huggingface.co/nraptisss/Qwen3-8B-TMF921-Intent-GRPO-v3
+---
+## Current open research questions
+1. Should O1 NRM be evaluated with a layer-specific semantic evaluator rather than flat field F1?
+2. Are monitoring/report rows deterministic enough for exact field comparison, or do they require tolerance/semantic scoring?
+3. Should Gen4 add canonical scenario-level fields to support official validators and cross-layer tuple generation?
+4. Can official or derived validators be added for TMF921/CAMARA/A1/O1?
+5. Would Best-of-N rejection sampling or DPO with synthetic preferences improve value fidelity where GRPO failed?
+6. Would a code-pretrained base model (Qwen2.5-Coder-7B) perform better on structured JSON value assignment?
+## Next recommended steps
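+1. **Best-of-N rejection sampling**: Generate N=8 completions per prompt at temp=0.3, score them with the dense reward function, and fine-tune on the top-1 completion. Unlike GRPO, the training signal does not depend on within-group reward variance: groups whose samples all tie simply add no new information.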
+2. **Try Qwen2.5-Coder-7B as base**: Code-pretrained models may have stronger JSON/structured-output priors. Run the same SFT recipe and compare.
+3. **DPO with synthetic pairs**: Use the SFT model to generate pairs (temp=0 vs temp=0.5), label the better one with the reward function, train DPO.
+4. Write the final manuscript incorporating GRPO as a documented negative result.
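For step 3, a preference pair could be labelled with the reward function roughly as follows. This is a sketch under assumptions: `build_preference_pair`, the `min_margin` threshold, and the `score` callable are illustrative, and the `prompt`/`chosen`/`rejected` keys follow the layout commonly used for preference datasets rather than anything confirmed by the repository.

```python
def build_preference_pair(prompt, completion_a, completion_b, score, min_margin=0.05):
    """Return a preference record, or None when the two scores are too close to rank."""
    score_a = score(completion_a)
    score_b = score(completion_b)
    if abs(score_a - score_b) < min_margin:
        return None  # near-ties would only add label noise
    if score_a > score_b:
        chosen, rejected = completion_a, completion_b
    else:
        chosen, rejected = completion_b, completion_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

Dropping near-ties is a design choice in this sketch: given how similar the sampled completions were during GRPO, many pairs may score almost identically, and keeping them would reintroduce the low-signal problem in a different form.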