nraptisss committed · Commit aab71ba · verified · 1 Parent(s): bb2b127

Add GRPO experiment journal entries (v1, v2, v3) — negative results documented with root cause analysis

Files changed (1)
  1. PROJECT_JOURNAL.md +115 -20
PROJECT_JOURNAL.md CHANGED
@@ -17,6 +17,8 @@ Current primary model: **stage-1 Qwen3-8B QLoRA adapter**.
17
 
18
  Stage 2 status: **diagnostic / not promoted**.
19
 
20
  Best stage-1 normalized metrics:
21
 
22
  | Split | JSON parse | Normalized field F1 | Normalized key F1 |
@@ -347,26 +349,6 @@ Current publication-ready assets:
347
 
348
  ---
349
 
350
- ## Current open research questions
351
-
352
- 1. Should O1 NRM be evaluated with a layer-specific semantic evaluator rather than flat field F1?
353
- 2. Are monitoring/report rows deterministic enough for exact field comparison, or do they require tolerance/semantic scoring?
354
- 3. Should Gen4 add canonical scenario-level fields to support official validators and cross-layer tuple generation?
355
- 4. Can official or derived validators be added for TMF921/CAMARA/A1/O1?
356
-
357
- ## Next recommended step
358
-
359
- Write the first manuscript draft using:
360
-
361
- - `paper/outline.md`,
362
- - `paper/tables.md`,
363
- - `PROJECT_JOURNAL.md`,
364
- - `results/stage1_vs_stage2_comparison.md`,
365
- - `results/baselines/zero_shot_vs_finetuned.md`,
366
- - `analysis/stage1_examples/failure_examples.md`.
367
-
368
- ---
369
-
370
  ## 2026-05-07 — O1/A1 semantic evaluator results added
371
 
372
  ### Goal
@@ -453,3 +435,116 @@ Artifacts added:
453
  - `results/semantic/o1_a1_stage1_vs_stage2.md`
454
  - `results/semantic/o1_a1_stage1_vs_stage2_summary.json`
455
 
17
 
18
  Stage 2 status: **diagnostic / not promoted**.
19
 
20
+ Stage 3 (GRPO) status: **failed — negative result documented**.
21
+
22
  Best stage-1 normalized metrics:
23
 
24
  | Split | JSON parse | Normalized field F1 | Normalized key F1 |
 
349
 
350
  ---
351
 
352
  ## 2026-05-07 — O1/A1 semantic evaluator results added
353
 
354
  ### Goal
 
435
  - `results/semantic/o1_a1_stage1_vs_stage2.md`
436
  - `results/semantic/o1_a1_stage1_vs_stage2_summary.json`
437
 
438
+ ---
439
+
440
+ ## 2026-05-12 — GRPO post-SFT experiments (v1, v2, v3) — negative result
441
+
442
+ ### Goal
443
+
444
+ Apply Group Relative Policy Optimization (GRPO) as a post-SFT stage to improve value fidelity on weak layers (O1 NRM, A1 policy). The motivation was the RL-Struct paper (arxiv:2512.00319), which reported that GRPO with a multi-component reward improves structured-output quality by 15-20% over SFT alone.
445
+
446
+ ### Approach
447
+
448
+ Three iterations were attempted, each fixing issues from the previous:
449
+
450
+ **v1** (`scripts/train_grpo.py`):
451
+ - Config: G=2, max_completion=768, temp=0.7, beta=0.01, lr=1e-5
452
+ - 3 separate reward functions (validity, key F1, value F1) weighted 0.1/0.2/0.7
453
+ - Result: Model catastrophically forgot JSON generation. JSON parse rate dropped from 100% to 38%. Field F1 dropped from 68.7% to 0.4%.
454
+ - Root cause: temp=0.7 produced garbage completions, and beta=0.01 was too weak a KL penalty to prevent drift.
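The v1 reward weighting (0.1 validity, 0.2 key F1, 0.7 value F1) can be illustrated with a minimal sketch. This is not the code in `scripts/train_grpo.py`: the helper names `flatten`, `f1`, and `combined_reward` are hypothetical, the three components are folded into one function for brevity (v1 used three separate reward functions), and scoring over flattened dotted paths is an assumption about how field F1 is computed.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested JSON into {dotted.path: leaf_value} (assumed scoring unit)."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}{index}."))
    else:
        items[prefix.rstrip(".")] = obj
    return items


def f1(pred: set, gold: set) -> float:
    """Set-based F1; returns 0.0 when either side is empty or nothing overlaps."""
    true_pos = len(pred & gold)
    if not pred or not gold or true_pos == 0:
        return 0.0
    precision, recall = true_pos / len(pred), true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def combined_reward(completion: str, reference: str, weights=(0.1, 0.2, 0.7)) -> float:
    """Weighted sum of validity, key F1, and value F1 (weights from the v1 config)."""
    try:
        pred = flatten(json.loads(completion))
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return 0.0  # unparseable output earns nothing
    gold = flatten(json.loads(reference))
    key_f1 = f1(set(pred), set(gold))
    value_f1 = f1(
        {(k, json.dumps(v)) for k, v in pred.items()},
        {(k, json.dumps(v)) for k, v in gold.items()},
    )
    w_valid, w_key, w_value = weights
    return w_valid * 1.0 + w_key * key_f1 + w_value * value_f1
```

With these weights a completion that parses and has the right keys but wrong values still earns 0.3, so most of the signal has to come from value differences between completions in the same group.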
455
+
456
+ **v2** (`scripts/train_grpo_v2.py`):
457
+ - Fixes: temp=0.3, beta=0.1, single dense reward with partial credit, all rewards ≥ 0, lr=5e-6
458
+ - Config: G=2, max_completion=512, dense reward function
459
+ - Result: Training was stable (no divergence), but model still regressed. JSON parse rate 45%, field F1 ≈ 0%.
460
+ - Root cause: `completions/clipped_ratio` was 0.5-1.0 throughout training. The 512-token limit truncated all completions before they could form valid JSON (dataset configs are 700-1300 tokens).
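A quick pre-flight check can show whether a completion budget is too small before any training run. The sketch below is an assumption-laden illustration, not TRL's `completions/clipped_ratio` implementation: `fraction_truncated` is a hypothetical helper, and the token lengths are made-up values inside the 700-1300 range quoted above.

```python
def fraction_truncated(target_lengths, max_completion):
    """Fraction of reference outputs that cannot fit in the completion budget."""
    if not target_lengths:
        return 0.0
    too_long = sum(1 for length in target_lengths if length > max_completion)
    return too_long / len(target_lengths)


# Hypothetical token counts for reference configs (illustrative only).
example_lengths = [700, 850, 1020, 1300]

v2_budget = fraction_truncated(example_lengths, max_completion=512)
v3_budget = fraction_truncated(example_lengths, max_completion=1536)
```

Under these assumed lengths every reference exceeds the 512-token budget, while none exceeds 1536.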
461
+
462
+ **v3** (`scripts/train_grpo_v3.py`):
463
+ - Fix: max_completion=1536 (full length), G=4, temp=0.4
464
+ - GPU memory use in v2 was only 6.1 GB of 48 GB, leaving ample headroom for longer completions and a larger group
465
+ - Result: Same regression. JSON parse rate 40%, field F1 ≈ 0%.
466
+ - Root cause: `frac_reward_zero_std` remained 0.8-1.0 throughout all 300 steps. The model's output entropy is extremely low (0.03-0.06), meaning even with G=4 and temp=0.4, all completions are nearly identical → same reward → zero GRPO gradient signal.
467
+
468
+ ### Key metrics from training logs
469
+
470
+ | Metric | v1 | v2 | v3 | Interpretation |
471
+ |---|---|---|---|---|
472
+ | GPU memory | 6.1 GB | 6.1 GB | 6.1 GB | VRAM was never the bottleneck |
473
+ | `frac_reward_zero_std` | 0.7-1.0 | 0.8-1.0 | 0.8-1.0 | No variance → no learning |
474
+ | `clipped_ratio` | 0.3-0.5 | 0.5-1.0 | 0.3-0.7 | v2 was worst (512 too short) |
475
+ | `entropy` | 0.08-0.17 | 0.03-0.06 | 0.03-0.06 | Model is extremely deterministic |
476
+ | `dense_reward/mean` | N/A | 0.10-0.25 | 0.10-0.25 | Completions get partial credit but no variance |
477
+
478
+ ### Evaluation results (all three versions regressed)
479
+
480
+ | Model | JSON parse | Field F1 (raw) | Status |
481
+ |---|---:|---:|---|
482
+ | **SFT Stage 1** | **100%** | **68.7%** | ✅ Best |
483
+ | GRPO v1 | 38% | 0.4% | ❌ Catastrophic |
484
+ | GRPO v2 | 45% | 0.03% | ❌ Catastrophic |
485
+ | GRPO v3 | 40% | 0.08% | ❌ Catastrophic |
486
+
487
+ ### Root cause analysis
488
+
489
+ The fundamental problem is that **GRPO requires variance between completions within each group** to compute advantages. The SFT model has extremely low output entropy (~0.04 nats) because it learned a near-deterministic mapping from prompts to JSON configs. Even at temperature 0.4 with G=4, all 4 completions are nearly identical, producing `frac_reward_zero_std ≈ 1.0` (zero reward variance).
490
+
491
+ With zero variance, the GRPO advantage is zero for all tokens, so:
492
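+ - The LoRA adapters receive almost no reward-driven gradient; the updates that remain come from the few groups with nonzero variance and are effectively noise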
493
+ - Over 200-300 steps, this noise accumulates into drift
494
+ - The model loses its SFT-learned JSON generation ability
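The zero-variance failure can be reproduced with a few lines of arithmetic. The sketch below assumes the standard group-relative advantage, (reward - group mean) / (group std + eps); the use of population standard deviation, the `eps` value, and the helper names are assumptions for illustration, not the trainer's exact code.

```python
import statistics


def group_advantages(rewards, eps=1e-4):
    """Group-relative advantages: (r - mean) / (std + eps) within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]


def frac_zero_std(groups, tol=1e-8):
    """Fraction of groups whose rewards are all (numerically) identical."""
    zero = sum(1 for group in groups if statistics.pstdev(group) <= tol)
    return zero / len(groups)


# G=4 completions that all earned the same partial credit: no learning signal.
identical = group_advantages([0.2, 0.2, 0.2, 0.2])

# A group with some spread produces signed advantages the policy can learn from.
varied = group_advantages([0.1, 0.3])
```

When every reward in a group is equal, each advantage is zero regardless of how high or low the shared reward is, so the group contributes nothing to the policy gradient.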
495
+
496
+ Additionally, the GRPO adapter is trained from random initialization on top of the SFT-merged base. During evaluation, loading `base + GRPO_adapter` (without the SFT merge) produces a model that never learned JSON generation in the first place.
497
+
498
+ ### Why GRPO works for math/reasoning but not for this task
499
+
500
+ GRPO succeeds on math/reasoning (DeepSeekMath, R1) because:
501
+ 1. Math problems have **high output entropy** — many valid reasoning paths
502
+ 2. The reward is **binary** (correct/incorrect) — easy to get variance
503
+ 3. Temperature can be high (0.7-1.0) without generating garbage
504
+
505
+ This structured JSON task is the opposite:
506
+ 1. **Low output entropy** — there's essentially one correct JSON config per prompt
507
+ 2. The reward is **continuous** (F1 score) — small differences, hard to distinguish
508
+ 3. Any temperature > 0 risks degrading the SFT output, whose JSON validity is already at 100%
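The link between a peaked next-token distribution and low entropy can be checked numerically. This is a single-step illustration under assumed logits, not a measurement of the SFT model: `entropy_nats` is a hypothetical helper and both logit vectors are invented to contrast a near-deterministic distribution with a flatter one.

```python
import math


def entropy_nats(logits, temperature=1.0):
    """Shannon entropy (nats) of softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)


# Assumed logits: one token dominates, as in a near-deterministic JSON mapping.
peaked_logits = [12.0, 0.0, 0.0, 0.0]
# Assumed logits: several plausible continuations, as in open-ended reasoning.
flat_logits = [1.0, 0.5, 0.0, -0.5]

peaked_at_04 = entropy_nats(peaked_logits, temperature=0.4)
peaked_at_10 = entropy_nats(peaked_logits, temperature=1.0)
flat_at_10 = entropy_nats(flat_logits, temperature=1.0)
```

For the peaked case, lowering the temperature from 1.0 to 0.4 pushes the entropy even closer to zero, which is consistent with sampled completions being nearly identical.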
509
+
510
+ ### Decision
511
+
512
+ **GRPO is not suitable for this task in its current form.** Stage 1 SFT remains the primary model.
513
+
514
+ ### What would be needed for GRPO to work here (future work)
515
+
516
+ 1. **Best-of-N rejection sampling** instead of GRPO: generate N completions with sampling (temp > 0, since greedy decoding at temp=0 would yield N identical outputs), score them, and fine-tune on the best ones (no within-group variance needed)
517
+ 2. **DPO with synthetic preferences**: generate pairs at temp=0 vs temp=0.3, then have a human or LLM judge select the better one
518
+ 3. **Reward model + PPO** — train a separate reward model, use PPO which doesn't require within-group variance
519
+ 4. **Higher G (16-32)** with temperature 0.8+ — requires multi-GPU (A100x4+) to get enough diversity
520
+ 5. **Constrained decoding** (Outlines/XGrammar) — guarantee JSON validity, then use GRPO only for value selection within valid completions
521
+
522
+ ### Artifacts
523
+
524
+ Scripts (preserved for reproducibility):
525
+ - `scripts/train_grpo.py` (v1)
526
+ - `scripts/train_grpo_v2.py` (v2)
527
+ - `scripts/train_grpo_v3.py` (v3)
528
+
529
+ Hub models (negative results, not for use):
530
+ - https://huggingface.co/nraptisss/Qwen3-8B-TMF921-Intent-GRPO
531
+ - https://huggingface.co/nraptisss/Qwen3-8B-TMF921-Intent-GRPO-v2
532
+ - https://huggingface.co/nraptisss/Qwen3-8B-TMF921-Intent-GRPO-v3
533
+
534
+ ---
535
+
536
+ ## Current open research questions
537
+
538
+ 1. Should O1 NRM be evaluated with a layer-specific semantic evaluator rather than flat field F1?
539
+ 2. Are monitoring/report rows deterministic enough for exact field comparison, or do they require tolerance/semantic scoring?
540
+ 3. Should Gen4 add canonical scenario-level fields to support official validators and cross-layer tuple generation?
541
+ 4. Can official or derived validators be added for TMF921/CAMARA/A1/O1?
542
+ 5. Would Best-of-N rejection sampling or DPO with synthetic preferences improve value fidelity where GRPO failed?
543
+ 6. Would a code-pretrained base model (Qwen2.5-Coder-7B) perform better on structured JSON value assignment?
544
+
545
+ ## Next recommended steps
546
+
547
+ 1. **Best-of-N rejection sampling**: Generate N=8 completions per prompt at temp=0.3, score with the dense reward function, fine-tune on the top-1. No within-group variance needed.
548
+ 2. **Try Qwen2.5-Coder-7B as base**: Code models have better JSON/structured-output priors. Run the same SFT recipe and compare.
549
+ 3. **DPO with synthetic pairs**: Use the SFT model to generate pairs (temp=0 vs temp=0.5), label the better one with the reward function, train DPO.
550
+ 4. Write the final manuscript incorporating GRPO as a documented negative result.
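The Best-of-N step in the list above could be sketched as follows. This is a minimal outline under stated assumptions: `generate` and `score` are hypothetical callables standing in for sampling from the SFT model (for example at temp=0.3) and for the dense reward function, and `min_score` is an added illustrative filter that is not part of the plan above.

```python
def best_of_n(prompt, generate, score, n=8):
    """Sample n completions and return the highest-scoring one with its score."""
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(score(prompt, completion), completion) for completion in candidates]
    best_score, best_completion = max(scored, key=lambda pair: pair[0])
    return best_completion, best_score


def build_sft_dataset(prompts, generate, score, n=8, min_score=0.0):
    """Keep the top-1 completion per prompt as a new supervised training row."""
    rows = []
    for prompt in prompts:
        completion, best_score = best_of_n(prompt, generate, score, n=n)
        if best_score >= min_score:  # hypothetical quality floor
            rows.append({"prompt": prompt, "completion": completion})
    return rows
```

Because selection happens per prompt rather than through a within-group advantage, a prompt whose N samples all score the same still yields a usable training row.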