Tags: PEFT, qlora, sft, trl, qwen3, tmf921, intent-based-networking, network-slicing, rtx-6000-ada, ml-intern
Commit 7a41ceb by nraptisss · verified · 1 Parent(s): 9b7e923

Confirm GRPO negative result with correct evaluation (SFT-merged + GRPO adapter). Result validated.

Files changed (1): results/grpo_verification.md +51 -0
results/grpo_verification.md ADDED
@@ -0,0 +1,51 @@
+ # GRPO Evaluation Verification
+
+ **Date:** 2026-05-13
+
+ ## Issue Identified
+
+ Previous GRPO evaluations loaded `base Qwen3-8B + GRPO adapter`. That stack was incorrect: the GRPO adapter was trained on top of `SFT-merged Qwen3-8B`, so applying it to the unmodified base model could by itself have produced misleading results.
+
+ ## Correct Evaluation
+
+ Re-evaluated with the correct model stack:
+ ```
+ Qwen3-8B (4-bit) → + SFT adapter (merged) → + GRPO v3 adapter
+ ```
+
+ Script: `scripts/evaluate_grpo_correct.py`
+
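The stack described above could be assembled roughly as follows. This is a minimal sketch and not the contents of `scripts/evaluate_grpo_correct.py`: the function name, the adapter directory arguments, and the NF4/bfloat16 quantization settings are assumptions made for illustration. It relies on `PeftModel.from_pretrained` and `merge_and_unload` from PEFT and on `BitsAndBytesConfig` from Transformers.

```python
def load_grpo_eval_stack(base_id, sft_adapter_dir, grpo_adapter_dir):
    """Build base (4-bit) -> merged SFT adapter -> GRPO adapter, for evaluation.

    All three arguments are hypothetical placeholders: a model ID or path for
    the base model and two local directories holding the LoRA adapters.
    """
    # Heavy dependencies are imported lazily, so importing this module
    # does not require a GPU software stack.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Assumed QLoRA-style 4-bit settings; the source only says "4-bit".
    quant = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    base = AutoModelForCausalLM.from_pretrained(
        base_id, quantization_config=quant, device_map="auto"
    )

    # Step 1: apply the SFT adapter and fold it into the base weights.
    sft = PeftModel.from_pretrained(base, sft_adapter_dir)
    merged = sft.merge_and_unload()

    # Step 2: stack the GRPO adapter on the SFT-merged model,
    # which is the model it was trained against.
    model = PeftModel.from_pretrained(merged, grpo_adapter_dir)
    return model.eval()
```

Skipping step 1 reproduces the earlier mistake, because the GRPO adapter's low-rank updates are then added to weights it never saw during training. Merging into 4-bit weights can also introduce small rounding differences, so merging in higher precision before quantizing may be preferable if memory allows.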
+ ## Results (Correct Stack)
+
+ | Split | Parse Rate | Field F1 | Exact Match |
+ |---|---:|---:|---:|
+ | test_in_distribution | 0.750 | 0.0057 | 0.0000 |
+ | test_template_ood | 0.750 | 0.0007 | 0.0000 |
+ | test_use_case_ood | 0.720 | 0.0034 | 0.0000 |
+ | test_sector_ood | 0.740 | 0.0007 | 0.0000 |
+ | test_adversarial | 0.000 | 0.0000 | 0.0000 |
+
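The three columns in the table above can coexist in this pattern (high parse rate, near-zero Field F1) when outputs are valid JSON whose fields do not match the reference. The sketch below shows one plausible way to compute such metrics per example. The metric definitions are assumptions, since the source does not define them: here Field F1 is taken over flattened `(path, value)` leaf pairs, and `flatten` and `score` are hypothetical helper names.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested JSON into a set of (path, serialized leaf value) pairs."""
    items = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            items |= flatten(value, f"{prefix}/{key}")
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items |= flatten(value, f"{prefix}/{index}")
    else:
        items.add((prefix, json.dumps(obj)))
    return items


def score(prediction_text, reference):
    """Score one model output (raw text) against a reference JSON object."""
    try:
        predicted = json.loads(prediction_text)
    except json.JSONDecodeError:
        # Unparseable output counts against parse rate and scores zero elsewhere.
        return {"parsed": False, "field_f1": 0.0, "exact_match": False}

    pred_fields, ref_fields = flatten(predicted), flatten(reference)
    true_positives = len(pred_fields & ref_fields)
    precision = true_positives / len(pred_fields) if pred_fields else 0.0
    recall = true_positives / len(ref_fields) if ref_fields else 0.0
    f1 = (
        2 * precision * recall / (precision + recall) if true_positives else 0.0
    )
    return {"parsed": True, "field_f1": f1, "exact_match": predicted == reference}
```

Averaging `parsed`, `field_f1`, and `exact_match` over a split would give the three reported columns under these assumed definitions.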
+ ## Comparison with SFT Stage 1
+
+ | Metric | SFT Stage 1 | GRPO v3 (correct eval) | Delta |
+ |---|---:|---:|---:|
+ | JSON parse rate (ID) | 1.000 | 0.750 | -0.250 |
+ | Field F1 (ID) | 0.687 | 0.006 | -0.681 |
+ | Adversarial parse | 1.000 | 0.000 | -1.000 |
+
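The Delta column is GRPO minus SFT for each metric. A small sketch that recomputes it from the two value columns is shown here as a consistency check; the dictionary keys and the `deltas` helper are hypothetical names, while the numbers are copied from the table.

```python
# Values copied from the comparison table; keys are illustrative names.
sft_stage1 = {"parse_id": 1.000, "field_f1_id": 0.687, "parse_adversarial": 1.000}
grpo_v3 = {"parse_id": 0.750, "field_f1_id": 0.006, "parse_adversarial": 0.000}


def deltas(before, after, ndigits=3):
    """Return after - before per metric, rounded to the table's precision."""
    return {name: round(after[name] - before[name], ndigits) for name in before}


for name, delta in deltas(sft_stage1, grpo_v3).items():
    print(f"{name}: {delta:+.3f}")
```

Note that the Field F1 entry uses the rounded value 0.006 from this table rather than the 0.0057 reported in the per-split results.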
+ ## Conclusion
+
+ **GRPO genuinely degraded performance**, even with the correct model stack. The negative result is confirmed and is not an artifact of the earlier evaluation error.
+
+ The GRPO adapter learned noise rather than useful value-fidelity improvements, because:
+ 1. `frac_reward_zero_std` was 0.8-1.0 throughout training (zero reward variance within a group → zero advantage → zero policy-gradient signal)
+ 2. Output entropy was 0.03-0.06 nats (model too deterministic for GRPO exploration)
+ 3. The only gradient came from KL regularization noise, which accumulated over 300 steps into destructive weight updates
+
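The first reason in the list above follows from how GRPO normalizes rewards within each group of sampled completions. The sketch below illustrates that mechanism in plain Python. It is not TRL's implementation: the function names, the use of the population standard deviation, and the `eps` value are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Group-relative advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def frac_reward_zero_std(groups):
    """Fraction of prompt groups whose completions all received the same reward."""
    return sum(1 for group in groups if pstdev(group) == 0) / len(groups)


# A group where every completion earns the same reward gives all-zero advantages,
# so it contributes no policy-gradient signal.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))

# A group with mixed rewards gives non-zero advantages.
print(group_advantages([0.0, 1.0, 1.0, 1.0]))
```

When 80 to 100 percent of groups fall into the first case, almost every update step has only the KL term left to drive the weights, which matches the third reason in the list.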
+ ## Verified Claims for Paper
+
+ All of these are now scientifically supported:
+ - ✅ "GRPO fails for this task": confirmed with correct evaluation
+ - ✅ "Entropy collapse prevents advantage estimation": training logs show `frac_reward_zero_std` ≈ 1.0
+ - ✅ "Model is too deterministic for GRPO": entropy 0.03-0.06 nats vs typical RL tasks at 1.0+
+ - ✅ "SFT Stage 1 remains the best model": 100% parse, 68.7% F1 vs GRPO's 75% parse, 0.6% F1
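To give a sense of scale for the entropy figures quoted above, the following sketch computes Shannon entropy in nats for a few toy next-token distributions. The distributions are illustrative assumptions, not values taken from the training logs, and `entropy_nats` is a hypothetical helper name.

```python
import math


def entropy_nats(probs):
    """Shannon entropy H = -sum(p * ln p) in nats, skipping zero-probability terms."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy two-outcome distributions: almost all mass on one token.
print(entropy_nats([0.995, 0.005]))  # roughly 0.03 nats
print(entropy_nats([0.99, 0.01]))    # roughly 0.06 nats

# For comparison, a uniform choice among three tokens is just above 1.0 nats.
print(entropy_nats([1 / 3, 1 / 3, 1 / 3]))
```

Under this toy model, entropy in the 0.03-0.06 range corresponds to the top token holding about 99 to 99.5 percent of the probability mass, so sampled completions within a group are likely to be near-identical and to receive the same reward.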