Add GRPO experiment journal entries (v1, v2, v3) — negative results documented with root cause analysis
PROJECT_JOURNAL.md (changed, +115 -20)
|
@@ -17,6 +17,8 @@ Current primary model: **stage-1 Qwen3-8B QLoRA adapter**.
|
|
| 17 |
|
| 18 |
Stage 2 status: **diagnostic / not promoted**.
|
| 19 |
|
| 20 |
Best stage-1 normalized metrics:
|
| 21 |
|
| 22 |
| Split | JSON parse | Normalized field F1 | Normalized key F1 |
|
|
@@ -347,26 +349,6 @@ Current publication-ready assets:
|
|
| 347 |
|
| 348 |
---
|
| 349 |
|
| 350 |
-
## Current open research questions
|
| 351 |
-
|
| 352 |
-
1. Should O1 NRM be evaluated with a layer-specific semantic evaluator rather than flat field F1?
|
| 353 |
-
2. Are monitoring/report rows deterministic enough for exact field comparison, or do they require tolerance/semantic scoring?
|
| 354 |
-
3. Should Gen4 add canonical scenario-level fields to support official validators and cross-layer tuple generation?
|
| 355 |
-
4. Can official or derived validators be added for TMF921/CAMARA/A1/O1?
|
| 356 |
-
|
| 357 |
-
## Next recommended step
|
| 358 |
-
|
| 359 |
-
Write the first manuscript draft using:
|
| 360 |
-
|
| 361 |
-
- `paper/outline.md`,
|
| 362 |
-
- `paper/tables.md`,
|
| 363 |
-
- `PROJECT_JOURNAL.md`,
|
| 364 |
-
- `results/stage1_vs_stage2_comparison.md`,
|
| 365 |
-
- `results/baselines/zero_shot_vs_finetuned.md`,
|
| 366 |
-
- `analysis/stage1_examples/failure_examples.md`.
|
| 367 |
-
|
| 368 |
-
---
|
| 369 |
-
|
| 370 |
## 2026-05-07 — O1/A1 semantic evaluator results added
|
| 371 |
|
| 372 |
### Goal
|
|
@@ -453,3 +435,116 @@ Artifacts added:
|
|
| 453 |
- `results/semantic/o1_a1_stage1_vs_stage2.md`
|
| 454 |
- `results/semantic/o1_a1_stage1_vs_stage2_summary.json`
|
| 455 |
|
| 17 |
|
| 18 |
Stage 2 status: **diagnostic / not promoted**.
|
| 19 |
|
| 20 |
+
Stage 3 (GRPO) status: **failed — negative result documented**.
|
| 21 |
+
|
| 22 |
Best stage-1 normalized metrics:
|
| 23 |
|
| 24 |
| Split | JSON parse | Normalized field F1 | Normalized key F1 |
|
|
|
|
| 349 |
|
| 350 |
---
|
| 351 |
|
| 352 |
## 2026-05-07 — O1/A1 semantic evaluator results added
|
| 353 |
|
| 354 |
### Goal
|
|
|
|
| 435 |
- `results/semantic/o1_a1_stage1_vs_stage2.md`
|
| 436 |
- `results/semantic/o1_a1_stage1_vs_stage2_summary.json`
|
| 437 |
|
| 438 |
+
---
|
| 439 |
+
|
| 440 |
+
## 2026-05-12 — GRPO post-SFT experiments (v1, v2, v3) — negative result
|
| 441 |
+
|
| 442 |
+
### Goal
|
| 443 |
+
|
| 444 |
+
Apply Group Relative Policy Optimization (GRPO) as a post-SFT stage to improve value fidelity on the weak layers (O1 NRM, A1 policy). The approach follows the RL-Struct paper (arXiv:2512.00319), which reported that GRPO with a multi-component reward improves structured output quality by 15-20% over SFT alone.
|
| 445 |
+
|
| 446 |
+
### Approach
|
| 447 |
+
|
| 448 |
+
Three iterations were attempted, each fixing issues from the previous:
|
| 449 |
+
|
| 450 |
+
**v1** (`scripts/train_grpo.py`):
|
| 451 |
+
- Config: G=2, max_completion=768, temp=0.7, beta=0.01, lr=1e-5
|
| 452 |
+
- 3 separate reward functions (validity, key F1, value F1) weighted 0.1/0.2/0.7
|
| 453 |
+
- Result: Model catastrophically forgot JSON generation. JSON parse rate dropped from 100% to 38%. Field F1 dropped from 68.7% to 0.4%.
|
| 454 |
+
- Root cause: sampling at temp=0.7 produced garbage completions, and beta=0.01 was too weak a KL penalty to prevent drift.
|
| 455 |
+
|
| 456 |
+
**v2** (`scripts/train_grpo_v2.py`):
|
| 457 |
+
- Fixes: temp=0.3, beta=0.1, single dense reward with partial credit, all rewards ≥ 0, lr=5e-6
|
| 458 |
+
- Config: G=2, max_completion=512, dense reward function
|
| 459 |
+
- Result: Training was stable (no divergence), but model still regressed. JSON parse rate 45%, field F1 ≈ 0%.
|
| 460 |
+
- Root cause: `completions/clipped_ratio` was 0.5-1.0 throughout training. The 512-token limit is shorter than the target configs in the dataset (700-1300 tokens), so completions were cut off before they could form valid JSON.
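The truncation problem in v2 can be caught before training by comparing reference lengths with the completion budget. The following is a minimal sketch, not part of the repository: `truncated_fraction` and the example token counts are hypothetical, chosen only to span the 700-1300 token range quoted above.

```python
def truncated_fraction(reference_lengths, max_completion):
    """Fraction of reference outputs that cannot fit in the completion budget."""
    if not reference_lengths:
        raise ValueError("reference_lengths must be non-empty")
    too_long = sum(1 for n in reference_lengths if n > max_completion)
    return too_long / len(reference_lengths)


# Hypothetical token counts spanning the 700-1300 range reported for the dataset.
example_lengths = [700, 850, 1000, 1150, 1300]

frac_512 = truncated_fraction(example_lengths, 512)    # v2 budget
frac_1536 = truncated_fraction(example_lengths, 1536)  # v3 budget
print(f"512 tokens: {frac_512:.0%} truncated, 1536 tokens: {frac_1536:.0%} truncated")
```

In practice the lengths would come from tokenizing the reference configs with the model's own tokenizer; a non-zero fraction at the chosen budget indicates that some targets can never be completed.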
|
| 461 |
+
|
| 462 |
+
**v3** (`scripts/train_grpo_v3.py`):
|
| 463 |
+
- Fix: max_completion=1536 (full length), G=4, temp=0.4
|
| 464 |
+
- GPU memory use in v2 was only 6.1 GB of 48 GB, which left ample headroom for longer completions and a larger group
|
| 465 |
+
- Result: Same regression. JSON parse rate 40%, field F1 ≈ 0%.
|
| 466 |
+
- Root cause: `frac_reward_zero_std` remained 0.8-1.0 throughout all 300 steps. The model's output entropy is extremely low (0.03-0.06), meaning even with G=4 and temp=0.4, all completions are nearly identical → same reward → zero GRPO gradient signal.
|
| 467 |
+
|
| 468 |
+
### Key metrics from training logs
|
| 469 |
+
|
| 470 |
+
| Metric | v1 | v2 | v3 | Interpretation |
|---|---|---|---|---|
| GPU memory | 6.1 GB | 6.1 GB | 6.1 GB | VRAM was never the bottleneck |
| `frac_reward_zero_std` | 0.7-1.0 | 0.8-1.0 | 0.8-1.0 | No variance → no learning |
| `clipped_ratio` | 0.3-0.5 | 0.5-1.0 | 0.3-0.7 | v2 was worst (512 too short) |
| `entropy` | 0.08-0.17 | 0.03-0.06 | 0.03-0.06 | Model is extremely deterministic |
| `dense_reward/mean` | N/A | 0.10-0.25 | 0.10-0.25 | Completions get partial credit but no variance |
|
| 477 |
+
|
| 478 |
+
### Evaluation results (all three versions regressed)
|
| 479 |
+
|
| 480 |
+
| Model | JSON parse | Field F1 (raw) | Status |
|---|---:|---:|---|
| **SFT Stage 1** | **100%** | **68.7%** | ✅ Best |
| GRPO v1 | 38% | 0.4% | ❌ Catastrophic |
| GRPO v2 | 45% | 0.03% | ❌ Catastrophic |
| GRPO v3 | 40% | 0.08% | ❌ Catastrophic |
|
| 486 |
+
|
| 487 |
+
### Root cause analysis
|
| 488 |
+
|
| 489 |
+
The fundamental problem is that **GRPO requires variance between completions within each group** to compute advantages. The SFT model has extremely low output entropy (~0.04 nats) because it learned a near-deterministic mapping from prompts to JSON configs. Even at temperature 0.4 with G=4, the 4 completions in a group are nearly identical, so `frac_reward_zero_std` stayed at 0.8-1.0: in most groups the reward variance is exactly zero.
|
| 490 |
+
|
| 491 |
+
With zero variance, the GRPO advantage is zero for all tokens, so:
|
| 492 |
+
- The LoRA adapters receive almost no reward signal, so their updates are effectively noise
|
| 493 |
+
- Over 200-300 steps, this noise accumulates into drift
|
| 494 |
+
- The model loses its SFT-learned JSON generation ability
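The zero-variance argument above can be made concrete with a small sketch of group-relative advantages. This is an illustration under stated assumptions, not the training code: `group_advantages` and `frac_zero_std` are hypothetical helpers, the population standard deviation and the `eps` value are assumptions, and the exact normalization used by a given GRPO implementation may differ.

```python
import statistics


def group_advantages(rewards, eps=1e-4):
    """Group-relative advantage: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


def frac_zero_std(groups, tol=1e-8):
    """Fraction of groups whose rewards are all (numerically) identical."""
    zero = sum(1 for g in groups if statistics.pstdev(g) <= tol)
    return zero / len(groups)


# Four near-identical completions earn the same partial-credit reward.
identical = [0.2, 0.2, 0.2, 0.2]
# A group with real diversity, for contrast.
varied = [0.1, 0.2, 0.3, 0.4]

adv_identical = group_advantages(identical)
adv_varied = group_advantages(varied)
batch_fraction = frac_zero_std([identical, identical, identical, varied])

print(adv_identical)   # every advantage is zero, so the group contributes no gradient
print(adv_varied)      # below-mean rewards are negative, above-mean are positive
print(batch_fraction)  # share of groups that carry no learning signal
```

When every reward in a group equals the group mean, the numerator is zero for each completion regardless of how large or small the shared reward is, which is why partial credit alone does not help.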
|
| 495 |
+
|
| 496 |
+
Additionally, the GRPO adapter is trained from random initialization on top of the SFT-merged base. During evaluation, loading `base + GRPO_adapter` (without the SFT merge) produces a model that never learned JSON generation in the first place.
|
| 497 |
+
|
| 498 |
+
### Why GRPO works for math/reasoning but not for this task
|
| 499 |
+
|
| 500 |
+
GRPO succeeds on math/reasoning (DeepSeekMath, R1) because:
|
| 501 |
+
1. Math problems have **high output entropy** — many valid reasoning paths
|
| 502 |
+
2. The reward is **binary** (correct/incorrect) — easy to get variance
|
| 503 |
+
3. Temperature can be high (0.7-1.0) without generating garbage
|
| 504 |
+
|
| 505 |
+
This structured JSON task is the opposite:
|
| 506 |
+
1. **Low output entropy** — there's essentially one correct JSON config per prompt
|
| 507 |
+
2. The reward is **continuous** (F1 score) — small differences, hard to distinguish
|
| 508 |
+
3. Any temperature > 0 risks degrading the SFT output, which already parses as valid JSON 100% of the time
|
| 509 |
+
|
| 510 |
+
### Decision
|
| 511 |
+
|
| 512 |
+
**GRPO is not suitable for this task in its current form.** Stage 1 SFT remains the primary model.
|
| 513 |
+
|
| 514 |
+
### What would be needed for GRPO to work here (future work)
|
| 515 |
+
|
| 516 |
+
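1. **Best-of-N rejection sampling** instead of GRPO: generate N completions at a low non-zero temperature (greedy decoding at temp=0 would return N identical completions), score them, and fine-tune on the best ones. This needs no group-relative advantage, so zero-variance groups do no harm.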
|
| 517 |
+
2. **DPO with synthetic preferences**: generate pairs at temp=0 vs temp=0.3, then have a human or LLM judge select the better one
|
| 518 |
+
3. **Reward model + PPO** — train a separate reward model, use PPO which doesn't require within-group variance
|
| 519 |
+
4. **Higher G (16-32)** with temperature 0.8+ — requires multi-GPU (A100x4+) to get enough diversity
|
| 520 |
+
5. **Constrained decoding** (Outlines/XGrammar) — guarantee JSON validity, then use GRPO only for value selection within valid completions
|
| 521 |
+
|
| 522 |
+
### Artifacts
|
| 523 |
+
|
| 524 |
+
Scripts (preserved for reproducibility):
|
| 525 |
+
- `scripts/train_grpo.py` (v1)
|
| 526 |
+
- `scripts/train_grpo_v2.py` (v2)
|
| 527 |
+
- `scripts/train_grpo_v3.py` (v3)
|
| 528 |
+
|
| 529 |
+
Hub models (negative results, not for use):
|
| 530 |
+
- https://huggingface.co/nraptisss/Qwen3-8B-TMF921-Intent-GRPO
|
| 531 |
+
- https://huggingface.co/nraptisss/Qwen3-8B-TMF921-Intent-GRPO-v2
|
| 532 |
+
- https://huggingface.co/nraptisss/Qwen3-8B-TMF921-Intent-GRPO-v3
|
| 533 |
+
|
| 534 |
+
---
|
| 535 |
+
|
| 536 |
+
## Current open research questions
|
| 537 |
+
|
| 538 |
+
1. Should O1 NRM be evaluated with a layer-specific semantic evaluator rather than flat field F1?
|
| 539 |
+
2. Are monitoring/report rows deterministic enough for exact field comparison, or do they require tolerance/semantic scoring?
|
| 540 |
+
3. Should Gen4 add canonical scenario-level fields to support official validators and cross-layer tuple generation?
|
| 541 |
+
4. Can official or derived validators be added for TMF921/CAMARA/A1/O1?
|
| 542 |
+
5. Would Best-of-N rejection sampling or DPO with synthetic preferences improve value fidelity where GRPO failed?
|
| 543 |
+
6. Would a code-pretrained base model (Qwen2.5-Coder-7B) perform better on structured JSON value assignment?
|
| 544 |
+
|
| 545 |
+
## Next recommended steps
|
| 546 |
+
|
| 547 |
+
1. **Best-of-N rejection sampling**: Generate N=8 completions per prompt at temp=0.3, score with the dense reward function, fine-tune on the top-1. No within-group variance needed.
|
| 548 |
+
2. **Try Qwen2.5-Coder-7B as base**: Code models have better JSON/structured-output priors. Run the same SFT recipe and compare.
|
| 549 |
+
3. **DPO with synthetic pairs**: Use the SFT model to generate pairs (temp=0 vs temp=0.5), label the better one with the reward function, train DPO.
|
| 550 |
+
4. Write the final manuscript incorporating GRPO as a documented negative result.
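The selection step of the Best-of-N proposal (step 1 above) might look like the sketch below. It is a hedged illustration only: `flatten`, `field_f1`, and `best_of_n` are hypothetical helpers, the exact-match field F1 stands in for the project's dense reward function (which is not shown in this journal), and the JSON field names are invented for the example.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested JSON into {dotted.path: leaf_value} pairs."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}{index}."))
    else:
        items[prefix.rstrip(".")] = obj
    return items


def field_f1(completion, reference):
    """Exact-match F1 over flattened fields; unparseable output scores 0."""
    try:
        predicted = flatten(json.loads(completion))
    except (json.JSONDecodeError, TypeError):
        return 0.0
    expected = flatten(reference)
    matches = sum(1 for k, v in predicted.items() if k in expected and expected[k] == v)
    if matches == 0:
        return 0.0
    precision = matches / len(predicted)
    recall = matches / len(expected)
    return 2 * precision * recall / (precision + recall)


def best_of_n(completions, reference):
    """Return (score, completion) for the highest-scoring sample."""
    scored = [(field_f1(c, reference), c) for c in completions]
    return max(scored, key=lambda pair: pair[0])


# Hypothetical reference config and three sampled completions.
reference = {"intent": {"latencyMs": 10, "sliceType": "URLLC"}}
candidates = [
    '{"intent": {"latencyMs": 20, "sliceType": "URLLC"}}',  # one wrong value
    '{"intent": {"latencyMs": 10, "sliceType": "URLLC"}}',  # exact match
    '{"intent": {"latencyMs": 10',                          # truncated JSON
]

best_score, best_completion = best_of_n(candidates, reference)
print(best_score, best_completion)
```

Only the selected completions would be kept as fine-tuning targets. If all N samples for a prompt score the same, the prompt adds nothing beyond the existing SFT data, so it may be worth logging how often that happens.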
|