bobber committed on
Commit 5464b79 · verified · 1 Parent(s): c3af311

Move docs to bobber/lex-fridman-interviewer-project

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. docs/AUTONOMOUS_SESSION_2026-03-30.md +0 -219
  2. docs/CURRENT_STATE_2026-03-20.md +0 -95
  3. docs/CURRENT_STATE_2026-03-23.md +0 -108
  4. docs/CURRENT_STATE_2026-03-26.md +0 -304
  5. docs/CURRENT_STATE_2026-03-29.md +0 -150
  6. docs/CURRENT_STATE_2026-03-30-evening.md +0 -81
  7. docs/CURRENT_STATE_2026-03-30.md +0 -159
  8. docs/DATA_CURATION_PLAN.md +0 -131
  9. docs/EVAL_FRAMEWORK_2026-03-29.md +0 -148
  10. docs/EVAL_RESULTS.md +0 -319
  11. docs/FULL_FINETUNE_PLAN_2026-03-20.md +0 -76
  12. docs/FUNCTIONAL_EVAL_DESIGN.md +0 -134
  13. docs/GRPO_V11_DESIGN.md +0 -141
  14. docs/GRPO_V11_POSTMORTEM.md +0 -120
  15. docs/GRPO_V21_PLAN.md +0 -78
  16. docs/GRPO_V21_SUCCESS_ANALYSIS.md +0 -152
  17. docs/GRPO_V22_PLAN.md +0 -61
  18. docs/GRPO_V23_PLAN.md +0 -68
  19. docs/GRPO_V24_PLAN.md +0 -33
  20. docs/GRPO_V3_POSTMORTEM.md +0 -221
  21. docs/GRPO_V4_DESIGN.md +0 -423
  22. docs/GRPO_V4_POSTMORTEM.md +0 -129
  23. docs/GRPO_V7_DESIGN.md +0 -164
  24. docs/GRPO_V8_CHANGES.md +0 -55
  25. docs/GRPO_V8_ONPOLICY_PLAN.md +0 -320
  26. docs/GRPO_V8_TRAINING_FLOW.md +0 -184
  27. docs/KAGGLE_VS_OURS_COMPARISON.md +0 -106
  28. docs/LEXFRIDMAN_INTERVIEWER_PLAN.md +0 -183
  29. docs/LLAMA_FINETUNE_INVESTIGATION.md +0 -44
  30. docs/LORA_V1_ANALYSIS.md +0 -109
  31. docs/LORA_V2_NATIVE_RESULTS.md +0 -66
  32. docs/MAMBA_SSM_BUILD_NOTES.md +0 -114
  33. docs/NEMOTRON_GB10_DEEP_DIVE.md +0 -321
  34. docs/NEMO_RL_SETUP_NOTES.md +0 -168
  35. docs/ONNX_RETROSPECTIVE.md +0 -404
  36. docs/OPTION2_SFT_DISTILLATION_PLAN.md +0 -193
  37. docs/RETROSPECTIVE_2026-03-31.md +0 -168
  38. docs/REWARD_V10_DESIGN.md +0 -175
  39. docs/REWARD_V11_DESIGN.md +0 -187
  40. docs/REWARD_V13_DESIGN.md +0 -106
  41. docs/RL_VS_FILTERING_ANALYSIS_2026-03-30.md +0 -107
  42. docs/SSM_SCAN_FIX_PLAN.md +0 -202
  43. docs/SYNTHETIC_DATA_ANALYSIS_2026-03-30.md +0 -153
  44. docs/TECHNICAL_CHALLENGES.md +0 -181
  45. docs/TECHNICAL_REVIEW_2026-03-23.md +0 -205
  46. docs/TRAINING_PLAN_V5.md +0 -143
  47. docs/TRITON_PIPELINE_FIX.md +0 -137
  48. docs/TRITON_SSM_SCAN_PLAN.md +0 -114
  49. docs/VENV_SETUP.md +0 -118
  50. docs/VLLM_SETUP_NOTES.md +0 -146
docs/AUTONOMOUS_SESSION_2026-03-30.md DELETED
@@ -1,219 +0,0 @@
1
- # Autonomous Session: 2026-03-30 05:44 UTC to 13:00 UTC (9 AM EDT)
2
-
3
- ## Authorization
4
- Bobber authorized autonomous decisions at 05:44 UTC.
5
- All decisions, assumptions, and results logged here in real time.
6
-
7
- ## Starting State
8
- - **SFT v5 training:** running, step ~40/897, loss 8.2, ETA ~07:45 UTC
9
- - **W&B run:** `lex-sft-v5-4k-bnb8bit` → https://wandb.ai/bobber-cheng/lex-interviewer/runs/udqlwz88
10
- - **Base model benchmark:** 0.653 ± 0.333 (3-judge functional eval)
11
- - **Previous best SFT:** 0.467 (v4, 201 pairs — catastrophic forgetting)
12
-
13
- ## Decision Framework
14
-
15
- ### After eval completes:
16
-
17
- | Result | Interpretation | Action |
18
- |--------|---------------|--------|
19
- | score > 0.70 | Clear improvement — training working | Launch GRPO from sft-v5 checkpoint with reward_v11 |
20
- | 0.653–0.70 | Marginal improvement | Run LoRA variant (r=64, same data) to compare |
21
- | 0.60–0.653 | Slight degradation | Investigate: check if generated pairs (4,075) are hurting. Retrain on real Lex only (697 pairs) |
22
- | < 0.60 | Significant degradation | Likely overfitting. Check loss curve — if loss < 5, training ran too long. Try 1 epoch only |
23
- | Gibberish / < 0.30 | Complete failure | Model merge issue. Check generated questions manually first |
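The thresholds in the table can be read as a single lookup. The following is a minimal sketch, assuming the rows are checked from the highest score down and that a score sitting exactly on a boundary falls into the lower bucket. `decide` is a hypothetical name used for illustration; it is not taken from `auto_eval_and_decide.py`.

```python
def decide(score: float) -> str:
    """Map a functional-eval score to the action listed in the decision table."""
    if score > 0.70:
        return "launch GRPO from sft-v5 checkpoint with reward_v11"
    if score >= 0.653:
        return "run LoRA variant (r=64, same data)"
    if score >= 0.60:
        return "retrain on real Lex pairs only (697 pairs)"
    if score >= 0.30:
        return "check loss curve; try 1 epoch only"
    # Below 0.30 the table treats the output as gibberish.
    return "inspect generated questions manually"
```

Under these assumptions a score only slightly above the 0.653 baseline lands in the "run LoRA variant" bucket rather than triggering GRPO.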
24
-
25
- ### Key assumptions:
26
- 1. Current loss trajectory (~8 at step 40, should reach 5-7 by step 897) = healthy full convergence
27
- 2. The 4,075 generated pairs are useful if they passed 3/3 judges — not perfect but signal
28
- 3. 3-judge eval on contaminated held-out set is the only eval available; score differences > 0.05 are meaningful
29
- 4. GRPO is the right follow-up IF SFT beats base — it starts from a stronger position
30
-
31
- ---
32
-
33
- ## Actions Log
34
-
35
- ### 05:44 UTC — Session start
36
- - Verified training running: step ~40, loss 8.2, GPU 96%, ETA ~07:45 UTC
37
- - Will not restart training (5 restarts already, converging well, let it finish)
38
- - Set automation for: eval on completion, decision logging, doc sync
39
-
40
- ### [In progress] Training monitoring
41
- - Checking every 30 min
42
- - Will auto-run eval when models/sft-v5 directory is populated
43
-
44
- ### 06:22 UTC — Monitor check (cron)
45
- - Training still running: step 285/897, epoch 0.95, loss 4.79
46
- - Loss trajectory: 8.2→4.8 over ~285 steps — healthy convergence
47
- - GPU utilization: 96%
48
- - Rate: ~5.3 steps/min → ETA completion: ~08:15 UTC (4:15 AM EDT)
49
- - Last 5 logged losses: 3.69, 4.25, 4.40, 4.46, 4.79 (some variance but trending down overall)
50
- - No action needed — training proceeding normally
51
-
52
- ---
53
-
54
- *This document is updated in real time as decisions are made.*
55
- *Created: 2026-03-30 05:44 UTC*
56
-
57
- ### 06:22 UTC — Monitor: training still running
58
- - Latest W&B run: run-20260330_052927-udqlwz88
59
- - Last metrics: {'loss': '4.464', 'grad_norm': '29.75', 'learning_rate': '8.099e-06', 'epoch': '0.938'}
60
- - Tmux progress: 32%|███▏ | 284/897 [52:27<1:39:24, 9.73s/it]
61
-
62
- ### 06:42 UTC — Monitor: training still running
63
- - Latest W&B run: run-20260330_052927-udqlwz88
64
- - Last metrics: {'loss': '2.802', 'grad_norm': '30.5', 'learning_rate': '6.246e-06', 'epoch': '1.322'}
65
- - Tmux progress: 44%|████▍ | 397/897 [1:12:32<1:20:51, 9.70s/it]
66
-
67
- ### 07:01 UTC — Monitor: training still running
68
- - Step 502/897 (56%), epoch 1.67, loss ~2.7-2.9
69
- - Loss has stabilized in 2.5-3.1 range (down from 8.2 at start, 4.8 at step 285)
70
- - "Writing model shards 100%" was a mid-training checkpoint save, not final — training resumed after
71
- - Rate: ~5.3 steps/min → ETA completion: ~08:15-08:30 UTC (4:15-4:30 AM EDT)
72
- - Recent losses: 2.703, 2.953, 3.115, 3.090, 2.902, 2.559, 2.978, 2.807, 2.667
73
- - No action needed — training proceeding normally, checkpoint saving working
74
-
75
- ### 07:21 UTC — Monitor: training still running
76
- - Latest W&B run: run-20260330_052927-udqlwz88
77
- - Last metrics: {'loss': '2.109', 'grad_norm': '21.5', 'learning_rate': '2.484e-06', 'epoch': '2.04'}
78
- - Tmux progress: 68%|██████▊ | 613/897 [1:52:30<43:00, 9.09s/it]
79
-
80
- ### 07:42 UTC — Monitor: training still running
81
- - Latest W&B run: run-20260330_052927-udqlwz88
82
- - Last metrics: {'loss': '1.855', 'grad_norm': '24.75', 'learning_rate': '1.004e-06', 'epoch': '2.409'}
83
- - Tmux progress: 80%|████████ | 722/897 [2:12:26<28:07, 9.64s/it]
84
-
85
- ### 07:42 UTC — Monitor check (cron)
86
- - Training still running: step 724/897 (81%), epoch ~2.41, loss 1.86
87
- - Loss trajectory: 8.2 → 4.8 → 2.8 → 1.86 — steady convergence across 3 epochs
88
- - Learning rate near zero (1e-6), final cooldown phase
89
- - Rate: ~9.6s/step → ETA completion: ~08:10 UTC (4:10 AM EDT)
90
- - ~173 steps remaining (~28 min)
91
- - No action needed — training in final stretch, model save + auto-eval expected shortly after
92
-
93
- ### 08:02 UTC — Monitor: training still running
94
- - Latest W&B run: run-20260330_052927-udqlwz88
95
- - Last metrics: {'loss': '2.008', 'grad_norm': '21.62', 'learning_rate': '1.297e-07', 'epoch': '2.794'}
96
- - Tmux progress: 93%|█████████▎| 835/897 [2:32:30<09:17, 8.99s/it]
97
-
98
- ### 08:01 UTC — Monitor check (cron)
99
- - Training still running: step 836/897 (93%), epoch ~2.79, loss 2.01
100
- - Learning rate near zero (1.3e-7), final steps
101
- - Rate: ~9.1s/step → ETA completion: ~08:10 UTC (4:10 AM EDT)
102
- - ~61 steps remaining (~9 min)
103
- - models/sft-v5 not yet saved — training hasn't finished writing
104
- - functional_judge_sft_v5_vs_base.json does NOT exist yet
105
- - No action needed — training in final stretch, auto-eval will trigger on next monitor cycle
106
-
107
- ### 08:22 UTC — Training finished; launching auto eval + decision script
108
-
109
- ### 08:22 UTC — Auto-eval session starting
110
-
111
- ### 08:22 UTC — Model found: 1 safetensors files
112
-
113
- ### 08:22 UTC — Running functional judge eval: sft-v5 vs base
114
-
115
- ### 08:22 UTC — Auto-eval failed (LoRA detection bug)
116
- - `auto_eval_and_decide.py` called `eval_functional_judge.py` with absolute path `/home/bobber/lex-ft/models/sft-v5`
117
- - `eval_functional_judge.py` used `model_path.startswith('models/')` to detect full fine-tune vs LoRA
118
- - Absolute path didn't match `models/` prefix → classified as LoRA → vLLM `enable_lora=True` → crash
119
- - **Fix applied:** Added `adapter_config.json` presence check + `model.safetensors` presence check for robust detection
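The fix described above replaces a path-string check with a check on the files in the checkpoint directory. A minimal sketch of that idea follows, assuming a single-file `model.safetensors` checkpoint (sharded checkpoints would need an extra rule). `detect_checkpoint_kind` is a hypothetical helper, not the actual code in `eval_functional_judge.py`.

```python
from pathlib import Path


def detect_checkpoint_kind(model_path: str) -> str:
    """Classify a checkpoint by the files it contains, not by its path string."""
    root = Path(model_path)
    # PEFT-style adapters ship an adapter_config.json next to the adapter weights.
    if (root / "adapter_config.json").is_file():
        return "lora"
    # A full fine-tune saved as one file has model.safetensors and no adapter config.
    if (root / "model.safetensors").is_file():
        return "full"
    return "unknown"
```

Because the result depends only on directory contents, the same answer comes back for `models/sft-v5` and for its absolute path.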
120
-
121
- ### 08:28 UTC — Second failure: weight key mismatch
122
- - vLLM loaded `models/sft-v5` as full model but crashed: `KeyError: 'embedding.weight'`
123
- - Investigation: SFT training saved `backbone.embedding.weight` (singular) but base model uses `backbone.embeddings.weight` (plural)
124
- - **Fix applied:** Renamed key in safetensors file (backup at `model.safetensors.bak`)
125
- - This is a known quirk of the NemotronH model's HuggingFace vs internal naming
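The key rename can be illustrated on a plain dictionary. This is a sketch only: it assumes the weights are already loaded into a `dict` keyed by tensor name (presumably from `model.safetensors`; loading and saving are left out), and `rename_key` is a hypothetical helper rather than the script that was actually run.

```python
def rename_key(state_dict: dict, old: str, new: str) -> dict:
    """Return a new dict with one key renamed; values are reused, not copied."""
    if old not in state_dict:
        raise KeyError(f"{old!r} not found in checkpoint")
    if new in state_dict:
        raise ValueError(f"{new!r} already exists; refusing to overwrite")
    # Rebuild the mapping so the original key order is preserved.
    return {(new if key == old else key): value for key, value in state_dict.items()}


# Placeholder values stand in for tensors.
weights = {"backbone.embedding.weight": "tensor-0", "backbone.norm.weight": "tensor-1"}
fixed = rename_key(weights, "backbone.embedding.weight", "backbone.embeddings.weight")
```

Refusing to overwrite an existing key keeps the operation safe to re-run after a partial fix, which complements the `.bak` backup mentioned above.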
126
-
127
- ### 08:34 UTC — Manual re-run of eval (fixed)
128
- - Ran eval_functional_judge.py directly with both fixes applied
129
- - Base eval: 0.653 ± 0.333 (consistent with prior runs)
130
- - SFT-v5 eval: 0.667 ± 0.377
131
-
132
- ## Eval Results
133
-
134
- | Model | Score | on_topic | uses_guest | probing | Avg Words |
135
- |-------|-------|----------|------------|---------|-----------|
136
- | Base | 0.653 | 68% | 48% | 80% | 15 |
137
- | SFT v5 | 0.667 | 76% | 60% | 64% | 15 |
138
- | Delta | +0.014 | +8% | +12% | -16% | 0 |
139
-
140
- ## Decision: MARGINAL / NEUTRAL
141
-
142
- **Delta +0.014 is within noise (n=25, std ~0.35).**
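The "within noise" claim can be checked with a back-of-the-envelope standard error. The sketch below assumes the `±` values reported for the two evals (0.333 and 0.377) are per-prompt standard deviations over the same n=25 prompts, and it treats the two runs as independent because per-prompt scores are not in this log. A paired comparison on the raw scores would be the better test if they are available.

```python
import math


def se_of_difference(std_a: float, std_b: float, n: int) -> float:
    """Standard error of the difference of two means, each over n samples."""
    return math.sqrt(std_a**2 / n + std_b**2 / n)


se = se_of_difference(0.333, 0.377, 25)
delta = 0.667 - 0.653
ratio = delta / se
print(f"SE of difference: {se:.3f}, delta/SE: {ratio:.2f}")
# prints: SE of difference: 0.101, delta/SE: 0.14
```

Under these assumptions the observed delta is roughly one seventh of a standard error, so a difference of this size carries essentially no evidence in either direction.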
143
-
144
- **Detailed analysis:**
145
- - **on_topic:** 68% → 76% (+8%) — slight improvement ✅
146
- - **uses_guest:** 48% → 60% (+12%) — meaningful improvement in referencing guest content ✅
147
- - **probing:** 80% → 64% (-16%) — significant degradation ❌ — SFT model asks more surface-level follow-ups
148
-
149
- **Interpretation:** SFT v5 learned to reference the guest's content better (uses_guest +12%) but lost depth (probing -16%). The net effect is roughly neutral. 3 epochs of full fine-tune on 4,075 pairs shifted the model's question style but didn't clearly improve overall quality.
150
-
151
- **No auto-launch of GRPO.** Per the decision framework, this falls in the "marginal/neutral" bucket.
152
-
153
- **Recommendations for Bobber:**
154
- 1. Try LoRA (r=64) instead of full fine-tune — less catastrophic style shift
155
- 2. Try 1 epoch only — the loss went from 8.2→1.86 which may be overfitting
156
- 3. Filter training data more aggressively — the 4,075 generated pairs may be diluting signal
157
- 4. Consider that probing degradation suggests the model is memorizing surface patterns
158
-
159
- ### 08:41 UTC — Session wrapping up
160
- - Eval results saved to `results/functional_judge_sft_v5_vs_base.json`
161
- - Disabling overnight monitor cron job
162
- - Cleaning up stale cron jobs
163
-
164
- ### 08:28 UTC — Eval failed with code 1
165
-
166
- ### 08:28 UTC — Eval failed — manual intervention needed
167
-
168
- ### 08:28 UTC — Auto eval failed
169
- (EngineCore pid=89624) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
170
- (EngineCore pid=89624) return self.__get_result()
171
- (EngineCore pid=89624) ^^^^^^^^^^^^^^^^^^^
172
- (EngineCore pid=89624) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
173
- (EngineCore pid=89624) raise self._exception
174
- (EngineCore pid=89624) File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 82, in collective_rpc
175
- (EngineCore pid=89624) result = run_method(self.driver_worker, method, args, kwargs)
176
- (EngineCore pid=89624) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
177
- (EngineCore pid=89624) File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 459, in run_method
178
- (EngineCore pid=89624) return func(*args, **kwargs)
179
- (EngineCore pid=89624) ^^^^^^^^^^^^^^^^^^^^^
180
- (EngineCore pid=89624) File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 332, in execute_model
181
- (EngineCore pid=89624) return self.worker.execute_model(scheduler_output)
182
-
183
- (EngineCore pid=89624) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
184
- File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 817, in get_output
185
- (EngineCore pid=89624) File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
186
- (EngineCore pid=89624) return func(*args, **kwargs)
187
- (EngineCore pid=89624) ^^^^^^^^^^^^^^^^^^^^^
188
- (EngineCore pid=89624) File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 822, in execute_model
189
- (EngineCore pid=89624) output = self.model_runner.execute_model(
190
- (EngineCore pid=89624) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
191
- (EngineCore pid=89624) File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
192
- (EngineCore pid=89624) return func(*args, **kwargs)
193
- (EngineCore pid=89624) ^^^^^^^^^^^^^^^^^^^^^
194
- (EngineCore pid=89624) File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3625, in execute_model
195
- (EngineCore pid=89624) logits_indices, spec_decode_metadata = self._prepare_inputs(
196
- (EngineCore pid=89624) ^^^^^^^^^^^^^^^^^^^^^
197
- (EngineCore pid=89624) File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1877, in _prepare_inputs
198
- (EngineCore pid=89624) self.set_active_loras(
199
- (EngineCore pid=89624) File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/v1/worker/lora_model_runner_mixin.py", line 89, in set_active_loras
200
- (EngineCore pid=89624) return self._set_active_loras(
201
- (EngineCore pid=89624) ^^^^^^^^^^^^^^^^^^^^^^^
202
- (EngineCore pid=89624) File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/v1/worker/lora_model_runner_mixin.py", line 67, in _set_active_loras
203
- (EngineCore pid=89624) self.lora_manager.set_active_adapters(lora_requests, lora_mapping)
204
- (EngineCore pid=89624) File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/lora/worker_manager.py", line 176, in set_active_adapters
205
- (EngineCore pid=89624) self._apply_adapters(requests)
206
- (EngineCore pid=89624) File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/lora/worker_manager.py", line 263, in _apply_adapters
207
- (EngineCore pid=89624) self.add_adapter(lora)
208
- (EngineCore pid=89624) File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/lora/worker_manager.py", line 279, in add_adapter
209
- (EngineCore pid=89624) lora = self._load_adapter(lora_request)
210
- (EngineCore pid=89624) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
211
- (EngineCore pid=89624) File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/lora/worker_manager.py", line 151, in _load_adapter
212
- (EngineCore pid=89624) raise LoRAAdapterNotFoundError(
213
- (EngineCore pid=89624) vllm.exceptions.LoRAAdapterNotFoundError: Loading lora adapter failed: No adapter found for /home/bobber/lex-ft/models/sft-v5
214
- raise self._format_exception(outputs) from None
215
- vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
216
-
217
- Processed prompts: 0%| | 0/25 [00:02<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
218
- [08:28 UTC] Eval failed with code 1
219
- [08:28 UTC] Eval failed — manual intervention needed
docs/CURRENT_STATE_2026-03-20.md DELETED
@@ -1,95 +0,0 @@
1
- # Current State — Lex Fridman Interviewer Project
2
-
3
- Updated: 2026-03-20 UTC
4
- Project root: `/home/bobber/lex-ft`
5
- HF docs target: `bobber/lex-fridman-interviewer-project`
6
-
7
- ## 1) What exists right now
8
-
9
- ### Data
10
- - Main current training dataset: `data/interview_segments_v2.jsonl`
11
- - Size: **7,580 segments**
12
- - Validation status: **PASS** (`logs/validate_v2.log`)
13
- - Key stats from validation:
14
- - user-first segments: **75.5%**
15
- - assistant turns that are questions: **76%**
16
- - target-quality score: **4.95/6**
17
- - assistant length: P50 **26**, P75 **50**, P95 **112**, max **521** words
18
- - Important correction: older docs still say v2 data "needs validation"; that is now stale. The validation already passed.
19
-
20
- ### Training artifacts
21
- - Base model folder: `models/NVIDIA-Nemotron-3-Nano-4B`
22
- - LoRA/SFT v1 adapter: `models/lex-interviewer-sft`
23
- - LoRA/SFT v2 adapter: `models/lex-interviewer-sft-v2`
24
- - Merged-ish export folder for v2 eval/gguf prep: `models/lex-interviewer-v2-gguf`
25
- - Q8 GGUF export of v2: `models/lex-interviewer-v2-gguf_gguf/NVIDIA-Nemotron-3-Nano-4B.Q8_0.gguf`
26
-
27
- ### Scripts that matter
28
- - `scripts/train_sft.py` — current Unsloth training script (still LoRA-based, despite v2 data improvements)
29
- - `scripts/validate_training_data.py` — pre-train data gate
30
- - `scripts/restructure_data.py` — builds v2 data
31
- - `scripts/eval_via_server.py` — llama.cpp server eval
32
- - `scripts/eval_openai.py`, `scripts/eval_anthropic.py`, `scripts/eval_gemini.py`, `scripts/eval_gemini31.py`
33
- - `scripts/overnight_pipeline.sh` — Q8 eval pipeline + validation + training + export/eval chain
34
-
35
- ## 2) Current scorecard
36
-
37
- ### Best current interviewer behavior
38
- - **Nemotron 4B base**: about **4.35/5**
39
- - Still the strongest local interviewer behavior seen in this project so far
40
-
41
- ### Larger model references
42
- - Nemotron 30B Q8: **4.25/5**
43
- - Qwen3.5-35B-A3B Q8: **3.55/5**
44
- - Qwen3.5-27B Q8: **3.25/5**
45
-
46
- ### Fine-tuned models
47
- - SFT v1: poor
48
- - SFT v2 trained result: **2.00/5**, ~**292 words** average
49
- - Failure mode: lectures/monologues instead of sharp interviewer questions
50
-
51
- ## 3) What the repo tells us technically
52
-
53
- ### The current training script is still LoRA, not full fine-tune
54
- `train_sft.py` currently:
55
- - loads in 4-bit
56
- - applies LoRA with
57
- - `r=16`
58
- - target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
59
- - trainable params: **10,119,168 / 2,661,488,224 = 0.38%**
60
-
61
- This means the latest training run that produced the bad v2 result was **not** a full fine-tune. It was still adapter training on a Nemotron hybrid architecture.
62
-
63
- ### Why that matters
64
- This strongly supports the current project decision:
65
- - **production path:** try **full fine-tune** next
66
- - **research/debug path:** inspect whether LoRA touched meaningful parts of the hybrid architecture later
67
-
68
- ## 4) Most likely current diagnosis
69
-
70
- There are two different truths now:
71
- 1. **The v2 dataset looks materially better and passes validation**
72
- 2. **The current v2 trained model is still bad**
73
-
74
- That points away from "data is still obviously broken" and more toward one or more of:
75
- - LoRA is a poor fit for this Nemotron hybrid/Mamba-ish architecture
76
- - adapter targets are not reaching the behavior-critical parts of the model
77
- - training format/template still induces long assistant completions instead of concise interviewer questions
78
- - eval-time export/inference path may preserve some formatting/pathology from the LoRA route
79
-
80
- ## 5) Practical state of the project right now
81
-
82
- If we had to continue from scratch today, the cleanest reading is:
83
- - the **dataset is good enough to justify another run**
84
- - the **current LoRA route is the thing under suspicion**
85
- - the next serious experiment should be **Unsloth full fine-tune**, not another LoRA-only iteration
86
-
87
- ## 6) Immediate next action recommended
88
-
89
- Run a controlled **full fine-tune SFT** on `data/interview_segments_v2.jsonl` with:
90
- - step-50 checkpoint/eval gate
91
- - same eval harness as current leaderboard
92
- - no claims of success until the step-50 model beats base on actual interviewer eval
93
-
94
- ## 7) Known housekeeping issue
95
- - `/home/bobber/lex-ft` is **not a git repo** right now, so changes here are not versioned unless manually synced elsewhere.
docs/CURRENT_STATE_2026-03-23.md DELETED
@@ -1,108 +0,0 @@
1
- # Current State — Lex Fridman Interviewer Project
2
-
3
- Updated: 2026-03-23 UTC
4
- Project root: `/home/bobber/lex-ft`
5
- HF docs target: `bobber/lex-fridman-interviewer-project`
6
-
7
- ## 1) What exists right now
8
-
9
- ### Data
10
- - Training dataset: `data/interview_segments_v2.jsonl` — **7,580 segments**, validated ✅
11
- - 113 episodes crawled from lexfridman.com (official human transcripts with speaker labels)
12
- - Validation: 75.5% user-first, 76% question targets, avg score 4.95/6
13
-
14
- ### Training artifacts
15
- | Artifact | Type | Status | Score |
16
- |----------|------|--------|-------|
17
- | `models/NVIDIA-Nemotron-3-Nano-4B` | Base model | ✅ Best performer | **4.35/5** |
18
- | `models/lex-interviewer-sft` | LoRA SFT v1 | ❌ Failed | 2.10/5 |
19
- | `models/lex-interviewer-sft-v2` | LoRA SFT v2 | ❌ Failed | 2.00/5 |
20
- | `models/lex-interviewer-grpo-lora-v3` | LoRA GRPO v3 (125 steps) | ❌ Failed | N/A (gibberish) |
21
- | `models/lex-interviewer-grpo-lora-v3-step{25,50,75,100,125}` | GRPO checkpoints | ❌ Failed | N/A (gibberish) |
22
-
23
- ### Scripts
24
- | Script | Purpose |
25
- |--------|---------|
26
- | `scripts/train_grpo_v3.py` | GRPO v3 training (off-policy, proven broken) |
27
- | `scripts/reward_v3.py` | Heuristic reward for GRPO |
28
- | `scripts/train_sft.py` | SFT training (Unsloth LoRA) |
29
- | `scripts/validate_training_data.py` | Pre-train data gate |
30
- | `scripts/eval_via_server.py` | Eval via llama.cpp server |
31
-
32
- ## 2) What happened with GRPO v3
33
-
34
- **Full analysis:** `docs/GRPO_V3_POSTMORTEM.md`
35
-
36
- GRPO v3 used a hybrid architecture: llama.cpp (base model, Q4_K_M) for generation + HF transformers (base + LoRA) for log-prob computation and gradient updates. Training ran for 125 steps (9.72h) with positive reward metrics throughout.
37
-
38
- **Result:** The merged LoRA model generates complete gibberish. Training rewards were misleading — they measured the base model's generation quality, not the LoRA's.
39
-
40
- **6 critical gaps identified:**
41
- 1. 🔴 Off-policy generation — generator ≠ learner
42
- 2. 🔴 No KL reference — no anchor preventing divergence
43
- 3. 🔴 Token truncation — 512 max_length vs 800 max_tokens
44
- 4. 🟡 Architecture mismatch — LoRA can't modify Mamba SSM dynamics
45
- 5. 🟡 No credit assignment — uniform token weighting
46
- 6. 🟡 Thinking in training, not in reward — gradient/reward mismatch on `<think>` tokens
47
-
48
- ## 3) Current scorecard
49
-
50
- | Rank | Model | Score | Notes |
51
- |------|-------|-------|-------|
52
- | 🥇 | **Nemotron 4B base** | **4.35/5** | System prompt only, no fine-tuning |
53
- | 🥈 | GPT-5.4 | 4.30/5 | Cloud API |
54
- | 🥉 | Nemotron 30B-A3B Q8 | 4.25/5 | 7.5x larger, yet 0.10 below the 4B base |
55
- | 4 | Gemini 3.1 Pro | 3.70/5 | Verbose |
56
- | 4 | Claude Opus 4.6 | 3.70/5 | Too wordy (121 words avg) |
57
- | ❌ | All fine-tuned variants | ≤3.20/5 | SFT, full SFT, and GRPO all worse than base |
58
-
59
- **Key insight:** A free, local 4B model matches or beats every cloud API tested on this task (4.35/5 vs 4.30/5 for GPT-5.4, the closest). Every fine-tuning attempt has made it worse.
60
-
61
- ## 4) Why fine-tuning keeps failing
62
-
63
- Three approaches (LoRA SFT, full SFT, and GRPO) have now failed; full SFT is listed once per run:
64
-
65
- | Approach | Why it failed |
66
- |----------|--------------|
67
- | LoRA SFT (v1, v2) | Only 0.38% params trained; 38/42 Mamba layers untouched |
68
- | Full SFT (v1) | Think-tag format mismatch — good output trapped in `<think>` |
69
- | Full SFT (v2) | Data quality issues — `user\n` contamination, low question rate |
70
- | GRPO v3 (LoRA) | Off-policy RL — 6 critical gaps between reward and learning |
71
-
72
- Common thread: the Nemotron hybrid Mamba-2 architecture is hostile to standard fine-tuning approaches. The base model's interviewer behavior comes from its pre-training, and current methods either can't reach it (LoRA) or corrupt it (full SFT format issues, off-policy RL divergence).
73
-
74
- ## 5) Viable next steps
75
-
76
- ### Option A: Ship the base model (recommended)
77
- The base model already outperforms every cloud API tested. Deploy it with a good system prompt. No fine-tuning needed.
78
-
79
- ### Option B: SFT on curated GRPO completions
80
- GRPO training logged ~1000 completions. Filter for reward ≥ 0.5 to get high-quality Lex-style questions generated by the base model. Use these as supervised training data — no off-policy gap, no reward mismatch. This is distillation from the base model's own best outputs.
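The filtering step in Option B could look roughly like the sketch below. The schema of `logs/grpo_v3_completions.jsonl` is not shown in these docs, so the `reward` field name is an assumption, and `filter_by_reward` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def filter_by_reward(lines: Iterable[str], min_reward: float = 0.5,
                     reward_key: str = "reward") -> Iterator[dict]:
    """Yield JSONL records whose reward is at least min_reward."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        # Records without a reward field are treated as reward 0.0 and dropped.
        if record.get(reward_key, 0.0) >= min_reward:
            yield record
```

Any iterable of lines works, so an open file handle for the completions log can be passed directly and the kept records written out as a new SFT dataset.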
81
-
82
- ### Option C: On-policy GRPO
83
- Periodically merge LoRA → GGUF → use the merged model for generation. This would close the off-policy gap that broke GRPO v3, but it adds significant engineering complexity (a merge + convert cycle every N steps), and the one merged LoRA model produced so far (GRPO v3) generated gibberish.
84
-
85
- ### Option D: Full SFT v3 with clean data
86
- Use the validated v2 data with proper chat template format. Previous full SFT v2 showed promise (3.20/5 vs 2.00/5 for LoRA) but still below base. Would need v3+ data with stricter quality filtering.
87
-
88
- ## 6) Disk situation
89
-
90
- ```
91
- /dev/nvme0n1p2 3.7T 3.4T 128G 97% /
92
- ```
93
-
94
- 128 GB free after cleaning 128 GB of HuggingFace cache. The remaining locked cache (gpt-oss-120b, GLM-4.6V) requires `sudo` to remove.
95
-
96
- ## 7) Key files for reference
97
-
98
- | File | What it tells you |
99
- |------|-------------------|
100
- | `docs/GRPO_V3_POSTMORTEM.md` | Why off-policy GRPO failed (6 gaps) |
101
- | `docs/EVAL_RESULTS.md` | Full eval leaderboard |
102
- | `docs/TECHNICAL_CHALLENGES.md` | All technical challenges + resolutions |
103
- | `logs/grpo_v3_completions.jsonl` | All GRPO training completions + rewards |
104
- | `logs/run_grpo_v3.log` | GRPO training log |
105
-
106
- ---
107
-
108
- *Previous state: `docs/CURRENT_STATE_2026-03-20.md`*
docs/CURRENT_STATE_2026-03-26.md DELETED
@@ -1,304 +0,0 @@
1
- # Current State — Lex Fridman Interviewer Project
2
-
3
- Updated: 2026-03-26 23:54 UTC
4
- Project root: `/home/bobber/lex-ft`
5
- HF docs target: `bobber/lex-fridman-interviewer-project`
6
-
7
- ---
8
-
9
- ## 1) Where We Are
10
-
11
- ### The core result
12
- **Base Nemotron 4B (GGUF, llama.cpp) scores 7.12/10 on the interviewer eval.**
13
- Every fine-tuning attempt so far has degraded the model. We now understand *why* — and have fixed the critical generation bug.
14
-
15
- | Approach | Score | Status |
16
- |---|---|---|
17
- | **Base model (llama.cpp GGUF)** | **7.12/10** | ✅ Best |
18
- | SFT v4 Triton (250 steps, LoRA) | 5.36/10 | ❌ Below base |
19
- | SFT v2 (100 steps, LoRA) | 5.08/10 | ❌ Below base |
20
- | GRPO v3 (off-policy, 125 steps) | gibberish | ❌ Off-policy broken |
21
- | GRPO v4 (mock rmsnorm) | oscillating loss | ❌ Mock kernel broken |
22
- | GRPO v6 runs 1-10 (full FT) | collapse/garbage | ❌ gen_max_tokens bug + unstable |
23
- | **GRPO v7 run10 (current)** | step 0: 4.04/5 best | 🔄 Running |
24
-
25
- ---
26
-
27
- ## 2) The Critical Bug — gen_max_tokens Was Way Too Small
28
-
29
- **This is the most important finding from today's session.**
30
-
31
- Nemotron 4B is a thinking model. Every response goes through a `<think>` block before answering. Token budget:
32
-
33
- - Thinking phase: ~600–1100 tokens (varies with prompt)
34
- - Actual answer (the question): ~20–100 tokens
35
- - **Total needed: ~700–1200 tokens minimum**
36
-
37
- All previous GRPO runs (v3–v6) used `gen_max_tokens=300` or `800`. This meant:
38
- - The model was **mid-thinking** when generation was cut off
39
- - `strip_thinking()` found no `</think>` → returned empty string
40
- - `reward_fn("", ...)` → **reward = 0.0 on every completion**
41
- - The only completions scoring >0 were those that skipped thinking entirely
42
- - GRPO was **actively training the model to not think** → template collapse
43
-
44
- Evidence: llama.cpp server with `max_tokens=4000` produces clean 14–27 word Lex-style questions scoring 4.0+/5. Same prompt with `max_new_tokens=800` → truncated mid-think → empty → reward=0.
45
-
46
- ### strip_thinking() was also wrong
47
-
48
- Old (broken):
49
- ```python
50
- re.sub(r'<think>.*?</think>', '', text) # only removes closed think blocks
51
- ```
52
-
53
- New (correct):
54
- ```python
55
- if '</think>' in text:
56
-     return text[text.index('</think>') + len('</think>'):].strip()
57
- elif '<think>' in text:
58
-     return ''  # truncated mid-think: discard
59
- else:
60
-     return text.strip()  # no thinking block: use as-is
61
- ```
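Wrapped as a function, the corrected logic can be exercised on the three cases it distinguishes. This is a sketch based on the snippet above; the function signature is assumed, since the docs show only the body.

```python
def strip_thinking(text: str) -> str:
    """Return the visible answer, dropping any <think> reasoning block."""
    close_tag = "</think>"
    if close_tag in text:
        # Keep only what follows the first closing tag.
        return text[text.index(close_tag) + len(close_tag):].strip()
    if "<think>" in text:
        # Generation was cut off mid-thought, so there is no answer to score.
        return ""
    # The model skipped thinking entirely; use the text as-is.
    return text.strip()
```

The middle branch is the one that matters for GRPO: a completion truncated by a small token budget produces an empty answer, which then scores zero in the reward function.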
62
-
63
- ---
64
-
65
- ## 3) GRPO v6 — Full Fine-Tune Collapse Analysis
66
-
67
- Ran 10+ full fine-tune runs (42 layers, ~70GB model) before diagnosing the root cause.
68
-
69
- ### What happened in each run
70
-
71
- **Runs 1–10:** Various hyperparameters, all failed the same way:
72
- - Steps 0–20: Semi-coherent output, rewards mostly 0.0 (from gen_max_tokens truncation)
73
- - Steps 20–50: Template collapse into "We need to respond to user. Probably the user is asking..."
74
- - Steps 50+: Mode collapse — all 4 completions identical → zero advantage → zero gradient
75
-
76
- **Run 5 specific failure:** Triton workspace OOM — the backward kernel allocated a new ~3GB workspace each step (size varies with sequence length). After 3 steps, CUDA allocator fragmented. Step 3 took **51 minutes** to find a contiguous 3GB block.
77
-
78
- Fix applied: persistent pre-allocated workspace stored as function attribute `ssm_scan_triton_bwd._triton_bwd_workspace`. Same buffer reused every step.
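The function-attribute pattern behind this fix can be sketched without a GPU. In the sketch below a `bytearray` stands in for the CUDA workspace tensor, and the buffer grows only when a larger request arrives. That growth policy is an assumption; the docs say only that the same pre-allocated buffer is reused every step. `get_workspace` is a hypothetical name.

```python
def get_workspace(nbytes: int) -> bytearray:
    """Return a reusable scratch buffer of at least nbytes.

    The buffer is cached as an attribute on the function itself, so repeated
    calls reuse one allocation instead of requesting a fresh block each step.
    """
    buf = getattr(get_workspace, "_workspace", None)
    if buf is None or len(buf) < nbytes:
        buf = bytearray(nbytes)  # stand-in for a device allocation
        get_workspace._workspace = buf
    return buf
```

Sizing the first request for the longest expected sequence means the allocation happens once and later steps never touch the allocator, which is what avoids the fragmentation described above.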
79
-
80
- ### Root cause: full fine-tune is too unstable for this task
81
- - LoRA targets only 4 attention layers (1.03% of params) — stable
82
- - Full fine-tune trains all 42 layers — one bad gradient step destroys base model behavior
83
- - lr=1e-5 with no warmup → immediate catastrophic update
84
-
85
- ---
86
-
87
- ## 4) GRPO v7 Design
88
-
89
- LoRA + correct token budget + warmup + llama.cpp generation.
90
-
91
- ### Architecture
92
-
93
- ```
94
- Per step:
95
- 1. llama.cpp server generates 4 completions (enable_thinking=True via GGUF)
96
- → server returns content (visible answer) + reasoning_content (thinking)
97
- → no stripping needed — llama.cpp handles </think> boundary natively
98
- 2. reward_fn(content, prompt) for each completion
99
- 3. GRPO advantages (normalize within group)
100
- 4. HF model forward: compute log-probs on full completion (think + answer tokens)
101
- 5. reference_lp: LoRA disabled (same weights, base model behavior)
102
- 6. loss = -adv * mean_lp_visible + kl_coef * KL(policy||ref)
103
- 7. loss.backward(); optimizer.step()
104
- ```
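Step 3 of the loop above, and the mode-collapse failure described for GRPO v6, both follow from how group-relative advantages are computed. The sketch below is a minimal stdlib-only illustration, assuming mean/std normalization with a small `eps`. The function name `group_advantages` and the `eps` value are hypothetical and are not taken from `grpo_v7_train.py`.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of completions.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four distinct rewards give a usable learning signal.
mixed = group_advantages([0.0, 1.0, 2.0, 3.0])

# Four identical completions (mode collapse) give all-zero advantages,
# so the policy-gradient term contributes no gradient.
collapsed = group_advantages([1.5, 1.5, 1.5, 1.5])
```

With a group of 4, a single collapsed prompt contributes nothing to the policy-gradient term, which is consistent with the "zero advantage → zero gradient" stall observed at steps 50+.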
105
-
106
- ### Config (run10, currently running)
107
- ```
108
- --steps 500
109
- --lr 2e-5
110
- --kl-coef 0.05
111
- --generations 4
112
- --gen-tokens 4000 # safe headroom for full thinking chain
113
- --force-think-close 800
114
- --warmup 30 # linear warmup to avoid early catastrophic update
115
- LoRA r=32, alpha=64, targets: q/k/v/o/gate/up/down_proj
116
- ```
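The `--warmup 30` flag is described only as a linear warmup. A minimal sketch of such a schedule, assuming the rate ramps linearly to `--lr 2e-5` over the first 30 steps and then stays constant (the post-warmup behavior and the name `warmup_lr` are assumptions, not taken from the training script):

```python
def warmup_lr(step, base_lr=2e-5, warmup_steps=30):
    """Linear warmup to base_lr, then constant (assumed; no decay modeled)."""
    if warmup_steps <= 0:
        return base_lr
    scale = min(1.0, (step + 1) / warmup_steps)
    return base_lr * scale

# Learning rate for the first 40 optimizer steps.
schedule = [warmup_lr(s) for s in range(40)]
```

Under these assumptions the first update uses 1/30 of the target rate, which is the property that guards against the immediate catastrophic update seen with `lr=1e-5` and no warmup in v6.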
117
-
118
- ### Generation: llama.cpp server (off-policy, for now)
119
-
120
- The HF model in Python has near-zero P(`</think>`) — it starts thinking but never closes the block. This is model behavior: NVIDIA trained it with their own reasoning infrastructure (vLLM + `selective_state_update` CUDA kernel).
121
-
122
- llama.cpp handles this via `reasoning_format: "deepseek"` + `thinking_forced_open: true` — it detects reasoning content and manages `</think>` injection.
123
-
124
- **Off-policy gap:** generations come from GGUF model, training updates HF model. This is the same gap that broke GRPO v3. Importance sampling correction is not yet implemented.
125
-
126
- ### run10 progress (as of 23:54 UTC)
127
-
128
- ```
129
- Step 0: reward mean=1.928 best: "When you say the self disappears during meditation,
130
- how does that experience feel different from ordinary states of mind?" → 4.04/5
131
- Step 1: reward mean=0.980 best: "What do you think the most profound consequence of
132
- unregulated genetic selection for intelligence might be, beyond the obvious?" → 3.92/5
133
- ```
134
-
135
- - Log: `results/grpo_v7_run10_stdout.log`
136
- - W&B: `wejcyyj5` — https://wandb.ai/bobber-cheng/lex-interviewer/runs/wejcyyj5
137
- - PID: 35834 (PGID=SID=35834, properly detached)
138
- - llama-server: PID 38286, port 30000
139
-
140
- ---
141
-
142
- ## 5) Major Breakthrough: mamba-ssm + causal-conv1d Built on GB10
143
-
144
- ### The problem
145
- All previous HF-based generation fell back to Python (slow, wrong decode):
146
- ```
147
- WARNING: The fast path is not available because one of
148
- (selective_state_update, causal_conv1d_fn, causal_conv1d_update) is None.
149
- Falling back to the naive implementation.
150
- ```
151
-
152
- Without `selective_state_update`, the decode step runs in Python with BF16 — and produces near-zero P(`</think>`) because the SSM state diverges from training conditions.
153
-
154
- ### Root cause of build failure
155
- DGX Spark uses **CUDA 13.0**, but `/usr/bin/nvcc` symlinks to a **CUDA 12.0** toolchain. `pip install mamba-ssm` picked up the wrong compiler and failed with CUDA version mismatch.
156
-
157
- The actual CUDA 13.0 nvcc is at `/usr/local/cuda-13.0/bin/nvcc`.
158
-
159
- ### Fix
160
- ```bash
161
- CUDA_HOME=/usr/local/cuda-13.0 \
162
- PATH=/usr/local/cuda-13.0/bin:$PATH \
163
- TORCH_CUDA_ARCH_LIST="12.0" \
164
- pip install mamba-ssm causal-conv1d --no-build-isolation
165
- ```
166
-
167
- Compiled aarch64 CUDA kernels for SM 12.1 (Blackwell). Build took ~45 minutes.
168
-
169
- ### Result
170
- ```python
171
- import mamba_ssm
172
- from mamba_ssm.ops.triton.selective_state_update import selective_state_update
173
- # → <function selective_state_update at 0xf03c0de6bce0> ✅ (not None)
174
-
175
- import causal_conv1d
176
- # → causal_conv1d_fn: <function causal_conv1d_fn at ...> ✅ (not None)
177
- ```
178
-
179
- Installed versions:
180
- - `mamba_ssm-2.3.1-cp312-cp312-linux_aarch64.whl` (351 MB)
181
- - `causal_conv1d-1.6.1`
182
- - Cached at: `~/.cache/pip/wheels/28/83/54/d45107838...`
183
-
184
- ### What this unlocks
185
-
186
- With `selective_state_update` available, the HF model decode runs via the real CUDA kernel (same code path as NVIDIA's training). This should fix the P(`</think>`) ≈ 0 issue and enable **fully on-policy GRPO** — removing the off-policy gap entirely.
187
-
188
- Testing in progress at session end. If confirmed working, GRPO v8 will switch generation from llama.cpp to the HF model directly.
189
-
190
- ---
191
-
192
- ## 6) vLLM — Working ✅ (2026-03-27)
193
-
194
- vLLM 0.18.0 successfully loads NemotronH via its own `nemotron_h.py` backend. **No mamba-ssm needed** — vLLM has its own Mamba-2 kernel.
195
-
196
- **Previous attempts failed** because `pip install vllm` pulled `torch 2.10.0+cpu` (CPU-only). Fix: install CUDA torch first.
197
-
198
- ```bash
199
- cd /home/bobber/lex-ft && source .venv-vllm/bin/activate
200
- pip install torch==2.10.0+cu130 --index-url https://download.pytorch.org/whl/cu130
201
- pip install vllm
202
- ```
203
-
204
- Confirmed working: `</think>` closes naturally, batch generation works, output quality is good.
205
- See `docs/VLLM_SETUP_NOTES.md` for full installation and usage guide.
206
-
207
- ---
208
-
209
- ## 7) NVIDIA's Actual Training Method (from NeMo repo)
210
-
211
- From reviewing the [NeMo Nemotron training branch](https://github.com/NVIDIA-NeMo/Nemotron/tree/nano-3-training):
212
-
213
- - **Generation:** vLLM (which has native Mamba-2 CUDA support)
214
- - **Reasoning control:** `enable_thinking=True` for RL training, `enable_thinking=False` for no-think mode
215
- - `enable_thinking=True` → prompt ends with `<think>\n` (model thinks then answers)
216
- - `enable_thinking=False` → prompt ends with `<think></think>` (model answers directly)
217
- - **Off-policy correction:** `use_importance_sampling_correction=True`
218
- - **10% no-think samples** mixed in during training
219
- - **Verifiable tasks** (math, code, JSON schema) for binary rewards
220
-
221
- Our approach differs: we're fine-tuning on a style task (interviewer) rather than verifiable capabilities. The reward is heuristic, not binary. This makes RL harder — but the approach is the same.
222
-
223
- ---
224
-
225
- ## 7) Current Codebase
226
-
227
- ### Core files
228
- | File | Purpose |
229
- |---|---|
230
- | `grpo_v7_train.py` | GRPO v7 — LoRA, warmup, 4000-token budget |
231
- | `run_grpo_v7.sh` | Launch script (detached) |
232
- | `ssm_generate.py` | `generate_cached`, `generate_cached_batch` |
233
- | `ssm_scan_triton.py` | Triton fwd+bwd SSM kernel |
234
- | `ssm_scan_backward.py` | Sequential backward reference |
235
- | `ssm_decode_fused.py` | Fused Triton decode step |
236
- | `tests/validate_correct_scan.py` | Mamba layer patcher |
237
- | `grpo_v6_train.py` | GRPO v6 — full fine-tune (deprecated) |
238
-
239
- ### Results on disk
240
- | Path | Contents |
241
- |---|---|
242
- | `results/grpo_v7_run10_stdout.log` | Current run (live) |
243
- | `results/grpo_v7_run*/` | Earlier v7 smoke tests |
244
-
245
- ---
246
-
247
- ## 8) Eval Leaderboard (as of 2026-03-26)
248
-
249
- ### 5-Score (heuristic, used in GRPO reward)
250
- | Rank | Model | Score | Words avg |
251
- |---|---|---|---|
252
- | 🥇 | **Base Nemotron 4B (llama.cpp)** | **4.35/5** | 59 |
253
- | 2 | GPT-5.4 | 4.30/5 | 52 |
254
- | 3 | Nemotron 30B-A3B Q8 | 4.25/5 | 62 |
255
- | 4 | Gemini 3.1 Pro | 3.70/5 | 82 |
256
- | 4 | Claude Opus 4.6 | 3.70/5 | 121 |
257
- | 6 | Qwen3.5-35B-A3B Q8 | 3.55/5 | 51 |
258
- | 7 | SFT v1 (LoRA) | 2.10/5 | — |
259
- | 8 | SFT v2 (LoRA) | 2.00/5 | 292 |
260
-
261
- ### 10-Score (canonical)
262
- | Rank | Model | Score |
263
- |---|---|---|
264
- | 🥇 | **Base Nemotron 4B (llama.cpp)** | **7.12/10** |
265
- | 2 | SFT v4 Triton | 5.36/10 |
266
- | 3 | SFT v2 | 5.08/10 |
267
-
268
- ---
269
-
270
- ## 9) What's Next: GRPO v8 (On-Policy with vLLM)
271
-
272
- v7 run10 was terminated (was off-policy, had stalled). Next is **GRPO v8** — fully on-policy using vLLM.
273
-
274
- ### Design (from NVIDIA Nemotron 3 Nano cookbook)
275
- - vLLM generates from the **same HF weights** being trained (on-policy)
276
- - Importance sampling correction for vLLM/HF probability mismatch
277
- - PPO-style ratio clipping: `ratio_clip_min=0.2, ratio_clip_max=0.28` (asymmetric, from NVIDIA)
278
- - Overlong filtering: exclude truncated completions from loss
279
- - 10% no-think mixing (`enable_thinking=False`) — from NVIDIA recipe
280
- - lr: 3e-6 (lower than v7's 2e-5, matches NVIDIA post-SFT RL lr)
281
- - LoRA r=32, warmup 30 steps
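The asymmetric clipping in the list above can be sketched for a single token as follows. This assumes `ratio_clip_min` and `ratio_clip_max` are widths below and above 1, so the importance ratio is clamped to [0.8, 1.28]. That reading, and the function name `clipped_surrogate`, are assumptions rather than something confirmed by the NVIDIA recipe.

```python
def clipped_surrogate(ratio, advantage, clip_min=0.2, clip_max=0.28):
    """PPO-style clipped objective for one token.

    ratio: pi_new(token) / pi_old(token)
    Assumption: the ratio is clamped to [1 - clip_min, 1 + clip_max].
    """
    clipped = min(max(ratio, 1.0 - clip_min), 1.0 + clip_max)
    # Pessimistic bound: take the smaller of the raw and clipped terms.
    return min(ratio * advantage, clipped * advantage)

# A positive advantage stops being rewarded once the ratio exceeds 1.28.
gain_capped = clipped_surrogate(1.5, 1.0)
# A negative advantage is floored once the ratio falls below 0.8.
loss_floored = clipped_surrogate(0.5, -1.0)
```

The training loss would be the negative of this value averaged over tokens. The wider upper bound lets the policy raise the probability of good tokens slightly more than it may lower them.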
282
-
283
- ### Two-process design
284
- Training process (.venv-train) holds HF model + LoRA optimizer.
285
- Generation process (.venv-vllm) holds vLLM.
286
- Communicate via checkpoint files: train → save LoRA → vLLM reload → generate → train...
287
-
288
- See `docs/GRPO_V8_ONPOLICY_PLAN.md` for complete design.
289
-
290
- ---
291
-
292
- ## 10) Key Technical Documents
293
-
294
- | Doc | Summary |
295
- |---|---|
296
- | `docs/GRPO_V3_POSTMORTEM.md` | 6 gaps that broke off-policy GRPO |
297
- | `docs/GRPO_V4_POSTMORTEM.md` | Why rmsnorm mock silently broke training |
298
- | `docs/TRAINING_PLAN_V5.md` | Full plan to beat base model |
299
- | `docs/TRITON_SSM_SCAN_PLAN.md` | Triton kernel design |
300
- | `docs/EVAL_RESULTS.md` | Full leaderboard + dimension breakdown |
301
-
302
- ---
303
-
304
- *Previous state: `docs/CURRENT_STATE_2026-03-23.md`*
docs/CURRENT_STATE_2026-03-29.md DELETED
@@ -1,150 +0,0 @@
1
- # Current State — Lex Fridman Interviewer Project
2
-
3
- Updated: 2026-03-29 21:30 UTC
4
- Project root: `/home/bobber/lex-ft`
5
- HF docs target: `bobber/lex-fridman-interviewer-project`
6
-
7
- ---
8
-
9
- ## Today's Work (2026-03-29) — Full Summary
10
-
11
- ### Major Discoveries
12
-
13
- #### 1. Eval Contamination — 50/50 Held-Out Set is Training Data
14
- The `data/held_out_eval.jsonl` (50 prompts) was built from the same transcript crawl as `interview_segments_v2.jsonl`. Every guest statement in the eval appears verbatim in training data. Every `ref_question` is identical to the training label.
15
-
16
- **Impact:** All previous eval scores (v2, v3, clean) are invalid as measures of generalization. The base model's apparent strength (7.39/10) is partly explained by pretraining on Lex transcripts (public internet) and eval contamination — not pure capability.
17
-
18
- #### 2. eval_v3 Circularity Problem
19
- Even with a clean eval set, `eval_v3` (Claude Opus judge, Lex-style dimensions) is circular:
20
- - Training reward (log-ratio) optimizes toward Lex-style outputs
21
- - Eval measures Lex-style conformance
22
- - Both signals measure the same proxy → cannot detect real quality changes
23
-
24
- #### 3. GRPO Makes Things Marginally Worse (Confirmed by 3 Eval Methods)
25
-
26
- | Eval method | Base | GRPO step_100 | Delta |
27
- |---|---|---|---|
28
- | eval_v3 (cosine, held-out) | 7.39/10 | 7.38/10 | -0.01 |
29
- | eval_functional (vLLM + cosine sim, held-out) | 0.363 | 0.331 | -0.032 |
30
- | **eval_functional_judge (vLLM + Qwen3.5-4B, held-out)** | **0.653** | **0.613** | **-0.040** |
31
-
32
- All three evals agree: GRPO step_100 is worse than base, even on contaminated data.
33
-
34
- Root cause (from per-prompt analysis):
35
- - `uses_guest` dropped 8pp: step_100 references the specific guest statement less
36
- - `probing` dropped 4-8pp: step_100 asks for elaboration instead of probing deeper
37
- - The log-ratio reward pushed toward "average Lex question" (shorter, more archetypal), stripping the contextual specificity that makes questions functionally good
38
-
39
- #### 4. New Canonical Eval: eval_functional_judge.py
40
- Built and validated a domain-agnostic functional eval:
41
- - **Policy generation:** vLLM (Nemotron 4B + optional LoRA), batch mode, ~10s for 25 questions
42
- - **Scoring:** Qwen3.5-4B binary judges (3 questions per prompt)
43
- - `on_topic`: Is the question about the same subject?
44
- - `uses_guest`: Does it reference the guest's specific words/concepts?
45
- - `probing`: Does it probe deeper, not just ask for repetition?
46
- - **Score:** mean of 3 binary votes, normalized 0-1
47
- - **Validated:** SHARP=3/4, GENERIC=2/4, OFFTOPIC=1/4, RESTATE=2/4 across all domains including niche technical
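The scoring rule above can be written out in a few lines. This is a sketch, and the function names are hypothetical rather than taken from `eval_functional_judge.py`. Because each prompt's score is the mean of three binary votes, the aggregate score equals the mean of the three per-dimension pass rates, which can be checked against the leaderboard rates reported in this document.

```python
def judge_score(on_topic: bool, uses_guest: bool, probing: bool) -> float:
    """Per-prompt score: fraction of the three binary judges that voted YES."""
    votes = [on_topic, uses_guest, probing]
    return sum(votes) / len(votes)

def aggregate(on_topic_rate: float, uses_guest_rate: float, probing_rate: float) -> float:
    """Mean per-prompt score equals the mean of the three pass rates."""
    return (on_topic_rate + uses_guest_rate + probing_rate) / 3

# Pass rates from the leaderboard: base 68/48/80, GRPO step_100 68/40/76.
base = aggregate(0.68, 0.48, 0.80)      # about 0.653
step_100 = aggregate(0.68, 0.40, 0.76)  # about 0.613
```

Both values match the reported judge scores, so the headline number carries no information beyond the three pass rates. Per-dimension deltas are the more useful diagnostic.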
48
-
49
- **Why Qwen3.5-4B works (0.8B doesn't):**
50
- - 0.8B: logit gap Yes/No = ~1.4 (no real discrimination)
51
- - 0.8B with CoT: generates "Thinking Process..." template, doesn't answer
52
- - 4B: clean YES/NO with `enable_thinking=False`, correct discrimination across domains
53
-
54
- **Venv:** `.venv-vllm` (Python 3.12) — upgraded transformers to 5.3.0 (vllm warns but works)
55
-
56
- ```bash
57
- # Canonical eval command:
58
- cd /home/bobber/lex-ft && source .venv-vllm/bin/activate
59
- python -u eval_functional_judge.py \
60
- --model base \
61
- --model2 results/grpo_v8/ckpt_step_100 \
62
- --n 25 \
63
- --output results/functional_judge_base_vs_step100_v2.json
64
- ```
65
-
66
- #### 5. Small Model Judge Research
67
- Tested Qwen3.5-0.8B as decomposed binary judge — failed:
68
- - Forced-choice (next-token logit sampling): ~75/25 split regardless of content, no discrimination
69
- - With CoT (`enable_thinking=True`): generates "Thinking Process" template, no final YES/NO
70
- - With `enable_thinking=False`: correct format but inverted judgments (RESTATE scores higher than SHARP)
71
- - Root cause: 0.8B can't model "what's absent from text" — required for novelty judgment
72
-
73
- **Threshold:** 4B is the minimum for reliable interview question quality judgment.
74
-
75
- #### 6. Training Data Quality Analysis
76
- Of 17,778 (guest → Lex question) pairs in training data:
77
- - **6,460 (36%)** "lazy" questions: <8 words or specificity < 0.02 (e.g. "How do you approach that?")
78
- - **7,072 (40%)** real questions: end in `?`, 5-60 words
79
- - **1,287 (7%)** "sharp" questions: >15 words, specificity > 0.1 (reference guest's specific content)
80
-
81
- The training set has a 5:1 ratio of lazy to sharp questions — we've been training on noise.
82
-
83
- #### 7. GRPO v11 Training (2 runs)
84
- Both used reward_v11 (info-gain via Qwen 0.5B sim), resumed from GRPO v8 step_100:
85
- - Run 1: 148 steps, saved ckpts at step_50 + step_100
86
- - Run 2: 108 steps (killed by gateway), no new ckpt
87
- - Reward positive throughout (1.1-1.4/5), no collapse
88
- - But functional eval confirms: didn't improve quality
89
-
90
- ---
91
-
92
- ## Eval Leaderboard (as of 2026-03-29, functional judge — canonical going forward)
93
-
94
- | Model | Judge Score | on_topic | uses_guest | probing | Notes |
95
- |-------|-------------|----------|------------|---------|-------|
96
- | **Base Nemotron 4B** | **0.653 ± 0.333** | 68% | 48% | 80% | ← best |
97
- | GRPO v11 step_100 | 0.613 ± 0.349 | 68% | 40% | 76% | log-ratio reward |
98
-
99
- **Note:** Scores are on contaminated held-out set. True generalization performance unknown until clean eval set is built.
100
-
101
- ---
102
-
103
- ## Root Cause Summary: Why Nothing Has Beaten Base
104
-
105
- | Approach | Why it failed |
106
- |---|---|
107
- | LoRA SFT (v1, v2) | Pattern matches surface; suppresses base model reasoning |
108
- | Full SFT (v3, v4, v5) | Same; base model CoT generates better questions than SFT patterns |
109
- | Off-policy GRPO (v3) | Generator ≠ learner — fundamentally broken |
110
- | On-policy GRPO v4–v7 | reward_v8 (heuristic) gamed at step ~50 |
111
- | On-policy GRPO v8/v11 | reward_v10/v11 (log-ratio) is "average Lex" → strips specificity |
112
-
113
- **The real bottleneck:** Training data has 5:1 lazy-to-sharp ratio. Reward signal optimizes toward the modal (average) output, not the exceptional one. The base model generates both; training regresses toward the mean.
114
-
115
- ---
116
-
117
- ## Next Step: Data Curation
118
-
119
- **Plan:**
120
- 1. Filter existing 7,580 training segments → keep only the 1,287 "sharp" (guest → question) pairs
121
- 2. Score those pairs with `eval_functional_judge` to get ground-truth quality ranking
122
- 3. SFT on top-tier pairs only (target: ~500 highest-scoring)
123
- 4. Eval with `eval_functional_judge` vs base — first clean signal
124
-
125
- **Why this can beat base:**
126
- - SFT on base model outputs → ceiling is base
127
- - SFT on curated *ground truth human Lex* at his best → ceiling is Lex's best questions
128
- - Lex's best questions score 0.368 mean on info-gain (validated) — above base model's 0.363
129
-
130
- **Build a genuinely clean eval set** (separate task):
131
- - Crawl ~10 recent Lex episodes NOT in the 113-episode training set
132
- - Use as held-out eval going forward
133
-
134
- ---
135
-
136
- ## Infrastructure Notes
137
-
138
- | File | Purpose | Status |
139
- |------|---------|--------|
140
- | `eval_functional_judge.py` | Canonical eval — vLLM + Qwen3.5-4B judges | ✅ Production ready |
141
- | `eval_functional_vllm.py` | Alt eval — vLLM + cosine sim (weaker) | ✅ Works but deprecated |
142
- | `eval_clean.py` | Old eval — contaminated held-out, eval_v3 scorer | ❌ Retire |
143
- | `eval_judge_test.py` | Small model judge research | Archive |
144
- | `.venv-vllm` | Canonical Python env (transformers 5.3.0 + vllm 0.18.0) | ✅ Use this |
145
- | `results/functional_judge_base_vs_step100_v2.json` | Latest eval results | ✅ Ground truth |
146
-
147
- ---
148
-
149
- *Previous state: earlier version of this file (2026-03-29 17:00 UTC)*
150
- *Created: 2026-03-29 21:30 UTC*
docs/CURRENT_STATE_2026-03-30-evening.md DELETED
@@ -1,81 +0,0 @@
1
- # Current State — Lex Fridman Interviewer Project
2
- *Updated: 2026-03-30 19:19 UTC*
3
-
4
- ---
5
-
6
- ## Eval Leaderboard (functional judge — canonical)
7
-
8
- | Rank | Model | Score | on_topic | uses_guest | probing | Notes |
9
- |------|-------|-------|----------|------------|---------|-------|
10
- | 🥇 | **LoRA v1** (r=64, 1ep, original data) | **0.733** | 72% | 56% | 92% | Best — first to beat base |
11
- | 2 | SFT v5 (LoRA r≈16, 3ep) | 0.667 | 76% | 60% | 64% | Probing damaged |
- | 3 | Base Nemotron 4B | 0.653 | 68% | 48% | 80% | Pretrained baseline |
- | 4 | LoRA v2 (filtered+upsampled) | 0.640 | 64% | 48% | 80% | No improvement (flat) |
14
- | 5 | GRPO v11 step_100 | 0.613 | 68% | 40% | 76% | reward_v11 anti-correlated |
15
-
16
- **Bottleneck:** uses_guest at 56% (need 70%+). Data interventions have failed.
17
-
18
- ---
19
-
20
- ## What We've Tried Today (2026-03-30)
21
-
22
- | Experiment | Hypothesis | Result | Lesson |
23
- |---|---|---|---|
24
- | SFT v5 → full fine-tune | More params = better | 0.667, probing -16pp | Unsloth fell back to LoRA r≈16 silently |
25
- | LoRA v1 (r=64, LR=2e-4, 1ep) | Correct LoRA config | **0.733 ✅** | First win over base |
26
- | Echo-targeted prompt gen | "MUST reference words" prompt | uses_guest -8pp | Base model template bias can't be prompted away |
27
- | reward_v11 correlation test | Info-gain targets uses_guest | -0.098 anti-correlation | reward_v11 rewards genericity, not specificity |
28
- | LoRA v2 (filtered+upsampled) | Template contamination is root cause | 0.640, no change | Data-side interventions can't fix weight-level priors |
29
-
30
- ---
31
-
32
- ## Root Cause Understanding
33
-
34
- The uses_guest gap (48%→56%→stuck) is a **weight-level prior in the Mamba SSM layers**:
35
-
36
- 1. The template prior (`P("How do you"|context)`) lives in frozen Mamba-2 layers (38/42 layers)
37
- 2. LoRA can only modify 4 attention layers (1.01% of params)
38
- 3. SFT can only ADD positive signal — cannot SUBTRACT the template prior
39
- 4. Data filtering removes positive template examples but the prior persists
40
- 5. **Only RL (GRPO) can directly suppress the prior** via negative advantage signal
41
-
42
- ## Next Step: GRPO with reward_v12 from LoRA v1
43
-
44
- ### Why GRPO now
45
- - LoRA v1 at 0.733 gives a stronger starting point than base (0.653)
46
- - reward_v12 is validated ✅ (HIGH=1.000 > LOW=0.606, gates work)
47
- - GRPO gradient flows through ALL 42 layers — can suppress Mamba template prior
48
- - The model already generates non-template openers ~8% of the time — seeds to amplify
49
-
50
- ### reward_v12 design
51
- ```
52
- reward = ug^0.67 × pr^0.33 + lexical_bonus
53
- ```
54
- - `ug` = `log P(YES)/P(NO)` from uses_guest judge (continuous)
55
- - `pr` = `log P(YES)/P(NO)` from probing judge (continuous)
56
- - `lexical_bonus` = vocab overlap fraction (fast, no model needed)
57
- - Hard gate: `passes_structural_check` (question mark, min length, no collapse patterns)
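The formula above does not say how the judge log-odds are mapped before exponentiation, and a raw log-odds can be negative, which a fractional power cannot handle. The sketch below assumes each log-odds is first squashed to P(YES) in (0, 1) with the logistic function and that a failed structural gate yields zero reward. Both choices, and the names `yes_probability` and `reward_sketch`, are assumptions and not the contents of `reward_v12.py`.

```python
import math

def yes_probability(log_odds: float) -> float:
    """Map log P(YES)/P(NO) to P(YES) in (0, 1) via the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def reward_sketch(ug_log_odds, pr_log_odds, lexical_bonus, passes_gate):
    """Weighted geometric mean of the two judges plus a lexical bonus."""
    if not passes_gate:
        return 0.0  # assumed behavior of the hard structural gate
    ug = yes_probability(ug_log_odds)
    pr = yes_probability(pr_log_odds)
    return (ug ** 0.67) * (pr ** 0.33) + lexical_bonus

# uses_guest carries twice the weight of probing in the exponent,
# so a confident uses_guest verdict moves the reward more.
favor_uses_guest = reward_sketch(2.0, 0.0, 0.0, True)
favor_probing = reward_sketch(0.0, 2.0, 0.0, True)
```

Because the two terms are multiplied rather than summed, a near-zero verdict on either judge pulls the whole reward toward the lexical bonus alone.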
58
-
59
- ### GRPO stack
60
- - **Model loading**: Unsloth `FastLanguageModel` (kernel patching, chunk_size=64 for GB10)
61
- - **LoRA adapter**: `lora/sft-lora-v1` as starting checkpoint
62
- - **Training**: trl `GRPOTrainer` (patched for transformers 5.x compat)
63
- - **Reward**: `reward_v12.py` (Qwen3.5-4B judge, batched)
64
- - **Infrastructure**: systemd-run --user (survives gateway restarts)
65
-
66
- ### Files
67
- - `reward_v12.py` — validated reward function
68
- - `lora/sft-lora-v1/` — starting checkpoint (0.733)
69
- - `data/sft_v6_train.jsonl` — filtered dataset (prompts only, for GRPO rollouts)
70
-
71
- ---
72
-
73
- ## Infrastructure Lessons (today)
74
-
75
- | Issue | Fix | Rule |
76
- |---|---|---|
77
- | Gateway OOM kills training | `systemd-run --user` | Always launch training as systemd service |
78
- | `torch_empty_cache_steps=250` causes VRAM spike | Set to 10 | Always set in TrainingArguments |
79
- | `use_gradient_checkpointing=False` → VRAM leak | `"unsloth"` GC | Always enable for LoRA training |
80
- | Training not resumable | `resume_from_checkpoint=latest_ckpt` | Always checkpoint + auto-resume |
81
-
docs/CURRENT_STATE_2026-03-30.md DELETED
@@ -1,159 +0,0 @@
1
- # Current State — Lex Fridman Interviewer Project
2
-
3
- Updated: 2026-03-30 14:10 UTC
4
- Project root: `/home/bobber/lex-ft`
5
- HF docs target: `bobber/lex-fridman-interviewer-project`
6
-
7
- ---
8
-
9
- ## Today's Work (2026-03-30) — Full Summary
10
-
11
- ### 1. SFT Retrospective — 4 Attempts (v1–v4)
12
-
13
- All used 201 curated pairs. Results: base=0.653, SFT=0.467 (worse).
14
-
15
- Root causes identified:
16
- - v1: mamba_ssm mock breaks backprop (grad_norm 200, loss stuck at 34)
17
- - v2: Native transformers 5.3.0 — no compiled SSM kernels (180s/step naive fallback)
18
- - v3: Triton patch + mock — wrong starting loss (56 vs 28), crashed at step 25 (`_get_tied_weight_keys` bug)
19
- - v4: Real mamba_ssm via LD_LIBRARY_PATH + Triton patch — completed but 201 pairs too few (underfitting, loss 21, model degrades)
20
-
21
- Key lesson: **3.97B params / 201 pairs ≈ 20M params per example, which is wildly unstable**. Need 1,000-5,000+ pairs.
22
-
23
- ---
24
-
25
- ### 2. Data Expansion
26
-
27
- **Crawled all 225 Lex transcript URLs → 114 unique episodes** (the 225 URLs included duplicates from pagination). No new episodes available.
28
-
29
- **Augmented with base model generation:**
30
- - 3,480 unique real Lex pairs (from 114 episodes, structural filter)
31
- - Generated 3 completions × 3,480 guests = 10,440 candidates
32
- - Structural filter → 9,364 total pairs saved to `data/lex_pairs_10k.jsonl`
33
-
34
- **Judged with vLLM Qwen3.5-4B (batch mode, ~34 min for 9,364 × 3 = 28,092 queries):**
35
- - score=1.0 (3/3 judges): 4,772 pairs (51%)
36
- - score=0.67 (2/3 judges): 1,768 pairs (19%)
37
- - score=0.33: 1,497 pairs (16%)
38
- - score=0.00: 1,327 pairs (14%)
39
-
40
- **Training set: `data/sft_v5_train.jsonl` — 4,772 perfect-score pairs**
41
- - Real Lex: 697 | Generated: 4,075
42
- - Avg score: 1.0 (by definition)
43
-
44
- ---
45
-
46
- ### 3. Venv Infrastructure Fix
47
-
48
- **Problem:** `.venv-train` had torch CPU-only → Unsloth couldn't initialize GPU.
49
-
50
- **Fix sequence:**
51
- 1. Install torch 2.10.0+cu130 (CUDA build) into `.venv-train`
52
- 2. Recompile `mamba_ssm` from source against new torch (`.so` had ABI mismatch)
53
- 3. Install `unsloth` 2026.3.17
54
-
55
- **Result: `.venv-train` now has:**
56
- - `torch 2.10.0+cu130` — CUDA enabled
57
- - `unsloth 2026.3.17` — memory optimizations, 2x faster training
58
- - `mamba_ssm 2.3.1` — **real compiled Triton kernels** (no mock needed)
59
- - Requires: `LD_LIBRARY_PATH=/path/to/.venv-train/torch/lib` (set in launch script)
60
-
61
- **Key insight:** No mock needed with real mamba_ssm. No `patch_mamba_layers` needed. Pure HF Trainer + Unsloth + real kernels.
62
-
63
- ---
64
-
65
- ### 4. SFT v5 — LoRA r≈16 (COMPLETED, 2026-03-30)
66
-
67
- > ⚠️ **Naming correction:** Despite `full_finetuning=True` being set, Unsloth silently fell back
68
- > to LoRA for NemotronH. Training log showed **10.1M / 2.66B (0.38%) trainable**, equivalent to
69
- > **LoRA r≈16**. This is NOT a full fine-tune. See `docs/LORA_V1_ANALYSIS.md` for details.
70
-
71
- **Script:** `scripts/train_sft_v5.py`
72
- **W&B run:** `lex-sft-v5-4k-bnb8bit` → https://wandb.ai/bobber-cheng/lex-interviewer/runs/udqlwz88
73
-
74
- **Config (actual):**
75
- | Parameter | Value |
76
- |-----------|-------|
77
- | Data | `data/sft_v5_train.jsonl` (4,772 pairs) |
78
- | Architecture | **LoRA r≈16** (Unsloth fallback — NOT full fine-tune) |
79
- | Trainable params | **10.1M / 2.66B (0.38%)** |
80
- | Framework | Unsloth + HF Trainer |
81
- | Epochs | 3 |
82
- | LR | 1e-5 ⚠️ (too low for LoRA — designed for full fine-tune) |
83
- | Batch | 2 × 8 = 16 effective (BNB 8-bit Adam) |
84
- | Max seq | 512 |
85
- | Steps | 897 total |
86
- | Final loss | 1.86 |
87
-
88
- **Result: 0.667 functional score — marginal vs base 0.653.**
89
- Probing DAMAGED: 80% → 64% (memorized question surface format over 3 epochs).
90
-
91
- **Why it underperformed:**
92
- 1. Low rank (r≈16): insufficient capacity for nuanced task
93
- 2. LR=1e-5 too low for LoRA (correct for full fine-tune, not for adapters)
94
- 3. 3 epochs: memorized surface pattern, destroyed depth
95
-
96
- ---
97
-
98
- ## Venv Reference
99
-
100
- | Venv | Python | Torch | CUDA | mamba_ssm | Unsloth | Use for |
101
- |------|--------|-------|------|-----------|---------|---------|
102
- | `.venv-train` | 3.12 | 2.10+cu130 | ✅ | ✅ Real (compiled) | ✅ 2026.3.17 | **SFT training** |
103
- | `.venv-vllm` | 3.12 | 2.9+cu130 | ✅ | ❌ x86 .so | ❌ | **vLLM inference/eval** |
104
- | `routangseng/.venv` | 3.13 | 2.9+cu130 | ✅ | ❌ broken .so | ✅ | Legacy/Qwen training |
105
-
106
- **Critical:** `.venv-train` requires `LD_LIBRARY_PATH` set before launch:
107
- ```bash
108
- TORCH_LIB=/home/bobber/lex-ft/.venv-train/lib/python3.12/site-packages/torch/lib
109
- export LD_LIBRARY_PATH="$TORCH_LIB:${LD_LIBRARY_PATH:-}"
110
- source /home/bobber/lex-ft/.venv-train/bin/activate
111
- ```
112
-
113
- ---
114
-
115
- ## Pipeline Reference
116
-
117
- ```
118
- Transcripts (114 eps)
119
- ↓ scripts/crawl_transcripts.py
120
- data/transcripts/*.json
121
- ↓ scripts/augment_with_base_model.py (vLLM .venv-vllm)
122
- data/lex_pairs_10k.jsonl (9,364 pairs)
123
- ↓ scripts/judge_vllm.py (vLLM + Qwen3.5-4B, .venv-vllm)
124
- data/lex_pairs_10k_judged.jsonl (scored)
125
- ↓ filter judge_score==1.0
126
- data/sft_v5_train.jsonl (4,772 perfect pairs)
127
- ↓ scripts/train_sft_v5.py (.venv-train + Unsloth)
128
- models/sft-v5/
129
- ↓ eval_functional_judge.py (vLLM + Qwen3.5-4B, .venv-vllm)
130
- results/
131
- ```
132
-
133
- ---
134
-
135
- ## Eval Infrastructure
136
-
137
- | File | Purpose | Venv |
138
- |------|---------|------|
139
- | `eval_functional_judge.py` | Canonical eval — vLLM + Qwen3.5-4B 3-judge | .venv-vllm |
140
- | `scripts/judge_vllm.py` | Batch judge for dataset scoring | .venv-vllm |
141
- | `scripts/augment_with_base_model.py` | Generate + filter training data | .venv-vllm |
142
- | `scripts/train_sft_v5.py` | SFT training | .venv-train |
143
- | `run_sft_v5.sh` | Launch script (sets LD_LIBRARY_PATH) | .venv-train |
144
-
145
- ---
146
-
147
- ## Key Results So Far
148
-
149
- | Model | Judge Score | on_topic | uses_guest | probing | Notes |
150
- |-------|-------------|----------|------------|---------|-------|
151
- | Base Nemotron 4B | 0.653 ± 0.333 | 68% | 48% | 80% | Baseline |
152
- | GRPO step_100 | 0.613 | — | — | — | log-ratio reward, slightly worse |
153
- | SFT curated v4 (201 pairs) | 0.467 | — | — | — | Too few pairs, catastrophic forgetting |
154
- | SFT v5 **(LoRA r≈16, 3ep, LR=1e-5)** | 0.667 | 76% | 60% | **64%** ⚠️ | Unsloth fallback; probing damaged |
155
- | **LoRA v1 (r=64, 1ep, LR=2e-4)** | **0.733** | 72% | 56% | **92%** ✅ | First to beat base |
156
-
157
- ---
158
-
159
- *Created: 2026-03-30 04:35 UTC*
docs/DATA_CURATION_PLAN.md DELETED
@@ -1,131 +0,0 @@
1
- # Data Curation Plan — Sharp Lex Questions
2
-
3
- Created: 2026-03-29
4
- Status: Next step (not yet started)
5
-
6
- ---
7
-
8
- ## Motivation
9
-
10
- Training data quality analysis revealed:
11
- - 17,778 total (guest → Lex question) pairs
12
- - **6,460 (36%) "lazy"**: <8 words or specificity <0.02 ("How do you approach that?")
13
- - **1,287 (7%) "sharp"**: >15 words, specificity >0.1 (reference guest's specific content)
14
- - Ratio: 5:1 lazy to sharp — we've been drowning signal in noise
15
-
16
- Every fine-tuning attempt to date trained on the full mixed dataset. The model learned "average Lex" (which is worse than the base model's contextual generation).
17
-
18
- **Key insight:** SFT on curated *Lex at his best* has a higher ceiling than SFT on base model outputs. Lex's best questions are real human ground truth — they can exceed what the base model generates spontaneously.
19
-
20
- ---
21
-
22
- ## Why This Can Beat Base
23
-
24
- - SFT on base model outputs → ceiling is base model
25
- - SFT on reward-filtered base outputs → still ceiling is base
26
- - SFT on curated ground truth (sharp Lex questions) → ceiling is Lex's best
27
-
28
- Validated: real sharp Lex questions score ~0.37 mean info-gain, vs base model 0.36. The margin is small but real, and with proper curation we can select the top quintile.
29
-
30
- ---
31
-
32
- ## Step 1: Filter Training Data → Sharp Questions
33
-
34
- ```python
35
- # Criteria for "sharp" question:
36
- # 1. Ends in '?'
37
- # 2. 10-50 words (not too short = generic, not too long = rambling)
38
- # 3. Specificity score > 0.10 (shares >10% of guest's 5+ char words)
39
- # 4. Not a statement ("Right.", "Exactly.", "So...")
40
- # 5. Not a filler ("How does that make you feel?", "Tell me more...")
41
-
42
- # Expected yield: ~1,287 pairs (about 7% of the 17,778 pairs)
- # drawn from 7,580 training segments; many segments have multiple turns
44
- ```
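Criteria 1 to 3 above can be sketched as a small filter. This is a hypothetical illustration, not the planned `scripts/curate_sharp_questions.py`. It assumes "specificity" means the fraction of the guest's words of 5+ characters that reappear in the question, and it does not implement the statement and filler checks (criteria 4 and 5).

```python
import re

def content_words(text, min_len=5):
    """Lowercased words with at least min_len characters."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if len(w) >= min_len}

def specificity(guest, question):
    """Fraction of the guest's content words that the question reuses."""
    guest_words = content_words(guest)
    if not guest_words:
        return 0.0
    return len(guest_words & content_words(question)) / len(guest_words)

def is_sharp(guest, question, min_words=10, max_words=50, min_spec=0.10):
    """Apply criteria 1-3: ends in '?', 10-50 words, specificity > 0.10."""
    q = question.strip()
    n_words = len(q.split())
    return (q.endswith("?")
            and min_words <= n_words <= max_words
            and specificity(guest, q) > min_spec)

guest = "Consciousness might emerge from recursive prediction inside the cortex"
sharp = ("If consciousness emerges from recursive prediction, "
         "what would falsify that claim in a simple organism?")
lazy = "How do you approach that?"
```

Here `is_sharp(guest, sharp)` holds because the question reuses three of the guest's seven content words, while the lazy question fails on length alone. Exact-match overlap misses inflections such as "emerge" versus "emerges", so a stemmed comparison may be worth testing before fixing the 0.10 threshold.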
45
-
46
- **Script:** `scripts/curate_sharp_questions.py` (to build)
47
-
48
- ---
49
-
50
- ## Step 2: Score with eval_functional_judge
51
-
52
- Run the 3-judge eval on the curated pairs to rank them:
53
-
54
- ```python
55
- # For each (guest, sharp_question) pair:
56
- # - Run on_topic, uses_guest, probing judges
57
- # - Keep top 500 by score
58
- # - This gives us the "Lex at his best" dataset
59
-
60
- # Expected: ~500-700 pairs scoring 3/3 judges
61
- ```
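A rough sketch of the ranking step follows. It assumes each pair already carries the three binary judge verdicts under a `votes` key; that layout and the function names are hypothetical and only illustrate the "keep top 500 by score" rule.

```python
JUDGES = ("on_topic", "uses_guest", "probing")


def judge_score(votes: dict) -> float:
    """Mean of the three binary judge verdicts, in [0, 1]."""
    return sum(bool(votes[j]) for j in JUDGES) / len(JUDGES)


def top_k(pairs: list, k: int = 500) -> list:
    """Return the k highest-scoring pairs; ties keep their original order."""
    return sorted(pairs, key=lambda p: judge_score(p["votes"]), reverse=True)[:k]
```

Because `sorted` is stable, pairs with equal scores stay in crawl order, so the cut at `k` is deterministic.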
62
-
63
- **Note:** This is different from eval — we're scoring HUMAN questions (Lex's), not model-generated ones. These are the gold examples we want the model to learn from.
64
-
65
- ---
66
-
67
- ## Step 3: SFT on Curated Data
68
-
69
- ```bash
70
- # Use existing full-SFT pipeline (train_full_sft_v3_optimized.py)
71
- # Dataset: ~500 top-scored (guest, Lex_question) pairs
72
- # Epochs: 3-5 (small dataset needs more passes)
73
- # LR: 1e-5 (conservative — small dataset)
74
- # Max seq len: 512 (questions are short)
75
- ```
76
-
77
- Key difference from all previous SFT runs:
78
- - Previous: all 7,580 segments (5:1 noise ratio)
79
- - Now: ~500 curated sharp pairs (near-100% signal)
80
-
81
- ---
82
-
83
- ## Step 4: Eval with eval_functional_judge
84
-
85
- ```bash
86
- python -u eval_functional_judge.py \
87
- --model checkpoints/sft_curated/checkpoint-best \
88
- --model2 base \
89
- --n 25
90
- ```
91
-
92
- **Success criterion:** Judge score > 0.653 (base model baseline)
93
- **If successful:** First fine-tuned model to beat base in project history
94
-
95
- ---
96
-
97
- ## Step 5 (if Step 4 succeeds): Build Clean Eval Set
98
-
99
- The contamination problem makes absolute scores meaningless. Once we have a model that beats base on the contaminated set, we need to verify it generalizes:
100
-
101
- ```python
102
- # Crawl 10 recent Lex episodes not in training (post-2024)
103
- # Extract ~50 guest utterances
104
- # Use as held-out eval going forward
105
- ```
106
-
107
- ---
108
-
109
- ## Timeline Estimate
110
-
111
- | Step | Time | Notes |
112
- |------|------|-------|
113
- | Build curation script | 1h | Filter + specificity scoring |
114
- | Run curation + judge scoring | 30 min | Judge 1,287 pairs |
115
- | SFT training | 2-4h | ~500 pairs, 3 epochs |
116
- | Eval | 15 min | eval_functional_judge |
117
- | Total | ~4-6h | |
118
-
119
- ---
120
-
121
- ## Files to Build
122
-
123
- | File | Purpose |
124
- |------|---------|
125
- | `scripts/curate_sharp_questions.py` | Filter + score training data |
126
- | `data/sharp_questions_curated.jsonl` | Output: top ~500 pairs |
127
- | `run_sft_curated.sh` | Training launch script |
128
-
129
- ---
130
-
131
- *Created: 2026-03-29 21:30 UTC*
docs/EVAL_FRAMEWORK_2026-03-29.md DELETED
@@ -1,148 +0,0 @@
1
- # Eval Framework — Lex Fridman Interviewer
2
-
3
- Updated: 2026-03-29
4
- Status: Complete rewrite after contamination discovery + judge validation
5
-
6
- ---
7
-
8
- ## TL;DR
9
-
10
- **Use `eval_functional_judge.py` for all future evals.**
11
- Old evals (eval_v2, eval_v3, eval_clean) are retired — contaminated data + circular signal.
12
-
13
- ---
14
-
15
- ## What Went Wrong With Previous Evals
16
-
17
- ### Problem 1: Eval Contamination
18
- `data/held_out_eval.jsonl` was built from the same 113-episode crawl as training data.
19
- Result: 50/50 held-out prompts appear verbatim in `interview_segments_v2.jsonl`.
20
- All eval scores before 2026-03-29 evening are invalid as generalization measures.
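A contamination check of this kind can be sketched in a few lines. The function name and the plain-list inputs are assumptions for illustration; the real files are JSONL and their field names are not shown here.

```python
def count_contaminated(held_out_prompts: list, training_texts: list) -> tuple:
    """Count held-out prompts that appear verbatim (modulo whitespace) in training text."""
    def normalize(text: str) -> str:
        return " ".join(text.split())

    train = [normalize(t) for t in training_texts]
    hits = sum(
        1 for prompt in held_out_prompts
        if any(normalize(prompt) in t for t in train)
    )
    return hits, len(held_out_prompts)
```

A result of `(50, 50)` would correspond to the full overlap described above; any non-zero count suggests the set should not be used as a generalization measure.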
21
-
22
- ### Problem 2: Lex Circularity
23
- `eval_v3` (Claude Opus judge, Lex-style dimensions) measures:
24
- - Philosophical depth in Lex's style
25
- - Curiosity in Lex's style
26
- - Specificity in Lex's style
27
-
28
- Both the training reward (log-ratio: "sounds like Lex") and the eval measure the same proxy.
29
- The base model gets 7.39/10 because it was pretrained on Lex's public transcripts.
30
- Fine-tuning toward this eval cannot exceed pretraining — it just reinforces the same surface patterns.
31
-
32
- ### Problem 3: Qwen 0.5B Simulator Fails on Niche Topics
33
- `eval_functional.py` used Qwen 0.5B to simulate guest responses, then cosine similarity.
34
- Works for common topics (politics, philosophy). Fails for technical content (SSM/BF16/Mamba):
35
- - Qwen 0.5B doesn't know what "exp(cumsum(A)) underflow" means
36
- - Generates plausible-sounding but semantically wrong responses
37
- - Expert question scores LOWER than generic question (inverted signal)
38
-
39
- ---
40
-
41
- ## The New Canonical Eval: eval_functional_judge.py
42
-
43
- ### Architecture
44
-
45
- ```
46
- held_out_eval.jsonl
47
-
48
-
49
- [vLLM: Nemotron 4B] ──batch 25 prompts──► questions (10s)
50
-
51
- ▼ (del llm, empty_cache)
52
- [Qwen3.5-4B Judge] ──3 binary judges per question──► scores
53
-
54
-
55
- score = mean(on_topic, uses_guest, probing) ∈ [0, 1]
56
- ```
57
-
58
- ### The 3 Judges
59
-
60
- | Judge | Prompt summary | What it catches |
61
- |-------|---------------|-----------------|
62
- | `on_topic` | Is question about same subject as guest? | Off-topic tangents |
63
- | `uses_guest` | Does it reference guest's specific words/concepts? | Generic questions that ignore what was said |
64
- | `probing` | Does it probe deeper, not just ask for repetition? | "Can you say more about that?" questions |
65
-
66
- ### Why These 3 Judges
67
-
68
- From per-prompt analysis of base vs GRPO step_100:
69
- - The biggest quality differences were in `uses_guest` (-8pp) and `probing` (-4 to -8pp)
70
- - `on_topic` was stable — both models stay on subject
71
- - These 3 together correctly rank: SHARP > RESTATE > GENERIC > OFFTOPIC
72
-
73
- ### Validation Results
74
-
75
- | Question type | Score | on_topic | uses_guest | probing |
76
- |---|---|---|---|---|
77
- | SHARP (probing, specific) | 3/3 | Y | Y | Y |
78
- | GENERIC (on topic, not specific) | 1/3 | Y | N | N |
79
- | RESTATE (asks for repetition) | 2/3 | Y | Y | N |
80
- | OFFTOPIC | 0/3 | N | N | N |
81
-
82
- Tested across: general AI/LM topics, niche technical (SSM/BF16), political/historical.
83
- 4B handles all domains correctly. 0.8B fails (inverted judgments).
84
-
85
- ---
86
-
87
- ## Running Evals
88
-
89
- ### Standard comparison
90
- ```bash
91
- cd /home/bobber/lex-ft && source .venv-vllm/bin/activate
92
- python -u eval_functional_judge.py \
93
- --model base \
94
- --model2 results/grpo_v8/ckpt_step_100 \
95
- --n 25 \
96
- --output results/my_eval.json
97
- ```
98
-
99
- ### Single model
100
- ```bash
101
- python -u eval_functional_judge.py --model base --n 25
102
- python -u eval_functional_judge.py --model results/grpo_v8/ckpt_step_100 --n 25
103
- ```
104
-
105
- ### Runtime
106
- - Base model only: ~8 min (2 min vLLM load + 10s gen + 5 min judging)
107
- - Two model comparison: ~15 min total (vLLM reloads for step_100 + LoRA)
108
-
109
- ---
110
-
111
- ## Known Limitations
112
-
113
- 1. **Contaminated held-out set** — all 50 prompts are in training data. Use for comparison only; absolute scores don't reflect true generalization.
114
- 2. **Stochastic generation** — temperature=0.7 means re-runs vary slightly. Use n≥25 for signal.
115
- 3. **Judge agreement** — 3 binary votes is coarse. High variance (std ~0.33). Need n≥50 for statistically significant deltas.
116
- 4. **No clean eval set yet** — need to crawl recent episodes not in training data.
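To make limitation 3 concrete, here is a back-of-envelope sketch of how large a score delta must be before it stands out from noise. It assumes the two models' per-prompt scores are independent with the same std (~0.33) and uses a normal approximation; a paired comparison on the same prompts could be tighter, so treat the numbers as a rough upper bound.

```python
import math


def se_of_delta(std: float, n: int) -> float:
    """Standard error of the difference of two independent means (equal std, equal n)."""
    return math.sqrt(2.0) * std / math.sqrt(n)


def min_detectable_delta(std: float, n: int, z: float = 1.96) -> float:
    """Smallest delta distinguishable from zero at roughly 95% confidence."""
    return z * se_of_delta(std, n)


if __name__ == "__main__":
    for n in (25, 50):
        print(n, round(min_detectable_delta(0.33, n), 3))
```

Under these assumptions the threshold is roughly 0.18 at n=25 and 0.13 at n=50, so small deltas between checkpoints should be read with caution at either size.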
117
-
118
- ---
119
-
120
- ## TODO: Build Clean Eval Set
121
-
122
- ```python
123
- # Target: 50 guest utterances from episodes NOT in interview_segments_v2.jsonl
124
- # Criteria:
125
- # - Recent episodes (post 2024, likely post Nemotron pretraining cutoff)
126
- # - Guest domains: mix of technical, political, philosophical, creative
127
- # - Guest utterances: 50-200 words, substantive statements
128
-
129
- # Episodes crawled: 113 (all in training)
130
- # Need: ~10 new episodes → ~50 new eval prompts
131
- ```
132
-
133
- ---
134
-
135
- ## Deprecated Evals
136
-
137
- | File | Why deprecated |
138
- |------|---------------|
139
- | `eval_clean.py` | Contaminated held-out set; eval_v3 scorer (Lex circularity) |
140
- | `eval_functional.py` | Qwen 0.5B sim fails on niche topics |
141
- | `eval_functional_vllm.py` | Cosine sim still weak; replaced by judge |
142
- | `scripts/eval_v2.py` | Lex-style dimensions; circular |
143
- | `scripts/eval_v3.py` | Same |
144
- | `eval_judge_test.py` | Research script; not for production eval |
145
-
146
- ---
147
-
148
- *Created: 2026-03-29 21:30 UTC*
docs/EVAL_RESULTS.md DELETED
@@ -1,319 +0,0 @@
1
- > **Status:** ✅ UPDATED 2026-04-05 — Group judge (Qwen 3.5 27B + Gemma 4 31B majority vote) leaderboard added. Cloud model comparison.
2
-
3
- # Eval Results — Lex Fridman AI Interviewer
4
-
5
- ---
6
-
7
- ## ⚠️ Eval Framework History
8
-
9
- | Period | Eval method | Bias risk | Status |
10
- |---|---|---|---|
11
- | Pre-2026-03-29 | 5-score / 10-score (Claude Opus, Lex-style) | High circularity | Legacy only |
12
- | 2026-03-29 | info-gain functional eval | anti-correlation with uses_guest | Abandoned |
13
- | 2026-03-30+ | 3-judge functional (on_topic × uses_guest × probing) | Low — no Lex style | **Canonical** |
14
- | 2026-04-03+ | Same 3-judge, **thinking-enabled** (`enable_thinking=True` + `reasoning_parser=nemotron_v3`) | Low | **Current canonical** |
15
- | 2026-04-05+ | **Group judge**: Qwen 3.5 27B + Gemma 4 31B majority vote per dimension | Lowest — multi-model | **Current canonical (cross-model)** |
16
-
17
- ---
18
-
19
- ## ═══════════════════════════════════════════════════════════════
20
- ## CURRENT BEST: GRPO v21 — 0.867 (thinking-enabled)
21
- ## ═══════════════════════════════════════════════════════════════
22
-
23
- Adapter: `/home/bobber/lex-ft/lora/grpo-v21`
24
- ONNX: `bobber/lex-interviewer-nemotron-4b-grpo-v21`
25
- Space: `bobber/lex-interviewer-chat`
26
-
27
- ---
28
-
29
- ## Thinking-Enabled Functional Eval Leaderboard (canonical, 2026-04-03)
30
-
31
- *Eval: `eval_functional_judge.py --enable-thinking`, 25 held-out prompts, `enable_thinking=True`, `reasoning_parser=nemotron_v3`*
32
-
33
- | Rank | Model | Score | on_topic | uses_guest | probing | Avg words | Notes |
34
- |------|-------|-------|----------|------------|---------|-----------|-------|
35
- | 🥇 | **GRPO v21** | **0.867 ± 0.231** | 84% | 80% | **96%** | ~13 | **Best ever** |
36
- | 2 | GRPO v22 | 0.813 ± 0.314 | 84% | 72% | 88% | ~10 | Less clipping, more generic |
37
- | 3 | Base Nemotron 4B | 0.760 ± 0.371 | — | — | — | — | Strong baseline |
38
- | 3 | GRPO v23 | 0.760 ± 0.371 | 84% | 68% | 76% | ~10 | reward_v13 from v21, tied base |
39
- | 5 | LoRA v2 native | 0.707 ± 0.331 | 72% | 60% | 84% | ~15 | Best pure SFT |
40
- | 6 | GRPO v24 | 0.693 ± 0.399 | 64% | 68% | 76% | ~16 | reward_v13 from v2-native |
41
-
42
- ---
43
-
44
- ## Group Judge Leaderboard — Cloud + Local (2026-04-05)
45
-
46
- *Eval: `eval_cloud_models.py` + `eval_local_group_judge.py`, 25 held-out prompts, majority vote of Qwen 3.5 27B + Gemma 4 31B per dimension*
47
-
48
- | Rank | Model | Score | on_topic | uses_guest | probing | Avg words | Notes |
49
- |------|-------|-------|----------|------------|---------|-----------|-------|
50
- | 🥇 | **GPT-5.4** | **0.867 ± 0.211** | 92% | 68% | 100% | ~29 | Best cloud model |
51
- | 2 | Gemini 3.1 Pro | 0.840 ± 0.341 | 84% | 80% | 88% | ~25 | |
52
- | 3 | **GRPO v21 (4B)** | **0.787 ± 0.376** | 80% | 72% | 84% | ~16 | **Tied Opus — 4B model** |
53
- | 3 | Claude Opus 4.6 | 0.787 ± 0.364 | 76% | 72% | 88% | ~56 | Verbose (3.5× more words) |
54
-
55
- ### Single-Judge Comparison (Gemma 4 31B only)
56
-
57
- | Rank | Model | Score | on_topic | uses_guest | probing | Avg words |
58
- |------|-------|-------|----------|------------|---------|-----------|
59
- | 1 | GPT-5.4 | 0.973 ± 0.131 | 96% | 96% | 100% | ~29 |
60
- | 2 | Gemini 3.1 Pro | 0.893 ± 0.244 | 92% | 88% | 88% | ~25 |
61
- | 3 | Claude Opus 4.6 | 0.880 ± 0.281 | 84% | 84% | 96% | ~56 |
62
-
63
- *Note: Single Gemma judge is more lenient than group judge. Group judge (majority vote) is the canonical cross-model eval.*
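One way to read "majority vote per dimension" is sketched below. The tie rule is an assumption: a dimension passes only on a strict majority, so with two judges both must say yes. That reading would be consistent with the group judge being stricter than Gemma alone, but the actual scripts may resolve ties differently.

```python
DIMENSIONS = ("on_topic", "uses_guest", "probing")


def group_vote(judge_votes: list) -> dict:
    """Per-dimension verdict: True only if a strict majority of judges said yes."""
    n = len(judge_votes)
    return {
        dim: 2 * sum(bool(votes[dim]) for votes in judge_votes) > n
        for dim in DIMENSIONS
    }


def group_score(judge_votes: list) -> float:
    """Mean of the per-dimension group verdicts, in [0, 1]."""
    verdict = group_vote(judge_votes)
    return sum(verdict.values()) / len(DIMENSIONS)
```

The same function works unchanged if a third judge is added later, at which point a 2-of-3 vote passes a dimension.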
64
-
65
- ---
66
-
67
- ## Non-Thinking Functional Eval Leaderboard (historical, 2026-03-30 – 2026-04-02)
68
-
69
- *Eval: `eval_functional_judge.py`, 25 prompts, no thinking, legacy comparison*
70
-
71
- | Rank | Model | Score | on_topic | uses_guest | probing | Notes |
72
- |------|-------|-------|----------|------------|---------|-------|
73
- | 1 | GRPO v13 | 0.773 | 88% | 52% | 92% | Best before native path |
73
- | 2 | **LoRA v2 native** | **0.760** | 76% | 68% | 80% | Correct GB10 kernel path |
74
- | 2 | GRPO v12 | 0.760 | 72% | 60% | 96% | |
75
- | 4 | Base Nemotron 4B | 0.753 | — | — | — | |
76
- | 5 | LoRA v1 (r=64, 1ep) | 0.733 | 72% | 56% | 92% | First to beat base |
77
- | 6 | GRPO v20/v21 (non-thinking eval) | 0.720 ± 0.336 | — | — | — | Eval had thinking disabled |
78
- | 7 | GRPO v14 | 0.707 | — | 52% | — | Reward misaligned |
79
- | 8 | SFT v5 (LoRA r≈16) | 0.667 | 76% | 60% | 64% ⚠️ | Probing damaged |
80
- | 9 | Base Nemotron 4B | 0.653 | 68% | 48% | 80% | Older run |
81
- | 10 | LoRA v2 (filtered data) | 0.640 | — | — | — | No gain from filtering |
82
- | 11 | GRPO v11 step_100 | 0.613 | — | — | — | info-gain reward failed |
84
-
85
- ---
86
-
87
- ## Training & Model Lineage — Knowledge Graph
88
-
89
- ```
90
- BASE MODEL
91
- └── nvidia/NVIDIA-Nemotron-3-Nano-4B
92
- Architecture: 38 Mamba-2 + 4 Attention layers
93
- Path: models/NVIDIA-Nemotron-3-Nano-4B
94
- Config: config_native.json (native transformers 5.3 path, not trust_remote_code)
95
-
96
- ├── [SFT Phase 1 — OLD BROKEN PATH — pre-2026-04-02]
97
- │ │
98
- │ ├── SFT v1 (Unsloth, LoRA r≈16, LR=1e-5, 3ep)
99
- │ │ Score: 2.10/5 ← catastrophic
100
- │ │
101
- │ ├── SFT v5 "full-ft" → actually LoRA r≈16 (Unsloth silently fell back)
102
- │ │ Score: 0.667 uses_guest=60% probing=64%⚠️ ← probing damaged
103
- │ │
104
- │ ├── LoRA v1 (r=64, alpha=128, LR=2e-4, 1ep, 299 steps)
105
- │ │ Score: 0.733 uses_guest=56% probing=92% ← first to beat base
106
- │ │ Adapter: lora/sft-lora-v1
107
- │ │ │
108
- │ │ └── [GRPO Phase 1 — OLD PATH]
109
- │ │ ├── GRPO v12 (LR=5e-6, reward_v12, 200 steps)
110
- │ │ │ Score: 0.760 uses_guest=60% probing=96%
111
- │ │ │
112
- │ │ ├── GRPO v13 (LR=2e-5, constant, 300 steps)
113
- │ │ │ Score: 0.773 on_topic=88% uses_guest=52%⚠️ probing=92%
114
- │ │ │ ← LR too high, uses_guest regressed
115
- │ │ │
116
- │ │ └── GRPO v14 (reward_v13 geomean, LR=1e-5)
117
- │ │ Score: 0.707 ← reward misaligned
118
- │ │
119
- │ └── LoRA v2 (filtered data, 0% generic openers, 60% real Lex ×6)
120
- │ Score: 0.640 uses_guest=48% ← no gain from data filtering
121
-
122
- ├── [SFT Phase 2 — NATIVE PATH — 2026-04-02+]
123
- │ │ Fix: use native transformers 5.3 NemotronH (cuda_kernels_forward)
124
- │ │ Patches: config validator, MIXER_TYPES, block_type_to_mask for "mlp"
125
- │ │ Train PPL: ~1.29 (vs ~21.9 on broken torch_forward path)
126
- │ │
127
- │ └── LoRA v2 native (r=64, alpha=128, LR=2e-4, 1ep, 12 min)
128
- │ Score: 0.760 uses_guest=68% probing=80% ← best SFT ever
129
- │ Adapter: lora/sft-lora-v2-native
130
- │ Dataset: data/sft_v5_train.jsonl (4,772 pairs, 697 real + 4,075 generated)
131
- │ │
132
- │ └── [GRPO Phase 2 — NATIVE + THINKING — 2026-04-02+]
133
- │ │ Framework: TRL GRPOTrainer + vLLM colocate
134
- │ │ Thinking: enable_thinking=True + reasoning_parser=nemotron_v3
135
- │ │ Reward: reward_v12 (uses_guest×probing geomean + lexical bonus)
136
- │ │
137
- │ ├── GRPO v19 (smoke test, 50 steps, reward_v12)
138
- │ │ reward mean: 0.39 IS ratio: 0.025 ← infrastructure verified
139
- │ │
140
- │ ├── GRPO v20 (200 steps, MAX_NEW_TOKENS=800)
141
- │ │ Score (non-thinking eval): 0.720 ± 0.336
142
- │ │ Clipping: ~50% of steps had ≥1 clipped completion
143
- │ │ ← budget too small, thinking truncated
144
- │ │
145
- │ ├── ★ GRPO v21 (200 steps, MAX_NEW_TOKENS=1600, MAX_SEQ=3072)
146
- │ │ Score (thinking-enabled): 0.867 ± 0.231 ← BEST EVER
147
- │ │ on_topic=84% uses_guest=80% probing=96% avg_words=13
148
- │ │ Clipping: 32.5% of steps, avg_group_std=0.228 ← Goldilocks
149
- │ │ Reward delta: +0.083 (first20=0.631 → last20=0.715)
150
- │ │ W&B: lex-grpo-v21-think-colocate-long / z8lcuut7
151
- │ │ Adapter: lora/grpo-v21
152
- │ │ ONNX: bobber/lex-interviewer-nemotron-4b-grpo-v21 ← deployed
153
- │ │
154
- │ ├── GRPO v22 (200 steps, MAX_NEW_TOKENS=2560, MAX_SEQ=4096, CLIP_PENALTY=0.10)
155
- │ │ Score (thinking-enabled): 0.813 ± 0.314
156
- │ │ on_topic=84% uses_guest=72% probing=88% avg_words=10
157
- │ │ Clipping: 9.5% of steps ← less clipping BUT less contrast
158
- │ │ Reward delta: +0.013 ← much weaker learning signal
159
- │ │ ← larger budget eliminated contrast generators
160
- │ │
161
- │ ├── [GRPO Phase 3 — reward_v13 — 2026-04-03]
162
- │ │ reward_v13: adds meta-spill penalty, generic-opener penalty,
163
- │ │ soft overthinking penalty; no reward for long thinking
164
- │ │
165
- │ ├── GRPO v23 (reward_v13, from grpo-v21, 1600 tok)
166
- │ │ Score (thinking-enabled): 0.760 ± 0.371 ← tied base
167
- │ │ Reward delta: -0.019 ← reward_v13 too strict from strong start
168
- │ │ ← started at local optimum, reward_v13 compressed variance
169
- │ │
170
- │ └── GRPO v24 (reward_v13, from sft-lora-v2-native, 1600 tok)
171
- │ Score (thinking-enabled): 0.693 ± 0.399 ← below base (0.760)
172
- │ Reward delta: -0.018
173
- │ ← reward_v13 not generating sufficient learning signal
174
-
175
- └── [GRPO — OLD FRAMEWORK — pre-native, 2026-03-21–29]
176
- GRPO v3 (off-policy, llama.cpp gen + HF train) → gibberish
177
- GRPO v7/v8/v11 → various failures, documented in memory/2026-03-*.md
178
- ```
179
-
180
- ---
181
-
182
- ## Key Technical Details
183
-
184
- ### Architecture
185
- - **Model:** Nemotron-3-Nano-4B (hybrid Mamba-2 + Attention)
186
- - **Layers:** 38 Mamba-2 + 4 Attention (42 total)
187
- - **LoRA targets:** q/k/v/o_proj, up/down/gate_proj → only touches the 4 attention layers (standard LoRA)
188
- - **LoRA config:** r=64, alpha=128, dropout=0, trainable=40.5M/4.01B (1.01%)
189
-
190
- ### The GB10 Kernel Fix (2026-04-02)
191
- - NVIDIA's HF `modeling_nemotron_h.py` had `is_fast_path_available = False` hardcoded
192
- - This forced naive `torch_forward` SSM scan → PPL ~2126
193
- - Fix: use native transformers 5.3.0 built-in `NemotronHForCausalLM` with 3 patches:
194
- - `configuration_nemotron_h.py`: config validator accepts `"mlp"` block type
195
- - `modeling_nemotron_h.py`: add `MIXER_TYPES["mlp"]` and `block_type_to_mask["mlp"]`
196
- - `config_native.json`: use `layers_block_type` list instead of `hybrid_override_pattern`
197
- - Result: train PPL 1.29 avg (vs 21.86 on broken path)
198
-
199
- ### Thinking-Enabled Inference Stack
200
- - vLLM 0.18.0 + `structured_outputs_config={'reasoning_parser': 'nemotron_v3'}`
201
- - Chat template: `enable_thinking=True` → prompt ends with `<|im_start|>assistant\n<think>\n`
202
- - Output: `reasoning_content` (thinking) + `content` (answer) parsed by vLLM
203
- - Eval uses `max_model_len=3072`, `max_tokens=1600`, `gpu_memory_utilization=0.45`
204
-
205
- ### GRPO Framework
206
- - TRL `GRPOTrainer` + vLLM `vllm_mode="colocate"` (single GPU)
207
- - Group size: 4 completions per prompt
208
- - KL coefficient: β=0.001
209
- - Importance sampling correction enabled
210
- - LR: 5e-6 cosine with linear warmup
211
-
212
- ---
213
-
214
- ## Why GRPO v21 Succeeded — Formal Summary
215
-
216
- The GRPO learning quality score `first50_std × sign(Δreward) × |Δreward|^0.5` tracks the eval ranking across these four runs:
217
-
218
- | run | eval | first50_std | reward_Δ | GRPO_score |
219
- |---|---|---|---|---|
220
- | **v21** | **0.867** | **0.268** | **+0.083** | **0.078** |
221
- | v22 | 0.813 | 0.156 | +0.013 | 0.018 |
222
- | v23 | 0.760 | 0.187 | -0.019 | -0.026 |
223
- | v24 | 0.693 | 0.195 | -0.018 | -0.026 |
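The score can be recomputed directly from its stated formula. The sketch below copies the `eval`, `first50_std`, and `reward_Δ` columns from the table; the function and variable names are illustrative, not taken from the project's analysis code.

```python
import math


def grpo_learning_score(first50_std: float, reward_delta: float) -> float:
    """first50_std * sign(delta) * sqrt(|delta|), as written in the formula above."""
    sign = (reward_delta > 0) - (reward_delta < 0)
    return first50_std * sign * math.sqrt(abs(reward_delta))


RUNS = {
    "v21": {"eval": 0.867, "first50_std": 0.268, "reward_delta": 0.083},
    "v22": {"eval": 0.813, "first50_std": 0.156, "reward_delta": 0.013},
    "v23": {"eval": 0.760, "first50_std": 0.187, "reward_delta": -0.019},
    "v24": {"eval": 0.693, "first50_std": 0.195, "reward_delta": -0.018},
}

scores = {
    name: grpo_learning_score(run["first50_std"], run["reward_delta"])
    for name, run in RUNS.items()
}
by_score = sorted(RUNS, key=scores.get, reverse=True)
by_eval = sorted(RUNS, key=lambda name: RUNS[name]["eval"], reverse=True)
```

Comparing `by_score` with `by_eval` is a quick way to check whether the score orders the runs the same way the eval does. With only four runs this is suggestive rather than conclusive.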
224
-
225
- **Root cause:** GRPO learns from intra-group *contrast*, not from correctness.
226
-
227
- ```
228
- GRPO_success = P(≥1 zero per group) ≈ 0.25–0.35
229
- × hard_binary_gate (0.0 fail vs 0.7+ pass)
230
- × starting_below_optimum
231
- ```
232
-
233
- v21 hit the Goldilocks zone:
234
- - 1600-token budget → 32.5% clipping rate → intra-group std = 0.228 (high)
235
- - reward_v12 hard gate: clipped/meta → exactly 0.0; clean questions → 0.7–1.0
236
- - Starting from `sft-lora-v2-native` (mean reward 0.631 over the first 20 steps) with room to climb to 0.715 over the last 20
237
- - corr(q_tok, group_std) = +0.70: long-answer steps (meta-spill/clipped) = highest contrast
238
-
239
- Full analysis: `docs/GRPO_V21_SUCCESS_ANALYSIS.md`
240
-
241
- ---
242
-
243
- ## Reward Function History
244
-
245
- | Version | Signal | Key feature | Result |
246
- |---|---|---|---|
247
- | reward_v9 | NLI entailment depth | Hard gates + NLI depth score | Baseline structure |
248
- | reward_v10 | log-ratio (SFT likelihood) | Style-matching | Circular — rewards Lex-sounding output |
249
- | reward_v11 | info-gain (novelty × relevance via Qwen-0.5B simulator) | Functional quality | Anti-correlated with uses_guest (-0.098) |
250
- | **reward_v12** | uses_guest^0.67 × probing^0.33 + lexical_bonus | Batch logit-gap judges | **Best** — used in GRPO v12–v22 |
251
- | reward_v13 | v12 + meta-spill penalty + generic-opener penalty + soft overthinking penalty | Stricter gates | Compressed contrast, underperformed |
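As a rough illustration of the reward_v12 row, the weighted geometric mean and hard gate could look like the sketch below. The inputs are assumed to be judge probabilities in [0, 1]; the `clipped` and `meta_spill` flags, the default bonus, and the cap at 1.0 are assumptions for the example and are not confirmed by the actual `reward_v12` code.

```python
def reward_v12_sketch(uses_guest: float, probing: float,
                      lexical_bonus: float = 0.0,
                      clipped: bool = False,
                      meta_spill: bool = False) -> float:
    """Weighted geometric mean of two judge scores plus a bonus, behind a hard gate."""
    # Hard gate: truncated or meta-commentary completions get exactly zero.
    if clipped or meta_spill:
        return 0.0
    base = (uses_guest ** 0.67) * (probing ** 0.33)
    # Assumed cap so the bonus cannot push the reward above 1.0.
    return min(1.0, base + lexical_bonus)
```

Because the two factors are multiplied, a completion that fails either judge outright scores near zero regardless of the other, which is what gives the gate its contrast within a group.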
252
-
253
- ---
254
-
255
- ## Dataset History
256
-
257
- | Dataset | Size | Quality | Used in | Notes |
258
- |---|---|---|---|---|
259
- | interview_segments_v2.jsonl | 7,580 | Mixed (36% lazy, 7% sharp) | GRPO v8–v11 | 5:1 lazy:sharp ratio |
260
- | sft_v5_train.jsonl | 4,772 | All judge_score=1.0 | SFT v5, LoRA v1/v2 | 697 real Lex + 4,075 generated |
261
- | sft_v6_train.jsonl | 6,933 | Filtered, 0% generic openers | LoRA v2 filtered | No improvement over v5 |
262
- | held_out_eval.jsonl | 50→25 | ⚠️ contaminated with training | Evals | First 25 only — 50/50 overlap |
263
-
264
- ---
265
-
266
- ## Eval Infrastructure
267
-
268
- | Script | Purpose | Status |
269
- |---|---|---|
270
- | `eval_functional_judge.py` | 3-judge batch eval (single judge, local models) | Active |
271
- | `scripts/eval_cloud_models.py` | Cloud model eval with single/group judge | Active |
272
- | `scripts/eval_local_group_judge.py` | Local model eval with sequential group judge | Active |
273
- | `eval_functional.py` | Info-gain functional eval | Deprecated |
274
- | `scripts/eval_v2.py` | 10-score Lex-style eval | Legacy |
275
- | `scripts/eval_via_server.py` | 5-score via llama.cpp server | Legacy |
276
-
277
- ### Running current eval
278
- ```bash
279
- cd /home/bobber/lex-ft
280
- source .venv-vllm/bin/activate
281
- python3 eval_functional_judge.py --enable-thinking \
282
- --model base \
283
- --model2 /home/bobber/lex-ft/lora/sft-lora-v2-native \
284
- --model3 /home/bobber/lex-ft/lora/grpo-v21
285
- ```
286
-
287
- ---
288
-
289
- ## Legacy Eval Results (pre-functional, for reference only)
290
-
291
- ### 10-Score Leaderboard (2026-03-29, ⚠️ circular)
292
-
293
- | Rank | Model | 10-score | Avg Words |
294
- |---|---|---|---|
295
- | 🥇 | Base Nemotron 4B | 7.39/10 | 13 |
296
- | 2 | GRPO v11 step_100 | 7.38/10 | 13 |
297
- | 3 | SFT v4 Triton (ck200) | 5.36/10 | 50 |
298
- | 4 | SFT v2 (ck100) | 5.08/10 | 54 |
299
-
300
- ### 5-Score Leaderboard (legacy)
301
-
302
- | Rank | Model | 5-score | Words |
303
- |---|---|---|---|
304
- | 🥇 | Base Nemotron 4B | 4.35/5 | 59 |
305
- | 🥈 | GPT-5.4 | 4.30/5 | 52 |
306
- | 🥉 | Nemotron 30B-A3B Q8 | 4.25/5 | 62 |
307
- | 4 | SFT v4 Triton (ck200) | 3.80/5 | 50 |
308
- | 5 | Gemini 3.1 Pro / Claude Opus 4.6 | 3.70/5 | 82–121 |
309
- | 6 | SFT v2 (ck100) | 3.70/5 | 54 |
310
- | 7 | Qwen3.5-35B-A3B Q8 | 3.55/5 | 51 |
311
- | 8 | Qwen3.5-27B Q8 | 3.25/5 | 75 |
312
- | 9 | Gemini 2.5 Pro | 3.00/5 | 103 |
313
- | 10 | Nemotron SFT v1 LoRA | 2.10/5 | 25 |
314
- | 11 | Nemotron SFT v2 LoRA | 2.00/5 | 292 |
315
- | — | GRPO v3 (off-policy) | N/A | gibberish |
316
-
317
- ---
318
-
319
- *Created: 2026-03-19 | Updated: 2026-04-03 21:40 UTC*
docs/FULL_FINETUNE_PLAN_2026-03-20.md DELETED
@@ -1,76 +0,0 @@
1
- # Full Fine-Tune Plan — 2026-03-20
2
-
3
- ## Why this run
4
- - The new v2 dataset passed validation.
5
- - The current bad v2 model came from a **LoRA** run, not a full fine-tune.
6
- - So the next clean experiment is: **same good-ish data, different adaptation method**.
7
-
8
- ## Recommendation
9
- Run a **full SFT** on Nemotron 4B with Unsloth using the validated dataset:
10
- - dataset: `data/interview_segments_v2.jsonl`
11
- - max seq: `1792`
12
- - bf16 full fine-tune
13
- - save checkpoint every `50` steps
14
- - do not trust loss alone
15
- - evaluate checkpoint-50 before committing to the whole run
16
-
17
- ## Conservative starting hyperparameters
18
- - epochs: `2`
19
- - per-device batch: `2`
20
- - grad accumulation: `8`
21
- - effective batch: `16`
22
- - learning rate: `1e-5`
23
- - warmup steps: `25`
24
- - optimizer: `adamw_torch`
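The step arithmetic behind these settings can be sketched as follows. It assumes a single device, no sequence packing, and one training example per segment; the 7,580 figure is the size reported for `interview_segments_v2.jsonl` elsewhere in these docs, and the function name is illustrative.

```python
import math


def training_steps(n_examples: int, per_device_batch: int, grad_accum: int,
                   epochs: int, n_devices: int = 1) -> tuple:
    """Return (effective batch, optimizer steps per epoch, total optimizer steps)."""
    effective = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_examples / effective)
    return effective, steps_per_epoch, steps_per_epoch * epochs


effective, per_epoch, total = training_steps(7580, per_device_batch=2, grad_accum=8, epochs=2)
```

Under those assumptions the run is 948 optimizer steps long, so `checkpoint-50` lands at roughly 5% of training, early enough that stopping a bad run is cheap.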
25
-
26
- ## Why these numbers
27
- - `1792` matches the earlier data-driven sequence choice already used in the repo.
28
- - `1e-5` is intentionally lower than the LoRA run's `2e-4`; full fine-tune should start less aggressively.
29
- - `2 x 8` is conservative for a first full-model run even on a 128 GB machine.
30
- - Once step-50 is stable, batch can be increased if utilization is low.
31
-
32
- ## Success gate at step 50
33
- Checkpoint `checkpoint-50` must be evaluated before we trust the run.
34
-
35
- ### Minimum bar
36
- The step-50 model should:
37
- - clearly beat the bad v2 LoRA behavior
38
- - ask short questions instead of monologues
39
- - ideally approach or exceed the 4.35/5 Nemotron 4B base baseline on the existing eval
40
-
41
- ### If step-50 is bad
42
- Stop and adjust one of:
43
- - lower LR further (for example `5e-6`)
44
- - shorten target format / tighten generation template
45
- - reduce epochs
46
- - inspect whether training strings should exclude some assistant-heavy tails
47
-
48
- ## Launch command
49
- ```bash
50
- cd /home/bobber/lex-ft && \
51
- WANDB_RUN_NAME=lex-interviewer-full-sft-v1 \
52
- OUTPUT_DIR=/home/bobber/lex-ft/checkpoints/lex-interviewer-full-sft-v1 \
53
- FINAL_DIR=/home/bobber/lex-ft/models/lex-interviewer-full-sft-v1 \
54
- MAX_SEQ_LENGTH=1792 \
55
- BATCH_SIZE=2 \
56
- GRAD_ACCUM=8 \
57
- EPOCHS=2 \
58
- LR=1e-5 \
59
- SAVE_STEPS=50 \
60
- python3 scripts/train_full_sft.py |& tee logs/train_full_sft_v1.log
61
- ```
62
-
63
- ## Step-50 eval procedure
64
- 1. wait for `checkpoints/lex-interviewer-full-sft-v1/checkpoint-50`
65
- 2. export / serve that checkpoint with the same inference path used for leaderboard evals
66
- 3. run:
67
- ```bash
68
- cd /home/bobber/lex-ft && python3 scripts/eval_via_server.py full-sft-step50
69
- ```
70
- 4. if promising, optionally also run v2 eval:
71
- ```bash
72
- cd /home/bobber/lex-ft && python3 scripts/eval_v2.py full-sft-step50
73
- ```
74
-
75
- ## Important caution
76
- The current `eval_via_server.py` assumes a llama.cpp-compatible server is already running on `127.0.0.1:30000`. So the missing piece at eval time is not the scorer; it is the checkpoint-serving step.
docs/FUNCTIONAL_EVAL_DESIGN.md DELETED
@@ -1,134 +0,0 @@
1
- # Functional Eval Design — Info-Gain Based Evaluation
2
-
3
- > **Created:** 2026-03-29
4
- > **File:** `eval_functional.py`
5
- > **Motivation:** Replace Lex-style eval (circular) with functional quality measurement
6
-
7
- ---
8
-
9
- ## The Problem with eval_v3
10
-
11
- `eval_v3` (and v2, v1) score interview questions on Lex Fridman style dimensions:
12
- - Philosophical depth, curiosity, specificity in Lex's style
13
- - Judged by Claude Opus calibrated on real Lex transcripts
14
-
15
- **The circularity problem:**
16
- 1. Base Nemotron 4B already knows Lex Fridman from internet pretraining
17
- 2. Base model scores **7.39/10** with zero fine-tuning
18
- 3. Training with log-ratio reward also teaches Lex-like outputs
19
- 4. Both training signal and eval signal measure the same thing: "sounds like Lex"
20
- 5. Any improvement in functional quality is invisible to the eval
21
-
22
- **Evidence:** GRPO v11 step_100 scores 7.38/10 — essentially identical to base. Either nothing was learned, or the eval can't see it. We can't tell which.
23
-
24
- ---
25
-
26
- ## The Fix: Functional Evaluation
27
-
28
- **Core question:** Does the question unlock new relevant information from the guest?
29
-
30
- ```
31
- info_gain = novelty(sim_response vs guest) × relevance(sim_response vs question)
32
- ```
33
-
34
- ### Pipeline
35
-
36
- ```
37
- guest_statement
38
-
39
-
40
- [Policy Model] ──generates──► question
41
-
42
-
43
- [Guest Simulator (Qwen 0.5B)]
44
- prompted: "You just said: {guest}
45
- Follow-up: {question}
46
- Answer:"
47
-
48
-
49
- sim_response
50
-
51
- ┌──────────────┼──────────────┐
52
- ▼ │ ▼
53
- embed(guest) embed(question) embed(sim_resp)
54
- │ │ │
55
- └──────────────┼──────────────┘
56
-
57
- novelty = 1 - cos(sim_resp, guest)
58
- relevance = cos(sim_resp, question)
59
- info_gain = novelty × relevance
60
- ```
61
-
62
- ### Why This Works
63
-
64
- - **Novelty:** If the question just paraphrases what the guest said, sim_response will be similar to guest statement → low novelty → penalized
65
- - **Relevance:** If the question is off-topic/random, sim_response won't be about it → low relevance → penalized
66
- - **Product:** Only high if question opens a new relevant angle
67
- - **No Lex advantage:** Score cares about function, not style. Base model's Lex knowledge is irrelevant.
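The novelty × relevance product can be sketched with plain cosine similarity. The vectors below are toy 2-D stand-ins for sentence embeddings; in the described pipeline they would come from the MiniLM embedder, and the function names here are illustrative only.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def info_gain(guest_vec: list, question_vec: list, response_vec: list) -> float:
    """novelty(response vs guest) * relevance(response vs question)."""
    novelty = 1.0 - cosine(response_vec, guest_vec)
    relevance = cosine(response_vec, question_vec)
    return novelty * relevance
```

With orthogonal toy vectors the two failure modes are easy to see: a response that repeats the guest has novelty 0, and a response unrelated to the question has relevance 0, so either one zeroes the product.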
68
-
69
- ### Validation (2026-03-29)
70
-
71
- Real Lex questions vs generic LLM questions on same guest statements:
72
- - Real Lex mean info_gain: **0.368**
73
- - Generic mean info_gain: **0.069**
74
- - Diff: **+0.300**, p≈0
75
-
76
- This is the strongest discrimination signal of all reward variants tested (log-ratio: +1.09 nats, NLI: +0.005).
77
-
78
- ---
79
-
80
- ## Implementation
81
-
82
- ```python
83
- # eval_functional.py
84
- BASE_MODEL = 'models/NVIDIA-Nemotron-3-Nano-4B' # policy model
85
- SIM_MODEL = 'Qwen/Qwen2.5-0.5B-Instruct' # guest simulator (frozen)
86
- EMBED_MODEL = 'all-MiniLM-L6-v2' # sentence embedder
87
- HELD_OUT = 'data/held_out_eval.jsonl' # 50 held-out prompts
88
-
89
- # Usage:
90
- # python eval_functional.py --model base
91
- # python eval_functional.py --model results/grpo_v8/ckpt_step_100
92
- # python eval_functional.py --model base --model2 results/grpo_v8/ckpt_step_100
93
- ```
94
-
95
- ### Running with mamba_ssm Mock
96
-
97
- Requires the wrapper script due to compiled extension issues:
98
- ```bash
99
- HF_MODULES_CACHE=/tmp/hf_modules \
100
- python /tmp/run_functional_eval.py \
101
- --model base \
102
- --model2 results/grpo_v8/ckpt_step_100 \
103
- --n 25 \
104
- --output results/functional_eval_base_vs_step100.json
105
- ```
106
-
107
- See `CURRENT_STATE_2026-03-29.md` for setup details.
108
-
109
- ---
110
-
111
- ## Interpreting Results
112
-
113
- | info_gain range | Interpretation |
114
- |---|---|
115
- | > 0.20 | Strong — question opens genuinely new, relevant territory |
116
- | 0.10–0.20 | Moderate — some new ground, partially on-topic |
117
- | < 0.10 | Weak — either repetitive or tangential |
118
-
119
- Expected baseline (base model): ~0.15–0.25 based on real Lex benchmark (0.368).
120
-
121
- ---
122
-
123
- ## Limitations
124
-
125
- 1. **Guest simulator quality:** Qwen 0.5B is a weak simulator. Might miss subtle angles.
126
- 2. **Embedding space:** MiniLM may not capture deep semantic differences.
127
- 3. **Single simulation:** One sim_response per question — stochastic. Could average over 3-5.
128
- 4. **Naive SSM inference:** Without causal_conv1d, generation is slower and slightly different from trained distribution.
129
-
130
- These are known weaknesses. The metric is still far better than the circular Lex-style eval.
131
-
132
- ---
133
-
134
- *Created: 2026-03-29 17:00 UTC*
docs/GRPO_V11_DESIGN.md DELETED
@@ -1,141 +0,0 @@
1
- # GRPO v11 Design — Info-Gain Reward + On-Policy vLLM
2
-
3
- > **Status:** 🔄 In progress — 2 runs completed, functional eval pending
4
- > **Date:** 2026-03-29
5
- > **Training script:** `grpo_v8_train.py` with `reward_v11.py`
6
-
7
- ---
8
-
9
- ## Motivation
10
-
11
- Every prior GRPO run (v4–v10) showed positive training rewards but flat or degraded eval scores. Root cause analysis: **the reward was measuring a proxy, not the actual goal.**
12
-
13
- | Reward version | What it measured | Problem |
14
- |---|---|---|
15
- | v8 (heuristic) | Structural patterns (?, length, no filler) | Gamed by step 50 |
16
- | v9 (NLI) | Whether question adds beyond guest | Failed discrimination (p=0.48) |
17
- | v10 (log-ratio) | How much question sounds like Lex | Stochastic parrot — rewards style not function |
18
- | **v11 (info-gain)** | **Whether question unlocks new guest info** | **First functional signal** |
19
-
20
- ---
21
-
22
- ## Reward v11: Info-Gain via Simulated Response
23
-
24
- ```
25
- reward = hard_gates × (info_gain + brevity_bonus + specificity_bonus + diversity_bonus)
26
-
27
- info_gain = novelty(sim_response vs guest) × relevance(sim_response vs question)
28
- ```
29
-
30
- **Hard gates** (reward = 0 if any fires):
31
- - `?` count > 4 (multi-question dump)
32
- - Starts with "As Lex Fridman..." (sycophantic framing)
33
- - Contains stage directions `*(..)*`
34
- - Starts with "I" (interviewer-centric)
35
- - < 5 words or > 200 words
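The hard gates listed above could be checked roughly as follows. This is a sketch under stated assumptions, not the code in `reward_v11.py`: the function name `passes_hard_gates` is hypothetical, "starts with I" is read as the standalone word "I" (including "I'm"), and stage directions are matched as `*( ... )*`.

```python
import re


def passes_hard_gates(question: str) -> bool:
    """Return False if any hard gate fires (the reward would then be 0)."""
    text = question.strip()
    n_words = len(text.split())
    if text.count("?") > 4:                 # multi-question dump
        return False
    if text.startswith("As Lex Fridman"):   # sycophantic framing
        return False
    if re.search(r"\*\(.*?\)\*", text):     # stage directions such as *(pauses)*
        return False
    if re.match(r"I\b", text):              # interviewer-centric opener (assumed reading)
        return False
    if n_words < 5 or n_words > 200:        # length bounds
        return False
    return True


print(passes_hard_gates("What did failure teach you about building things?"))
print(passes_hard_gates("I think that is great, right?"))
```

The gate result multiplies the rest of the reward, so a single failed check zeroes the score regardless of info-gain.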
36
-
37
- **Info-gain computation:**
38
- 1. Freeze Qwen2.5-0.5B as guest simulator
39
- 2. Prompt it: "You said: {guest}. Follow-up: {question}. Answer:"
40
- 3. Embed {guest}, {question}, {sim_response} with all-MiniLM-L6-v2
41
- 4. `novelty = 1 - cosine(sim_response, guest)` — did response say something new?
42
- 5. `relevance = cosine(sim_response, question)` — was response on-topic?
43
- 6. `info_gain = novelty × relevance`
44
-
45
- **Key property:** base model's knowledge of Lex Fridman gives **zero advantage**. Scorer measures whether the question *works*, not whether it *sounds right*.
46
-
47
- ### Validation Experiments
48
-
49
- | Experiment | Result |
50
- |---|---|
51
- | Hard gate calibration on 200 real Lex questions | 0.5% false positive (PASS) |
52
- | NLI discrimination (real Lex vs generic) | Failed (DROPPED from v11) |
53
- | Log-ratio signal strength | +1.09 nats diff, p≈0 (but parrot problem) |
54
- | Info-gain discrimination (real Lex vs generic) | +0.300 diff, p≈0 ✅ **STRONGEST signal** |
55
- | Guest simulator variance across questions | std=31.7 words ✅ responds meaningfully |
56
-
57
- ---
58
-
59
- ## Training Runs
60
-
61
- ### Run 1 (grpo-v11-run1)
62
- - **Date:** 2026-03-29 04:57–12:48 UTC (~7.8h)
63
- - **Base:** `results/grpo_v8/ckpt_step_100` (GRPO v8 best checkpoint)
64
- - **Steps:** 148
65
- - **Checkpoints:** saved at step 50 and step 100 → `results/grpo_v8/ckpt_step_50/100`
66
- - **Reward progression:**
67
-
68
- | Step range | Reward mean | n_pos/32 | Overlong/32 |
69
- |---|---|---|---|
70
- | 0–10 | 0.81–1.48 | 10–17 | 6–12 |
71
- | ~50 | ~1.1–1.3 | 13–16 | 10–14 |
72
- | ~100 | ~1.1–1.4 | 13–17 | 10–16 |
73
- | 140–147 | ~1.0–1.4 | 13–18 | 7–14 |
74
-
75
- - Rewards consistently positive throughout
76
- - No collapse (unlike v8 which collapsed at step 50)
77
- - `n_pos` trend: slow improvement from ~10 → ~15+ over 148 steps
78
- - **Killed:** gateway restart at step 148 (between step reward and backward pass)
79
-
80
- ### Run 2 (grpo-v11-run2)
81
- - **Date:** 2026-03-29 12:49–13:17 UTC
82
- - **Base:** `results/grpo_v8/ckpt_step_100` (same starting point)
83
- - **Steps:** 108 (steps 106–108 logged)
84
- - **Reward mean at step 108:** 1.241 (healthy)
85
- - Best question: *"When you describe the guest's 'age' metaphor as something that can be adjusted, what does that mean"*
86
- - **Killed:** gateway restart mid-generation (no checkpoint saved)
87
-
88
- ---
89
-
90
- ## Eval Results (v3 scorer, held-out)
91
-
92
- | Model | Score | Avg Words |
93
- |---|---|---|
94
- | Base Nemotron 4B | 7.39/10 | 13 |
95
- | GRPO v11 step_100 | 7.38/10 | 13 |
96
-
97
- **Interpretation:** These scores are likely misleading due to Lex circularity. See `EVAL_RESULTS.md` for full analysis. Functional eval (info-gain) is the correct measurement.
98
-
99
- ---
100
-
101
- ## Functional Eval (Running 2026-03-29)
102
-
103
- `eval_functional.py` comparing base vs step_100 on 25 held-out prompts.
104
-
105
- Expected outcomes:
106
- - **step_100 > base:** GRPO worked, v3 eval was blind to it → launch 500-step full run
107
- - **base ≈ step_100:** Reward didn't teach functional quality → investigate or try full-weight
108
- - **Both near 0:** Naive SSM inference degraded quality → need llama.cpp path for eval
109
-
110
- Results: **PENDING**
111
-
112
- ---
113
-
114
- ## Technical Notes
115
-
116
- ### mamba_ssm Mock for Local Inference
117
- Running Nemotron with `transformers.AutoModelForCausalLM` (vs vLLM) requires mocking `mamba_ssm` because compiled extensions don't build for Python 3.13 or torch 2.9+CUDA13:
118
-
119
- ```python
120
- # /tmp/run_functional_eval.py
121
- # 1. Mock mamba_ssm in sys.modules before any imports
122
- # 2. Patch modeling_nemotron_h.py in HF_MODULES_CACHE to use try/except for causal_conv1d
123
- # 3. Use .venv-vllm (Python 3.12) which has peft + sentence_transformers
124
- ```
125
-
126
- Model falls back to naive SSM (slower but correct). Generation validated with test scripts.
127
-
128
- ---
129
-
130
- ## Next Steps
131
-
132
- 1. Read functional eval results (base vs step_100 info-gain)
133
- 2. Based on results:
134
- - If training works → 500-step full run with reward_v11
135
- - If training doesn't work → SFT from reward_v11 completions (`gen_sft_data.py`)
136
- - If inference is degraded → use llama.cpp for eval (merge LoRA → GGUF)
137
- 3. Consider full-weight training (`--lora-r 0`) — LoRA only touches 4 attention layers out of 42 total
138
-
139
- ---
140
-
141
- *Created: 2026-03-29 17:00 UTC*
docs/GRPO_V11_POSTMORTEM.md DELETED
@@ -1,120 +0,0 @@
1
- # GRPO v11 Postmortem
2
-
3
- Date: 2026-03-29
4
- Status: Training failed to beat base on clean eval. Root causes identified.
5
-
6
- ---
7
-
8
- ## Results Summary
9
-
10
- | Model | Clean eval (held-out) | Leaky eval | Δ vs base |
11
- |-------|-----------------------|-----------|-----------|
12
- | Base Nemotron 4B | **7.39/10** | 6.46/10 | — |
13
- | Step 100 ckpt | **7.38/10** | 6.54/10 | −0.01 |
14
-
15
- **Net result: 100 steps of GRPO v11 = zero measurable improvement on unseen data.**
16
-
17
- The previous "improvements" (6.70, 7.26, 7.28) were eval artifacts.
18
-
19
- ---
20
-
21
- ## Root Cause 1: Eval Leakage (Masked Everything)
22
-
23
- All 25 eval scenarios were also training prompts (lines 147–155 of `grpo_v8_train.py`).
24
- The val score was measuring memorization, not generalization.
25
-
26
- Fix: `data/held_out_eval.jsonl` — 50 real transcript prompts, zero overlap with training.
27
-
28
- ---
29
-
30
- ## Root Cause 2: Off-Policy Drift (Primary Training Failure)
31
-
32
- IS clipped ratio grew from 0.8% (step 0) to 22.1% (step 145).
33
- By step 100, log_ratio std was ±0.38 and clipped=4.5%.
34
-
35
- | Steps | IS clipped | Effect |
36
- |-------|-----------|--------|
37
- | 0–20 | <2% | Gradients valid |
38
- | 50–80 | 3–5% | Mildly stale |
39
- | 100–145 | **5–22%** | Gradients increasingly corrupt |
40
-
41
- Cause: vLLM generates from the weights at step N. Training then takes 4 gradient steps
42
- (micro_batch=8 on 32 completions). By the fourth gradient step the policy has drifted from the
43
- generation distribution, and the drift accumulates from one training step to the next.
44
-
45
- KL coef 0.15 will slow drift rate but not fix the fundamental off-policy problem.
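To make the "IS clipped" metric concrete, here is a minimal sketch of how such a fraction can be computed from per-sample log-probabilities. The function name and the clip range `eps=0.2` are assumptions for illustration; the value used in `grpo_v8_train.py` is not stated here.

```python
import math


def clipped_fraction(logp_new, logp_old, eps=0.2):
    """Fraction of samples whose importance ratio falls outside [1 - eps, 1 + eps]."""
    ratios = [math.exp(new - old) for new, old in zip(logp_new, logp_old)]
    clipped = [r < 1.0 - eps or r > 1.0 + eps for r in ratios]
    return sum(clipped) / len(clipped)


logp_old = [-1.0, -1.0, -1.0, -1.0]   # log-probs under the generating weights
logp_new = [-1.0, -1.0, -1.0, -0.5]   # log-probs after some gradient steps
print(clipped_fraction(logp_new, logp_old))  # 0.25
print(clipped_fraction(logp_old, logp_old))  # 0.0 when fully on-policy
```

When the training weights equal the generating weights, every ratio is 1 and nothing is clipped, which is why a rising clipped fraction is a direct readout of off-policy drift.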
46
-
47
- ---
48
-
49
- ## Root Cause 3: LoRA Capacity Too Small
50
-
51
- Weight analysis shows delta norms of 0.01–0.03 after 100 steps.
52
- KL divergence between base and trained first-token distributions: **0.0013** (near zero).
53
-
54
- LoRA r=32 on a hybrid 38-Mamba + 4-Attn architecture:
55
- - Attention (4 layers): most weight-updated modules, but fewest layers
56
- - Mamba (38 layers): most changed by norm, but still tiny deltas
57
- - Effective trainable parameters: ~41M / 4B = 1.03% — insufficient for style transfer
58
-
59
- Semantic similarity between base and trained outputs: **0.541** (meaningful divergence
60
- in surface form), but KL=0.0013 means the underlying probability distributions are
61
- virtually identical. The model is sampling differently but from the same distribution.
62
-
63
- ---
64
-
65
- ## Root Cause 4: Reward Noise
66
-
67
- Each question scored against one sim response from Qwen 0.5B.
68
- Single-sample info gain has high variance — same question can score very differently
69
- across runs depending on the sim's stochastic output.
70
-
71
- This adds noise to the reward signal that the gradient cannot overcome.
72
-
73
- ---
74
-
75
- ## What Did Work
76
-
77
- - **Reward v11 design**: info gain via simulated response genuinely separates
78
- good from bad questions (Exp B: diff=+0.300, p≈0)
79
- - **Hard gates**: calibrated correctly (0.5% FP on real Lex questions)
80
- - **vLLM sleep/wake**: solves the OOM issue cleanly
81
- - **Clean eval infrastructure**: `eval_clean.py` + `held_out_eval.jsonl`
82
-
83
- ---
84
-
85
- ## Three Paths Forward
86
-
87
- ### Option A: SFT on Best GRPO Completions (Recommended First)
88
- - Collect all completions with reward ≥ 3.0 from run1 logs
89
- - Build SFT dataset: (guest, high-reward question) pairs
90
- - Train SFT 1 epoch — no RL, no off-policy issues
91
- - **Tests: is the reward signal capturing real quality?**
92
- - If SFT on high-reward completions beats base → reward is good, RL loop is the problem
93
- - If SFT also fails → reward is not capturing what we want
94
-
95
- ### Option B: Full-Weight Training
96
- - `--lora-r 0` (already implemented in grpo_v8_train.py)
97
- - All 4B params trainable — Mamba layers get full gradient
98
- - Needs micro_batch=32 (single gradient step per generation batch)
99
- - Needs true on-policy: generate and immediately train on same weights
100
- - Risk: memory, instability, slower
101
-
102
- ### Option C: On-Policy GRPO (No vLLM)
103
- - Generate with HF model directly (no vLLM intermediary)
104
- - IS ratio = 1.0 always — no off-policy drift possible
105
- - 10-20× slower generation
106
- - But clean gradient signal throughout
107
-
108
- **Decision**: Run Option A first (fast, cheap, diagnostic).
109
- If SFT on high-reward data beats base on clean eval → confirms reward is valid,
110
- proceed to Option B/C for RL. If not → fundamentally rethink the reward.
111
-
112
- ---
113
-
114
- ## Lessons
115
-
116
- 1. **Always use held-out eval** — never train on eval prompts
117
- 2. **Monitor IS clipped ratio** — >5% is a warning, >10% is broken
118
- 3. **KL coef alone doesn't fix off-policy drift** — need structural fix (micro_batch=32 or true on-policy)
119
- 4. **LoRA r=32 is insufficient for this task** — need r≥128 or full-weight for style transfer on 4B hybrid
120
- 5. **Clean eval before any training decision** — the leaky eval wasted ~10 training runs
docs/GRPO_V21_PLAN.md DELETED
@@ -1,78 +0,0 @@
1
- # GRPO v21 Plan — Longer Thinking Budget + Better Diagnostics (2026-04-03)
2
-
3
- ## Why v20 underperformed
4
-
5
- GRPO v20 (`lora/grpo-v20-think2`) finished successfully but evaluated worse than both the base model and LoRA v2 native:
6
-
7
- - Base: **0.753**
8
- - LoRA v2 native: **0.760**
9
- - GRPO v20 think2: **0.720**
10
-
11
- The main failure mode was not the reasoning parser anymore — that was fixed. The problem was **truncation**.
12
-
13
- ### Verified findings
14
-
15
- - Thinking was correctly enabled via `chat_template_kwargs={"enable_thinking": True}`
16
- - vLLM needed `structured_outputs_config={"reasoning_parser": "nemotron_v3"}`
17
- - Nemotron reasoning output often arrives as:
18
- - `reasoning text ... </think> final answer`
19
- - because the prompt already contains the opening `<think>` token
20
- - Standalone verification confirmed Nemotron does emit `</think>` and a clean final answer when generation is long enough
21
-
22
- ### Truncation analysis (v20)
23
-
24
- Using `logs/grpo_v20_think2.log`:
25
-
26
- - Steps analyzed: **135**
27
- - Steps where `thinking/token_len_min == 0`: **67** (**49.63%**)
28
- - Steps with `completions/clipped_ratio > 0`: **68** (**50.37%**)
29
- - Steps with both conditions: **67**
30
- - Therefore **100% of zero-thinking steps were clipped steps**
31
-
32
- Conclusion: the 800-token generation cap was too small. Roughly half the GRPO groups had at least one completion clipped before closing the thinking block.
33
-
34
- ## v21 changes
35
-
36
- ### Generation budget
37
- - `MAX_NEW_TOKENS`: **1600** (up from 800)
38
- - `MAX_SEQ`: **3072** (up from 1536)
39
- - `VLLM_GPU_MEM_UTIL`: **0.35** (up from 0.30)
40
-
41
- ### Starting checkpoint
42
- - Restart from **LoRA v2 native**:
43
- - `/home/bobber/lex-ft/lora/sft-lora-v2-native`
44
- - Do **not** continue from degraded GRPO v20 checkpoint
45
-
46
- ### Diagnostics added to W&B
47
-
48
- Keep existing:
49
- - `thinking/token_len_mean|min|max`
50
- - `thinking/char_len_mean|min|max`
51
- - `thinking/present_ratio`
52
- - question lengths
53
-
54
- New failure-mode-specific metrics:
55
- - `thinking/nonempty_count`
56
- - `thinking/closed_tag_ratio`
57
- - `thinking/missing_ratio`
58
- - `thinking/unclosed_ratio`
59
- - `thinking/no_answer_after_close_ratio`
60
- - `thinking/clipped_and_unclosed_ratio`
61
- - `answer/nonempty_count`
62
- - `answer/questionmark_ratio`
63
-
64
- These distinguish:
65
- 1. no thinking at all
66
- 2. thinking started but never closed
67
- 3. think block closed but no final answer
68
- 4. clipped + unclosed specifically
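The four failure modes above can be separated with a small classifier over the raw completion text. This is a sketch only: the function name and return labels are hypothetical (they mirror the metric names but are not taken from the training script), and it assumes the completion starts inside the think block because the prompt already contains the opening `<think>` token.

```python
def classify_completion(completion: str, clipped: bool) -> str:
    """Label one completion by how its think block ended."""
    reasoning, close_tag, answer = completion.partition("</think>")
    if not close_tag:
        if not completion.strip():
            return "missing"                  # nothing generated at all
        return "clipped_and_unclosed" if clipped else "unclosed"
    if not reasoning.strip():
        return "missing"                      # closed immediately, no thinking
    if not answer.strip():
        return "no_answer_after_close"        # thinking closed, answer empty
    return "ok"


print(classify_completion("weigh the angles </think> What changed for you?", clipped=False))
print(classify_completion("weigh the angles and keep going", clipped=True))
```

Per-step ratios such as `thinking/unclosed_ratio` would then be the share of completions in a batch that receive each label.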
69
-
70
- ## Files
71
-
72
- - Training script: `scripts/train_grpo_v20.py` (repurposed as v21 launcher)
73
- - Launcher: `run_grpo_v20.sh`
74
- - Planned output: `checkpoints/grpo-v21/`, `lora/grpo-v21/`
75
-
76
- ## Launch goal
77
-
78
- Run a fresh GRPO job from LoRA v2 native with verified Nemotron reasoning parsing, enough token budget for the think block to close, and W&B metrics that expose the exact failure mode if it regresses.
docs/GRPO_V21_SUCCESS_ANALYSIS.md DELETED
@@ -1,152 +0,0 @@
1
- # Why GRPO v21 Succeeded — Formal Analysis
2
-
3
- **Date:** 2026-04-03
4
- **By:** post-hoc analysis after v21, v22, v23, v24 experiments
5
-
6
- ---
7
-
8
- ## Summary
9
-
10
- GRPO v21 scored **0.867 ± 0.231** on thinking-enabled functional eval — the best result across all training runs. This document explains precisely why, backed by measured metrics from training logs across all four runs.
11
-
12
- ---
13
-
14
- ## The Key Metric: GRPO Learning Quality Score
15
-
16
- We define a compound metric that captures what GRPO actually needs to learn:
17
-
18
- ```
19
- GRPO_score = first50_std × sign(reward_delta) × |reward_delta|^0.5
20
- ```
21
-
22
- Where:
23
- - `first50_std` = average intra-group reward std across the first 50 training steps
24
- - `reward_delta` = mean(last 20 steps reward) − mean(first 20 steps reward)
25
-
26
- | run | eval | avg_group_std | first50_std | reward_delta | high_var% | n_big_q | GRPO_score |
27
- |---|---|---|---|---|---|---|---|
28
- | **v21** | **0.867** | **0.228** | **0.268** | **+0.083** | **48.5** | **63** | **0.0778** |
29
- | v22 | 0.813 | 0.151 | 0.156 | +0.013 | 27.5 | 18 | 0.0182 |
30
- | v23 | 0.760 | 0.186 | 0.187 | -0.019 | 36.0 | 45 | -0.0036 |
31
- | v24 | 0.693 | 0.208 | 0.195 | -0.018 | 43.0 | 62 | -0.0035 |
32
-
33
- **GRPO_score is nearly perfectly correlated with final eval score.**
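For reference, the formula as written can be evaluated directly from the two table columns it depends on. This is a minimal sketch; the function name `grpo_score` is an assumption, and the inputs are the rounded values from the table, so small differences from the tabulated scores are expected.

```python
import math


def grpo_score(first50_std: float, reward_delta: float) -> float:
    """first50_std x sign(reward_delta) x |reward_delta|^0.5"""
    sign = math.copysign(1.0, reward_delta)
    return first50_std * sign * math.sqrt(abs(reward_delta))


runs = {
    "v21": (0.268, 0.083),
    "v22": (0.156, 0.013),
    "v23": (0.187, -0.019),
    "v24": (0.195, -0.018),
}
for name, (std, delta) in runs.items():
    print(name, round(grpo_score(std, delta), 4))
```

With these rounded inputs the square-root form gives about 0.077 for v21 and 0.018 for v22, close to the table. For v23 and v24 it gives about -0.026, whereas the tabulated -0.0036 and -0.0035 match `first50_std × reward_delta` without the square root, so the negative rows may have been computed differently. The ranking of the four runs is the same either way.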
34
-
35
- ---
36
-
37
- ## The Underlying Mechanism
38
-
39
- ### What generates high group variance?
40
-
41
- The key measurement: `corr(q_tok, group_std) = +0.70` across all runs.
42
-
43
- - When `q_tok > 100` (meta-spill / clipped text leaks into the "answer" region): `avg_std ≈ 0.41`
44
- - When `q_tok ≤ 30` (clean short questions): `avg_std ≈ 0.13`
45
-
46
- High-variance "learning steps" are exactly the ones where some completions produce meta-spill or clipped garbage (`reward = 0`) while others produce clean interviewer questions (`reward = 0.7–1.0`).
47
-
48
- Pattern example: `[1.0, 0.0, 1.0, 1.0]` → std = 0.500
49
-
50
- This is not a problem. **This is the GRPO learning signal.** Steps with high intra-group contrast produce the largest advantage estimates, which drive the strongest policy gradient updates.
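The pattern example can be worked through numerically. The sketch below assumes the usual group-relative normalization (reward minus group mean, divided by group standard deviation); the function name and the small `eps` guard are illustrative choices. The quoted std of 0.500 corresponds to the sample standard deviation (n − 1 in the denominator); the population form would give about 0.433.

```python
from statistics import mean, stdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - group mean) / (group sample std + eps)."""
    mu = mean(rewards)
    sigma = stdev(rewards)  # sample std, matches the 0.500 figure above
    return [(r - mu) / (sigma + eps) for r in rewards]


contrast_group = [1.0, 0.0, 1.0, 1.0]
uniform_group = [0.8, 0.8, 0.8, 0.8]

print(stdev(contrast_group))                                      # 0.5
print([round(a, 3) for a in group_advantages(contrast_group)])    # [0.5, -1.5, 0.5, 0.5]
print([round(a, 3) for a in group_advantages(uniform_group)])     # [0.0, 0.0, 0.0, 0.0]
```

A group with identical rewards yields all-zero advantages and therefore no gradient, while the single failed completion in the contrast group receives the largest-magnitude advantage.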
51
-
52
- ---
53
-
54
- ## Why v22 Failed Despite Fixing Clipping
55
-
56
- v22 reduced clipping from 32.5% → 9.5%. This seemed like an improvement, but it:
57
-
58
- - Reduced `n_big_q` (steps with high q_tok variance): **63 → 18**
59
- - Reduced `avg_group_std`: **0.228 → 0.151**
60
- - Reduced `high_var%`: **48.5% → 27.5%**
61
- - Reduced `reward_delta`: **+0.083 → +0.013**
62
-
63
- By eliminating natural "contrast generators" (steps where one clipped/failed completion zeroes while three good ones score 0.7–1.0), v22 starved GRPO of learning signal. The run converged but with much weaker gradient push.
64
-
65
- **The right amount of clipping was generating useful contrast. Less clipping ≠ better learning.**
66
-
67
- ---
68
-
69
- ## Why v23 and v24 Failed Despite Similar Group Structure
70
-
71
- v24 had `n_big_q = 62` — essentially the same as v21's 63. The structural conditions looked similar. But:
72
-
73
- | | v21 | v24 |
74
- |---|---|---|
75
- | big-q avg_std | **0.410** | 0.386 |
76
- | big-q avg_mean reward | 0.563 | 0.511 |
77
- | small-q avg_mean reward | 0.757 | 0.766 |
78
- | reward_delta | **+0.083** | -0.018 |
79
-
80
- reward_v13's multiplicative soft penalties depressed the top end of the reward distribution:
81
- - partially-bad completions that reward_v12 zeros → reward_v13 gives 0.6 (reduced contrast)
82
- - clean good completions get penalized for generic openers → upper bound compressed
83
-
84
- The result: less variance in the high-q_tok steps, and overall reward trending downward. GRPO had no uphill direction to optimize toward.
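A small worked comparison shows how much contrast is lost when a failed completion is softened from 0 to 0.6. The two reward groups are hypothetical; only the 0.6 figure comes from the text above, and the helper name is an assumption.

```python
from statistics import stdev


def reward_contrast(group):
    """Sample standard deviation of one group's rewards."""
    return stdev(group)


hard_gate_group = [1.0, 0.0, 1.0, 1.0]      # failure zeroed (reward_v12 style)
soft_penalty_group = [1.0, 0.6, 1.0, 1.0]   # same failure scored 0.6 (reward_v13 style)

print(round(reward_contrast(hard_gate_group), 3))     # 0.5
print(round(reward_contrast(soft_penalty_group), 3))  # 0.2
```

In this toy case the intra-group spread drops from 0.5 to 0.2, which is the direction of the effect described for the big-q steps in the table.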
85
-
86
- ---
87
-
88
- ## The Formal Model of v21's Success
89
-
90
- v21 succeeded because **all three** of these conditions were simultaneously satisfied:
91
-
92
- ### Condition 1: Token budget → right natural failure rate
93
-
94
- - 1600 tokens → ~32% of steps had at least one clipped/failed completion
95
- - P(at least 1 zero per group of 4) ≈ 0.30 → avg group std ≈ 0.228
96
-
97
- **The Goldilocks zone:**
98
- - Budget too large (v22): fewer zeros → std collapses → no signal
99
- - Budget too small (v20): too many zeros → all groups degenerate → no signal
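The link between a per-completion failure rate and the per-group figure can be checked with a one-line probability model. This is a sketch assuming failures are independent across the 4 completions in a group; the function name is hypothetical. The input 0.092 is the average clipped ratio reported for v21 in the v22 plan.

```python
def p_any_zero(p_fail: float, group_size: int = 4) -> float:
    """Probability that at least one completion in a group fails."""
    return 1.0 - (1.0 - p_fail) ** group_size


print(round(p_any_zero(0.092), 3))  # 0.32
```

Under the independence assumption, a 9.2% per-completion clip rate predicts that about 32% of groups contain at least one clipped completion, in line with the roughly 32% of steps observed.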
100
-
101
- ### Condition 2: reward_v12 creates maximal binary contrast
102
-
103
- - Hard gate: clipped text fails `ends with ?` check → reward = 0.0 exactly
104
- - Clean questions: reward = 0.7–1.0
105
- - This binary gap maximizes intra-group advantage
106
-
107
- reward_v13 added partial penalties (×0.10, ×0.35, ×0.70), which smoothed rather than sharpened the contrast. A penalized bad completion at ~0.6 reward creates far less learning signal than a hard-zero.
108
-
109
- ### Condition 3: Starting policy at the right distance from the optimum
110
-
111
- - `sft-lora-v2-native` started at reward_first20 = 0.631
112
- - Clear room to climb to 0.715 → delta = +0.083
113
- - The reward landscape was still uphill from that starting point
114
-
115
- v23 started from v21, which had already climbed most of the hill. With reward_v13's stricter objective, the starting point was at or above the new optimum → reward went downhill.
116
-
117
- ---
118
-
119
- ## The Formula
120
-
121
- ```
122
- GRPO_success = Prob(at least 1 zero in group) ≈ 0.25–0.35
123
- × hard_binary_reward_gate (zeros are truly zero, goods are 0.7+)
124
- × starting_below_optimum (reward can still increase)
125
- ```
126
-
127
- This is not luck. v21 hit a **Goldilocks combination** that maximized GRPO's learning efficiency: enough zeros to create contrast (but not too many), a reward that makes zeros hard and goods strong, and a starting point with room to improve.
128
-
129
- ---
130
-
131
- ## Implications for Future Runs
132
-
133
- This is fully replicable. To exceed v21:
134
-
135
- 1. **Keep a hard binary gate** — zeros when clearly wrong (no partial credit for failures)
136
- 2. **Keep a budget where ~25–33% of steps naturally produce at least one zero**
137
- 3. **Ensure the starting policy has room to improve** (don't start from the current best checkpoint under the same reward)
138
- 4. **Make the reward more discriminative at the top end** — push from 0.7 → 0.95 for genuinely excellent questions, rather than adding penalties at the bottom
139
-
140
- The insight is that GRPO learns best from **contrast**, not from **correctness**. A step where three completions are excellent and one is terrible teaches more than a step where all four are mediocre.
141
-
142
- ---
143
-
144
- ## Measured Data
145
-
146
- All data extracted from training logs:
147
- - `/home/bobber/lex-ft/logs/grpo_v21.log`
148
- - `/home/bobber/lex-ft/logs/grpo_v22.log`
149
- - `/home/bobber/lex-ft/logs/grpo_v23.log`
150
- - `/home/bobber/lex-ft/logs/grpo_v24.log`
151
-
152
- Final eval scores from thinking-enabled functional eval using Qwen3.5-4B judges (on_topic × uses_guest × probing) on 25 held-out prompts.
docs/GRPO_V22_PLAN.md DELETED
@@ -1,61 +0,0 @@
1
- # GRPO v22 Plan — 2560 Token Budget + Clip Penalty (2026-04-03)
2
-
3
- ## Motivation
4
-
5
- GRPO v21 proved that the thinking-enabled path works and that GRPO can beat the base model under thinking-enabled eval:
6
-
7
- - base (thinking-enabled): **0.760**
8
- - LoRA v2 native (thinking-enabled): **0.707**
9
- - GRPO v21 (thinking-enabled): **0.867**
10
-
11
- However, v21 still had clipped long-run samples during training.
12
-
13
- ## Measured v21 clipping rate
14
-
15
- From the full 200-step training log:
16
-
17
- - Steps with any clipped completion: **65 / 200 = 32.5%**
18
- - Steps with `thinking/present_ratio < 1.0`: **63 / 200 = 31.5%**
19
- - Steps with `thinking/closed_tag_ratio < 1.0`: **63 / 200 = 31.5%**
20
- - 100% of low-present / low-closed steps overlapped with clipping
21
- - Average clipped ratio across all steps: **0.092**
22
- - Steps with 50% or more completions clipped: **9 / 200 = 4.5%**
23
-
24
- So v21 improved substantially over v20, but clipping still polluted a meaningful fraction of training steps.
25
-
26
- ## v22 changes
27
-
28
- ### Larger generation budget
29
- - `MAX_NEW_TOKENS = 2560`
30
- - `MAX_SEQ = 4096`
31
-
32
- This is a step up from v21's 1600/3072, without going all the way to 4096 new tokens.
33
-
34
- ### Explicit clipped-completion penalty
35
- - `CLIP_PENALTY = 0.10`
36
-
37
- Current implementation:
38
- - detect a locally clipped completion when generated token count is within ~4 tokens of `MAX_NEW_TOKENS`
39
- - subtract **0.10** from the reward for that completion
40
- - floor at **0.0** (no negative rewards from clipping alone)
41
-
42
- This is intentionally small:
43
- - enough to discourage runaway / truncated outputs
44
- - not so large that the penalty on mildly clipped but otherwise strong samples dominates the training signal
45
-
46
- ### Diagnostics retained / extended
47
- - `thinking/nonempty_count`
48
- - `thinking/closed_tag_ratio`
49
- - `thinking/missing_ratio`
50
- - `thinking/unclosed_ratio`
51
- - `thinking/no_answer_after_close_ratio`
52
- - `thinking/clipped_and_unclosed_ratio`
53
- - `answer/nonempty_count`
54
- - `answer/questionmark_ratio`
55
- - `completion/clipped_ratio_local`
56
- - `reward/clip_penalty`
57
-
58
- ## Files
59
- - Training script: `scripts/train_grpo_v20.py`
60
- - Launcher: `run_grpo_v20.sh`
61
- - Planned output: `checkpoints/grpo-v22/`, `lora/grpo-v22/`
docs/GRPO_V23_PLAN.md DELETED
@@ -1,68 +0,0 @@
1
- # GRPO v23 Plan — reward_v13 as a Clean Reward Ablation from GRPO v21
2
-
3
- ## Goal
4
-
5
- Test whether `reward_v13` improves the policy **without changing the generation regime** from the current best run.
6
-
7
- This is intentionally a cleaner experiment than v22:
8
- - **same generation budget as GRPO v21**
9
- - **start from GRPO v21**
10
- - **change reward only**
11
-
12
- ## Why this design
13
-
14
- GRPO v21 is still the best checkpoint on thinking-enabled eval:
15
- - GRPO v21: **0.867 ± 0.231**
16
-
17
- GRPO v22 reduced clipping substantially but underperformed v21:
18
- - GRPO v22: **0.813 ± 0.314**
19
-
20
- So the next question is not "does a larger budget help?" — v22 already answered that imperfectly.
21
- The next question is:
22
-
23
- > Can a better reward improve on v21 while keeping the same generation regime?
24
-
25
- ## Config
26
-
27
- ### Start checkpoint
28
- - `/home/bobber/lex-ft/lora/grpo-v21`
29
-
30
- ### Reward
31
- - `reward_v13`
32
-
33
- ### Generation / sequence length
34
- - `MAX_NEW_TOKENS=1600`
35
- - `MAX_SEQ=3072`
36
-
37
- ### Other key settings
38
- - `CLIP_PENALTY=0.10`
39
- - `VLLM_GPU_MEM_UTIL=0.35`
40
- - `NUM_PROMPTS=1000`
41
- - `MAX_STEPS=200`
42
-
43
- ## What reward_v13 changes
44
-
45
- Relative to v12:
46
- - keeps `uses_guest` and `probing`
47
- - adds explicit penalties for:
48
- - meta spill (`the user is asking...`, etc.)
49
- - generic opener patterns with weak guest anchoring
50
- - obvious drift patterns
51
- - excessively long hidden thinking
52
- - does **not** reward longer thinking directly
53
-
54
- ## Success criterion
55
-
56
- Primary:
57
- - beat GRPO v21 on thinking-enabled functional eval
58
-
59
- Secondary:
60
- - reduce explicit meta-spill frequency
61
- - avoid the generic-short-question drift seen in v22
62
-
63
- ## Evaluation protocol
64
-
65
- After training, run the same thinking-enabled leaderboard:
66
- - `base`
67
- - `/home/bobber/lex-ft/lora/sft-lora-v2-native`
68
- - `/home/bobber/lex-ft/lora/grpo-v23`
docs/GRPO_V24_PLAN.md DELETED
@@ -1,33 +0,0 @@
1
- # GRPO v24 Plan — reward_v13, reset to LoRA v2 native start
2
-
3
- ## Motivation
4
-
5
- v23 showed that starting from grpo-v21 (already a strong policy) + stricter reward_v13
6
- compressed reward variance → GRPO had less signal to learn from → eval regressed to base.
7
-
8
- The v21 high score (0.867) came from starting at a *weaker* policy (LoRA v2 native) under
9
- a *simpler* reward (v12), giving wide exploration space with steep advantage gradients.
10
-
11
- Key insight:
12
- - v21 was v12's local optimum but is NOT guaranteed to be v13's best starting point
13
- - Resetting to LoRA v2 native restores exploration room under the new reward geometry
14
-
15
- ## Config
16
-
17
- - START_ADAPTER: `/home/bobber/lex-ft/lora/sft-lora-v2-native`
18
- - REWARD_MODULE: `reward_v13`
19
- - MAX_NEW_TOKENS: 1600
20
- - MAX_SEQ: 3072
21
- - CLIP_PENALTY: 0.10
22
- - NUM_PROMPTS: 1000
23
- - MAX_STEPS: 200
24
-
25
- ## Hypothesis
26
-
27
- reward_v13 will find a different (and hopefully better) local optimum when given a fresh
28
- exploration budget from a weaker starting point, rather than inheriting v21's policy which
29
- was already locally optimal under the less strict v12 reward.
30
-
31
- ## Success criterion
32
-
33
- Beat grpo-v21 (0.867) on thinking-enabled functional eval.
docs/GRPO_V3_POSTMORTEM.md DELETED
@@ -1,221 +0,0 @@
1
- # GRPO v3 Postmortem — Off-Policy RL on Hybrid Mamba Architecture
2
-
3
- > **Status:** ❌ FAILED — LoRA merged model generates gibberish despite positive training rewards
4
- > **Date:** 2026-03-22 to 2026-03-23
5
- > **Duration:** 9.72 hours (583 min), 125 steps
6
- > **wandb:** [`lex-interviewer-grpo-v3`](https://wandb.ai/bobber-cheng/lex-interviewer/runs/zm1khost)
7
-
8
- ---
9
-
10
- ## Architecture
11
-
12
- ```
13
- ┌──────────────┐ completions ┌──────────────┐
14
- │ llama.cpp │ ──────────────────>│ Reward v3 │──── rewards
15
- │ Q4_K_M base │ │ (heuristic) │
16
- │ (no LoRA) │ └──────────────┘
17
- └──────────────┘ │
18
- ▲ │
19
- │ generation ▼
20
- │ (off-policy) ┌──────────────┐
21
- │ │ GRPO Loss │
22
- │ │ advantages │
23
- │ └──────┬───────┘
24
- │ │
25
- │ ▼
26
- │ ┌──────────────┐
27
- │ │ HF Model │
28
- │ │ + LoRA │──── gradient update
29
- │ │ (forward │
30
- │ │ pass only) │
31
- └───────── NOT connected ───┘──────────────┘
32
- ```
33
-
34
- **The fatal flaw:** llama.cpp generates completions from the **base model** (no LoRA). The reward scores those base-model completions. But gradient updates go to the **LoRA model**. The LoRA never generates its own text — it only assigns log-probabilities to text produced by a different model.
35
-
36
- ---
37
-
38
- ## Training Config
39
-
40
- | Parameter | Value |
41
- |-----------|-------|
42
- | Base model | Nemotron-3-Nano-4B (hybrid: 38 Mamba-2 + 4 Attention) |
43
- | Generation model | Q4_K_M GGUF via llama.cpp (2.9 GB) |
44
- | Training model | HF + LoRA (rank 32, 0.38% trainable params) |
45
- | LoRA targets | All layers: q/k/v/o_proj (attention), in/out_proj (Mamba), up/down_proj (MLP) |
46
- | NUM_PROMPTS | 500 |
47
- | NUM_GENERATIONS | 8 per prompt |
48
- | MAX_COMPLETION_TOKENS | 800 |
49
- | BATCH_SIZE | 4 |
50
- | GRAD_ACCUM | 4 (effective batch = 16) |
51
- | Learning rate | 5e-5 |
52
- | Beta (KL coefficient) | 0.04 |
53
- | Thinking mode | Enabled |
54
-
55
- ---
56
-
57
- ## Results
58
-
59
- ### Training Metrics (Misleading)
60
- | Steps | Avg Reward | % Positive |
61
- |-------|-----------|------------|
62
- | 26–35 | -0.201 | 0% (cold start) |
63
- | 36–45 | +0.115 | 90% |
64
- | 46–55 | +0.179 | 90% |
65
- | 56–65 | +0.174 | 100% |
66
- | 66–75 | +0.220 | 100% |
67
- | 100–125 | +0.087 to +0.359 | mixed |
68
-
69
- Training rewards looked healthy. **But these rewards measured the base model's generation quality, not the LoRA's.**
70
-
71
- ### Actual Generation Quality (Ground Truth)
72
-
73
- **Base model (Q4_K_M, llama.cpp):**
74
- > *"When you say 'failed three times,' what did it feel like in your body? Was it the weight of the third crash?"*
75
-
76
- **LoRA merged model (step 50, GGUF):**
77
- > `"I think and to "I" "I" "you are, "I" "i ", I, "i. "i" to "i"`
78
-
79
- The LoRA produces complete gibberish. Every checkpoint tested (step 25, 50) showed the same pattern.
80
-
81
- ---
82
-
83
- ## Root Cause Analysis: 6 Critical Gaps
84
-
85
- ### 🔴 Gap 1: Off-Policy Generation (Fatal)
86
-
87
- llama.cpp generates from the **base model** — no LoRA weights applied. The LoRA model is told "make this text more likely" for text it would never produce itself. Over 125 steps, the LoRA drifts into a completely different distribution.
88
-
89
- **In standard GRPO:** The same model that generates completions is the one that gets updated. Policy improvement is on-policy — the model improves at generating text it actually produces.
90
-
91
- **In our setup:** The generator (llama.cpp base) and the learner (HF + LoRA) are two completely different models. The LoRA has no way to self-correct because it never sees what its own generations look like.
92
-
93
- ### 🔴 Gap 2: No Reference Policy / KL Divergence (Fatal)
94
-
95
- Real GRPO computes `KL(π_θ || π_ref)` — the divergence between the current policy and a frozen reference copy. This acts as an anchor, preventing the policy from drifting too far.
96
-
97
- The script uses `-β * mean(log_probs)` as a "KL proxy." This is just a confidence regularizer — it penalizes the model for being too certain about anything, but it does NOT measure how far the LoRA has drifted from the base. Without a proper reference, there's no upper bound on divergence.
98
-
99
- **What should have been done:** Store the initial LoRA log-probs (or base model log-probs) as a frozen reference and compute `log_probs_current - log_probs_ref` as the KL penalty.
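The difference between the confidence proxy and a reference-anchored penalty can be sketched in plain Python. This is an illustration only, not code from the training script: `confidence_proxy`, `kl_simple`, and `kl_k3` are hypothetical helper names, the log-prob values are made up, and both KL estimators assume the scored tokens were sampled from the current policy. `kl_k3` is the non-negative estimator often used in GRPO implementations.

```python
import math

def confidence_proxy(cur_logprobs, beta=0.04):
    """The v3-style 'KL proxy': depends only on the current policy's log-probs."""
    return -beta * sum(cur_logprobs) / len(cur_logprobs)

def kl_simple(cur_logprobs, ref_logprobs):
    """Mean of (log pi_theta - log pi_ref) over the sampled tokens."""
    return sum(c - r for c, r in zip(cur_logprobs, ref_logprobs)) / len(cur_logprobs)

def kl_k3(cur_logprobs, ref_logprobs):
    """Non-negative estimator: ratio - log(ratio) - 1, with ratio = pi_ref / pi_theta."""
    total = 0.0
    for c, r in zip(cur_logprobs, ref_logprobs):
        log_ratio = r - c
        total += math.exp(log_ratio) - log_ratio - 1.0
    return total / len(cur_logprobs)

cur = [-1.2, -0.8, -2.0]        # current policy (hypothetical values)
ref_near = [-1.2, -0.8, -2.0]   # policy has not moved from the reference
ref_far = [-4.0, -3.5, -6.0]    # policy has drifted far from the reference

# The proxy never sees the reference, so it cannot tell the two cases apart.
print("proxy:", confidence_proxy(cur))
print("simple KL:", kl_simple(cur, ref_near), kl_simple(cur, ref_far))
print("k3 KL:", kl_k3(cur, ref_near), kl_k3(cur, ref_far))
```

Because `confidence_proxy` takes no reference argument at all, no amount of drift can raise it, which is the sense in which it provides no upper bound on divergence.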
100
-
101
- ### 🔴 Gap 3: Token Truncation to 512 (Severe)
102
-
103
- ```python
104
- inputs = tokenizer(full_text, return_tensors='pt', truncation=True, max_length=512)
105
- prompt_ids = tokenizer(pt, return_tensors='pt', truncation=True, max_length=384)
106
- ```
107
-
108
- Generation produces up to 800 tokens, but the log-prob forward pass truncates to 512 total (with prompt eating ~384). That leaves ~128 tokens of completion visible to the gradient — but the reward was computed on the full 800-token completion.
109
-
110
- **Effect:** The LoRA optimizes the first ~128 tokens while the reward evaluates the full response. The end of each completion is a gradient-free zone. This creates a systematic mismatch between what's rewarded and what's learned.
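The arithmetic behind the truncation gap can be checked with a small sketch. It assumes truncation drops tokens from the right end of `prompt + completion` and that the prompt uses its full 384-token budget; the function names are hypothetical and not part of the training script.

```python
def visible_completion_tokens(prompt_len, completion_len, max_length):
    """Completion tokens that survive right-truncation of prompt + completion."""
    kept_total = min(prompt_len + completion_len, max_length)
    return max(0, kept_total - prompt_len)

def check_no_truncation(prompt_len, completion_len, max_length):
    """Illustrative guard: fail loudly instead of silently dropping rewarded tokens."""
    if prompt_len + completion_len > max_length:
        raise ValueError(
            f"{prompt_len + completion_len} tokens exceed max_length={max_length}; "
            "the reward would cover tokens the gradient never sees"
        )

prompt_len, completion_len = 384, 800
seen = visible_completion_tokens(prompt_len, completion_len, max_length=512)
print(f"{seen} of {completion_len} completion tokens "
      f"({seen / completion_len:.0%}) receive gradient")
```

A guard such as `check_no_truncation` at the top of the log-prob pass would turn this silent mismatch into an immediate error.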
111
-
112
- ### 🟡 Gap 4: Architecture Mismatch — Mamba Layers Unchanged (Fundamental)
113
-
114
- NemotronH has 42 layers: 38 Mamba-2 + 4 Attention. LoRA can add adapters to linear projections (`in_proj`, `out_proj` in Mamba; `q/k/v/o_proj` in Attention), but:
115
-
116
- - Mamba's core behavior comes from its **SSM recurrence** (`A_log`, `D`, `conv1d`, `dt_bias`) — these are NOT touched by LoRA
117
- - The `in_proj`/`out_proj` adapters on Mamba layers only change the input/output projections, not the state-space dynamics
118
- - The 4 attention layers are the only layers where LoRA meaningfully alters the computation
119
-
120
- So LoRA modifies the periphery of 38 layers and the core of 4 layers. In generation, the 38 untouched Mamba layers still dominate the sequence modeling. The 4 LoRA-modified attention layers can't override what the Mamba layers decide.
121
-
122
- ### 🟡 Gap 5: Token-Level vs Sequence-Level Credit Assignment
123
-
124
- The reward function scores the complete visible response (sequence-level: "is this a good Lex question?"). But the loss distributes this reward equally across all tokens:
125
-
126
- ```python
127
- token_lp = torch.gather(log_probs, 1, completion_ids.unsqueeze(1)).squeeze(1)
128
- log_probs_list.append(token_lp.mean()) # <-- equal weight per token
129
- ```
130
-
131
- In a +0.7 completion, every token gets the same +0.7 advantage — including "The", "a", filler words, punctuation. No credit assignment to the tokens that actually made the response good (the question itself, specific word choices).
132
-
133
- This is standard in sequence-level GRPO, but combined with the other gaps, it means the gradient signal is diffuse and noisy.
134
-
135
- ### 🟡 Gap 6: Thinking Content in Training, Not in Reward
136
-
137
- ```python
138
- completion_texts.append(c['raw']) # includes <think>...</think>
139
- ```
140
-
141
- The LoRA computes log-probs over the full completion including `<think>` tags. But the reward function calls `strip_think()` and scores only the visible content. This means:
142
-
143
- - 400+ tokens of thinking content get gradient signal proportional to the visible-content reward
144
- - The LoRA is optimizing thinking patterns based on a reward that doesn't evaluate thinking
145
- - A completion with brilliant thinking but weak visible output would train the LoRA to produce that exact thinking — even though the reward says it's bad
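The mismatch between what is scored and what is trained can be made concrete with a stand-in for `strip_think()`. The real function lives in the reward script and is not shown in this document, so the regex version below is an assumption, and the sample completion is invented.

```python
import re

# Assumed tag format: a single <think>...</think> span, possibly multi-line.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_think(raw):
    """Hypothetical stand-in for the reward script's strip_think()."""
    return THINK_RE.sub("", raw).strip()

raw = "<think>" + "plan the question " * 100 + "</think>What did failure teach you?"
visible = strip_think(raw)

n_raw = len(raw.split())          # roughly what the log-prob pass covers
n_visible = len(visible.split())  # what the reward actually reads

print(visible)
print(f"reward reads {n_visible} of {n_raw} whitespace-separated words")
```

One way to close this gap would be to mask the thinking span out of the loss, or to score it explicitly; neither was done in v3.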
146
-
147
- ---
148
-
149
- ## Compound Effect
150
-
151
- These gaps compound multiplicatively:
152
-
153
- 1. **Gaps 1+2:** Off-policy generation + no KL anchor = unconstrained divergence. The LoRA wanders freely.
154
- 2. **Gap 3:** Truncation means the gradient operates on different tokens than the reward evaluates.
155
- 3. **Gap 6:** Thinking tokens dilute the gradient signal — most of the sequence is thinking, but the reward ignores it.
156
- 4. **Gap 4:** Even if the gradients were correct, LoRA can barely influence generation on this architecture.
157
- 5. **Gap 5:** Even the correct tokens get uniform gradient weight, with no credit assignment.
158
-
159
- The result: gradients that are wrong (off-policy), on the wrong tokens (truncated), for the wrong content (thinking), with the wrong weights (uniform), updating the wrong layers (LoRA on Mamba periphery). Given all this, gibberish was the expected outcome: it was already present at the earliest checkpoint tested (step 25).
160
-
161
- ---
162
-
163
- ## GGUF Converter Fix (Side Quest)
164
-
165
- During evaluation, we discovered a bug in llama.cpp's `convert_hf_to_gguf.py` for NemotronH non-MoE models:
166
-
167
- **Root cause:** `NemotronHConfig` in HuggingFace transformers defines MoE default values (`num_experts_per_tok=2`, `moe_intermediate_size=7688`). The converter uses `AutoConfig.from_pretrained()` which loads these defaults even for non-MoE models. The converter then detects `"num_experts_per_tok" in hparams` and sets `architecture = nemotron_h_moe` instead of `nemotron_h`.
168
-
169
- **Fix applied:**
170
- 1. Override MoE defaults in `config.json`: set `num_experts_per_tok=0`
171
- 2. Patch converter to check `hparams.get("num_experts_per_tok", 0) > 0` instead of `"num_experts_per_tok" in hparams`
172
- 3. Guard the MoE metadata section in `set_gguf_parameters` with `if "num_experts_per_tok" in self.hparams:`
173
-
174
- **Files modified:** `/home/bobber/llama.cpp/convert_hf_to_gguf.py` (backup at `.bak`)
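The two detection checks differ only in whether they look at the key's presence or its value. The sketch below uses hypothetical function names and plain dicts standing in for the converter's `hparams`; it is not the converter's actual code.

```python
def is_moe_by_presence(hparams):
    """Original check: true whenever the key exists, even if only as a default."""
    return "num_experts_per_tok" in hparams

def is_moe_by_value(hparams):
    """Patched check: true only when experts are actually configured."""
    return hparams.get("num_experts_per_tok", 0) > 0

# A dense model whose config picked up the MoE defaults described above.
dense_with_defaults = {"num_experts_per_tok": 2, "moe_intermediate_size": 7688}
# The same model after overriding the default in config.json.
dense_overridden = {"num_experts_per_tok": 0, "moe_intermediate_size": 7688}

for name, hp in [("defaults", dense_with_defaults), ("overridden", dense_overridden)]:
    print(name, is_moe_by_presence(hp), is_moe_by_value(hp))
```

Note that the value check alone still misfires on the leaked default of 2, which is why the `config.json` override and the converter patch are both needed.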
175
-
176
- ---
177
-
178
- ## What Would Fix This
179
-
180
- ### Option A: On-Policy GRPO
181
- Periodically merge LoRA → GGUF → use merged model for generation. Expensive (merge + convert every N steps) but fixes Gap 1.
182
-
183
- ### Option B: SFT on Curated Completions
184
- Use the ~1000 high-reward completions from training as supervised data. No off-policy gap, no reward mismatch. The base model generated excellent Lex-style questions — just teach the LoRA to produce them via SFT.
185
-
186
- ### Option C: Ship the Base Model
187
- The base Nemotron 4B already scores **4.35/5** — better than GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro on this task. System prompt alone achieves top-tier interviewer behavior. Fine-tuning may not be necessary.
188
-
189
- ---
190
-
191
- ## Files
192
-
193
- | File | Purpose |
194
- |------|---------|
195
- | `scripts/train_grpo_v3.py` | Training script (the one with all the gaps) |
196
- | `scripts/run_grpo_v3.sh` | Detached wrapper (setsid+nohup) |
197
- | `scripts/reward_v3.py` | Reward function (heuristic scoring) |
198
- | `logs/grpo_v3_completions.jsonl` | All logged completions + rewards |
199
- | `logs/run_grpo_v3.log` | Training log |
200
- | `models/lex-interviewer-grpo-lora-v3` | Final LoRA adapter (125 steps) |
201
- | `models/lex-interviewer-grpo-lora-v3-step{25,50,75,100,125}` | Checkpoints |
202
-
203
- ---
204
-
205
- ## Lessons for Future RL Projects
206
-
207
- 1. **Off-policy RL requires explicit policy constraints.** Without proper KL divergence from a reference, the policy will diverge. The "KL proxy" shortcut doesn't work.
208
-
209
- 2. **Validate generation quality mid-training.** We didn't merge and test the LoRA until step 50. If we had tested at step 10, we'd have caught the gibberish 100 steps earlier.
210
-
211
- 3. **Training rewards ≠ model quality** when the generator and learner are different models. Always eval the actual learner, not just the reward signal.
212
-
213
- 4. **Hybrid architectures (Mamba + Attention) are hostile to LoRA-based RL.** The SSM dynamics that control generation are not reachable by LoRA adapters. Full parameter updates or architecture-aware training is needed.
214
-
215
- 5. **Token truncation in log-prob computation is a silent killer.** If max_length in training doesn't match max_tokens in generation, the gradient literally operates on the wrong tokens.
216
-
217
- 6. **Thinking mode adds complexity to RL.** When the reward ignores thinking content but the gradient doesn't, you're training the model to optimize a signal it can't see.
218
-
219
- ---
220
-
221
- *Created: 2026-03-23*
 
docs/GRPO_V4_DESIGN.md DELETED
@@ -1,423 +0,0 @@
1
- # GRPO v4 Design — llama.cpp Generation + PyTorch LoRA Training
2
-
3
- > Date: 2026-03-24
4
- > Status: Design (paper version)
5
-
6
- ## Overview
7
-
8
- On-policy GRPO using llama.cpp for generation and PyTorch for LoRA training. This avoids the SM 12.1 toolchain gaps that blocked NeMo RL, vLLM-based approaches, and custom CUDA kernels on GB10.
9
-
10
- ## Why This Works on GB10
11
-
12
- | Component | Framework | SM 12.1 Status |
13
- |-----------|-----------|----------------|
14
- | Generation | llama.cpp | ✅ Native support |
15
- | Log-probs | llama.cpp | ✅ `--logprobs` flag |
16
- | Model loading | llama.cpp | ✅ GGUF + LoRA adapter |
17
- | Training forward | PyTorch (torch_forward) | ✅ Pure PyTorch, no custom CUDA |
18
- | Training backward | PyTorch autograd | ✅ Standard |
19
- | LoRA conversion | Python (tensor I/O) | ✅ No GPU needed |
20
-
21
- ## Architecture
22
-
23
- ```
24
- ┌─────────────────────────────────────────────────────┐
25
- │ GRPO Training Loop │
26
- │ │
27
- │ ┌──────────────┐ ┌──────────────┐ │
28
- │ │ llama.cpp │ │ PyTorch │ │
29
- │ │ server │ │ Training │ │
30
- │ │ │ │ │ │
31
- │ │ base.gguf │ │ HF model │ │
32
- │ │ + lora.gguf │ │ + LoRA │ │
33
- │ │ │ │ │ │
34
- │ │ Generate │ │ Forward │ │
35
- │ │ completions │───▶│ Compute loss │ │
36
- │ │ + log-probs │ │ Backward │ │
37
- │ │ │ │ Update LoRA │ │
38
- │ │ │◀───│ Write GGUF │ │
39
- │ │ Hot-swap │ │ │ │
40
- │ │ /lora-adapt │ │ │ │
41
- │ └──────────────┘ └──────────────┘ │
42
- │ GPU GPU │
43
- │ (unified memory) (unified memory) │
44
- └─────────────────────────────────────────────────────┘
45
- ```
46
-
47
- ## Training Loop (Pseudocode)
48
-
49
- ```python
50
- # === INITIALIZATION ===
51
-
52
- # 1. Start llama.cpp server with base model + initial LoRA (identity)
53
- server = start_llama_server(
54
- model="models/nemotron-f16.gguf", # BF16 GGUF — matches PyTorch training precision
55
- lora="lora/current.gguf", # starts as identity (zeros)
56
- port=8080,
57
- ctx_size=1024,
58
- n_gpu_layers=-1, # all layers on GPU
59
- )
60
-
61
- # 2. Load HF model for training (torch_forward, no CUDA kernels)
62
- hf_model = AutoModelForCausalLM.from_pretrained(
63
- "models/NVIDIA-Nemotron-3-Nano-4B",
64
- torch_dtype=torch.bfloat16,
65
- trust_remote_code=True,
66
- device_map="cuda",
67
- )
68
-
69
- # 3. Apply LoRA to HF model
70
- lora_model = apply_lora(hf_model, rank=64, alpha=256, target="all_linear")
71
-
72
- # 4. Load reference model log-probs (frozen copy for KL penalty)
73
- # Option A: Use llama.cpp with base model only (no LoRA) for reference
74
- # Option B: Cache reference log-probs per prompt (cheaper)
75
-
76
- # 5. Reward function
77
- def reward_fn(prompt, completion):
78
- """Score interviewer quality: question relevance, brevity, not lecturing."""
79
- score = 0.0
80
- # ... (custom scoring logic)
81
- return score
82
-
83
- # 6. Optimizer
84
- optimizer = torch.optim.AdamW(lora_model.lora_parameters(), lr=5e-6)
85
-
86
-
87
- # === TRAINING LOOP ===
88
-
89
- for step in range(num_steps):
90
- log.info(f"=== Step {step} ===")
91
-
92
- # ── Phase 1: Sample prompts ──
93
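-     prompts = sample_prompts(dataset, batch_size=num_prompts)
-     # One prompt entry per completion, in the same prompt-major order as Phase 2
-     repeated_prompts = [p for p in prompts for _ in range(num_generations_per_prompt)]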
94
- log.info(f"Sampled {len(prompts)} prompts")
95
-
96
- # ── Phase 2: Generate completions (llama.cpp, on-policy) ──
97
- completions = []
98
- gen_logprobs = []
99
- for prompt in prompts:
100
- for _ in range(num_generations_per_prompt):
101
- result = llama_generate(
102
- server,
103
- prompt=prompt,
104
- max_tokens=800,
105
- temperature=1.0,
106
- logprobs=True,
107
- )
108
- completions.append(result.text)
109
- gen_logprobs.append(result.logprobs)
110
-
111
- log.info(f"Generated {len(completions)} completions, "
112
- f"avg length: {mean_tokens(completions)}")
113
-
114
- # ── Phase 3: Compute rewards ──
115
- rewards = [reward_fn(p, c) for p, c in zip(repeated_prompts, completions)]
116
- log.info(f"Rewards: mean={mean(rewards):.3f}, std={std(rewards):.3f}")
117
-
118
- # ── Phase 4: Compute advantages (GRPO) ──
119
- # Group by prompt, normalize within group
120
- advantages = compute_grpo_advantages(
121
- rewards,
122
- num_prompts=num_prompts,
123
- num_generations=num_generations_per_prompt,
124
- )
125
-
126
- # ── Phase 5: Compute training log-probs (PyTorch, current policy) ──
127
- train_logprobs = []
128
- for prompt, completion in zip(repeated_prompts, completions):
129
- tokens = tokenizer.encode(prompt + completion)
130
-         # No torch.no_grad() here: Phase 7 backpropagates through these log-probs
131
-         logits = lora_model(tokens)
132
-         lp = compute_token_logprobs(logits, tokens)
133
- train_logprobs.append(lp)
134
-
135
- # ── Phase 6: Compute reference log-probs (base model, no LoRA) ──
136
- ref_logprobs = []
137
- for prompt, completion in zip(repeated_prompts, completions):
138
- result = llama_logprobs_only(
139
- server_ref, # separate server or base model without LoRA
140
- prompt=prompt,
141
- completion=completion,
142
- )
143
- ref_logprobs.append(result)
144
-
145
- # ── Phase 7: GRPO policy gradient loss ──
146
- optimizer.zero_grad()
147
-
148
- total_loss = 0.0
149
- for i in range(len(completions)):
150
- # Policy ratio
151
- ratio = torch.exp(train_logprobs[i] - gen_logprobs[i].detach())
152
-
153
- # Clipped ratio
154
- clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
155
-
156
- # Policy loss (PPO-style with GRPO advantages)
157
- pg_loss = -torch.min(ratio * advantages[i], clipped * advantages[i])
158
-
159
- # KL penalty (against reference policy)
160
- kl = train_logprobs[i] - ref_logprobs[i]
161
- kl_loss = kl_coef * kl
162
-
163
- loss = (pg_loss + kl_loss).mean()
164
- total_loss += loss
165
-
166
- total_loss = total_loss / len(completions)
167
- total_loss.backward()
168
-
169
- grad_norm = torch.nn.utils.clip_grad_norm_(
170
- lora_model.lora_parameters(), max_norm=1.0
171
- )
172
- optimizer.step()
173
-
174
- log.info(f"Loss: {total_loss.item():.4f}, Grad norm: {grad_norm:.4f}")
175
-
176
- # ── Phase 8: Sync LoRA to llama.cpp ──
177
- write_lora_gguf(lora_model, "lora/current.gguf")
178
- hot_swap_lora(server, "lora/current.gguf")
179
-
180
- log.info(f"LoRA synced to llama.cpp server")
181
-
182
- # ── Phase 9: Periodic eval ──
183
- if step % eval_every == 0:
184
- eval_score = run_eval(server, eval_prompts)
185
- log.info(f"Eval score: {eval_score:.2f}/5")
186
- ```
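The Phase 7 loss can be checked numerically with a scalar version that needs no framework. This is a sketch under stated assumptions: `clip_eps=0.2` is a placeholder (the design does not give a value), `kl_coef=0.04` is borrowed from the v3 beta, and the function names are hypothetical. It only computes loss values; in real training the current-policy log-probs must come from a forward pass with autograd enabled.

```python
import math

def grpo_token_loss(train_lp, gen_lp, ref_lp, advantage, clip_eps=0.2, kl_coef=0.04):
    """Clipped surrogate plus KL penalty for one token."""
    ratio = math.exp(train_lp - gen_lp)                        # pi_new / pi_old
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)  # clamp the ratio
    pg_loss = -min(ratio * advantage, clipped * advantage)
    kl_loss = kl_coef * (train_lp - ref_lp)
    return pg_loss + kl_loss

def grpo_completion_loss(train_lps, gen_lps, ref_lps, advantage, **kwargs):
    """Mean token loss over one completion; the advantage is shared by all tokens."""
    losses = [
        grpo_token_loss(t, g, r, advantage, **kwargs)
        for t, g, r in zip(train_lps, gen_lps, ref_lps)
    ]
    return sum(losses) / len(losses)

lps = [-1.0, -2.0, -0.5]
# Exactly on-policy and at the reference: ratio = 1, KL = 0, so loss = -advantage.
print(grpo_completion_loss(lps, lps, lps, advantage=1.0))
# Zero advantage gives zero loss, so the step carries no learning signal.
print(grpo_completion_loss(lps, lps, lps, advantage=0.0))
# A token whose probability rose sharply is clipped at 1 + clip_eps.
print(grpo_token_loss(train_lp=-1.0, gen_lp=-2.0, ref_lp=-2.0, advantage=1.0))
```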
187
-
188
- ## Key Design Decisions
189
-
190
- ### 1. Log-prob Source Consistency
191
-
192
- **Problem:** GRPO needs log-probs from the *generating* policy. If llama.cpp generates and PyTorch computes training log-probs, they must be consistent enough for the ratio `π_new/π_old` to be meaningful.
193
-
194
- **Approach:**
195
- - Generation log-probs: from llama.cpp (during generation, free)
196
- - Training log-probs: from PyTorch (needs gradient, must use PyTorch)
197
- - Reference log-probs: from llama.cpp base model (no LoRA)
198
-
199
- **Resolution:** Use BF16 GGUF for generation (not Q8) to remove the weight-precision mismatch. Both llama.cpp and PyTorch then operate on identical BF16 weights, so log-probs should be numerically close. Minor differences may still arise from different attention implementations (llama.cpp's custom kernels vs PyTorch's eager), but these should be small enough for stable training.
200
-
201
- **Validation:** Gap test #1 measures the actual divergence between llama.cpp BF16 and PyTorch BF16 log-probs.
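A divergence measurement of this kind could look like the sketch below. It assumes both backends return one log-prob per token for the same token sequence; `logprob_divergence` and the sample values are hypothetical, and any pass/fail threshold would have to be chosen empirically.

```python
import math

def logprob_divergence(lp_a, lp_b):
    """Compare per-token log-probs of the same tokens from two backends."""
    if len(lp_a) != len(lp_b):
        raise ValueError("token sequences differ in length; tokenization may not match")
    diffs = [a - b for a, b in zip(lp_a, lp_b)]
    abs_diffs = [abs(d) for d in diffs]
    return {
        "mean_abs_diff": sum(abs_diffs) / len(abs_diffs),
        "max_abs_diff": max(abs_diffs),
        # How far the implied policy ratio exp(a - b) strays from 1.0
        "max_ratio_error": max(abs(math.exp(d) - 1.0) for d in diffs),
    }

llama_cpp_lp = [-1.00, -2.00, -0.50]   # made-up values
pytorch_lp = [-1.10, -2.00, -0.52]     # made-up values
print(logprob_divergence(llama_cpp_lp, pytorch_lp))
```

The `max_ratio_error` field is the most relevant one for GRPO, since the clipped surrogate only tolerates ratios within the clip range around 1.0.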
202
-
203
- ### 2. Reference Policy
204
-
205
- **Options:**
206
- - **Option A (simple):** Start a second llama.cpp server with just the base model (no LoRA). Compute reference log-probs via API. Costs ~8 GB extra VRAM.
207
- - **Option B (cheaper):** Use PyTorch base model (before LoRA) for reference. Run once at start, cache reference log-probs per training sample.
208
- - **Option C (NeMo RL approach):** Don't use a separate reference model. Compute KL from the ratio of current vs generation-time log-probs.
209
-
210
- Recommend **Option A** — simplest, 8 GB is affordable on 130 GB.
211
-
212
- ### 3. Generation Speed vs On-Policy Correctness
213
-
214
- llama.cpp BF16 generates at ~30 tok/s (slower than Q8's ~60 tok/s due to larger model). For a batch of 16 prompts × 8 generations × 800 tokens:
215
- - Total tokens: 102,400
216
- - Time: ~57 minutes per step
217
-
218
- This is slow but **correct** (on-policy, precision-matched). The GRPO v3 failure was fundamentally about off-policy, not speed.
219
-
220
- **Speedup options (if needed later):**
221
- - Reduce `num_generations_per_prompt` from 8 to 4 (~28 min/step)
222
- - Reduce `max_tokens` from 800 to 400 (~28 min/step)
223
- - Batch multiple prompts via llama.cpp server (concurrent requests)
224
- - Use Q8 if log-prob divergence test shows it's acceptable (2x faster)
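The step-time figures above follow from simple arithmetic, reproduced here as a sketch. It assumes every completion runs to `max_tokens` and that requests are served sequentially at the quoted throughput, so it is an upper bound; `generation_minutes` is a hypothetical helper.

```python
def generation_minutes(num_prompts, num_generations, max_tokens, tokens_per_second):
    """Worst-case generation time for one step, in minutes."""
    total_tokens = num_prompts * num_generations * max_tokens
    return total_tokens, total_tokens / tokens_per_second / 60.0

scenarios = {
    "baseline (BF16)": (16, 8, 800, 30),
    "4 generations": (16, 4, 800, 30),
    "400 max tokens": (16, 8, 400, 30),
    "Q8 at 60 tok/s": (16, 8, 800, 60),
}
for name, args in scenarios.items():
    tokens, minutes = generation_minutes(*args)
    print(f"{name}: {tokens} tokens, about {minutes:.0f} min/step")
```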
225
-
226
- ### 4. LoRA Architecture
227
-
228
- Based on NemotronH architecture (42 layers: 38 Mamba-2 + 4 Attention):
229
-
230
- ```yaml
231
- LoRA config:
232
- rank: 64
233
- alpha: 256 # scaling = 4x
234
- target_modules: all linear layers
235
- # On GB10, torch_forward gives gradients through ALL layers
236
- # Unlike CUDA kernel path, no need to exclude out_proj
237
- exclude_modules: []
238
- ```
239
-
240
- **Memory estimate:**
241
- - LoRA params (rank 64, all linear): ~100 MB
242
- - Optimizer states (AdamW): ~200 MB
243
- - HF model (BF16): ~8 GB
244
- - llama.cpp base (BF16 GGUF): ~8 GB
245
- - llama.cpp reference (BF16 GGUF): ~8 GB (if Option A)
246
- - Activations/gradients: ~5-10 GB
247
- - **Total: ~35-40 GB | Free: ~90 GB** ← very comfortable
248
-
249
- ### 5. Reward Function (Lex Interviewer)
250
-
251
- ```python
252
- def interviewer_reward(prompt: str, completion: str) -> float:
253
- """
254
- Score how well the completion acts as an interviewer.
255
-
256
- Criteria:
257
- 1. Asks a question (not lectures) → binary check
258
- 2. Question relevance to conversation → semantic similarity
259
- 3. Brevity (good interviewers are concise) → length penalty
260
- 4. Follow-up quality (builds on previous answer) → coherence
261
- 5. Not repetitive → novelty check
262
- """
263
- score = 0.0
264
-
265
- # Must contain a question
266
- if "?" in completion:
267
- score += 1.0
268
-
269
- # Brevity bonus (under 100 words is good)
270
- words = len(completion.split())
271
- if words < 50:
272
- score += 1.0
273
- elif words < 100:
274
- score += 0.5
275
- elif words > 200:
276
- score -= 1.0 # lecturing penalty
277
-
278
-     # Not starting with "用户问" ("The user asks") or similar template patterns
279
- if not completion.strip().startswith(("用户问", "User asks", "The user")):
280
- score += 1.0
281
-
282
- # Ends with a question (interviewer should be prompting, not concluding)
283
- sentences = completion.strip().split(".")
284
- if sentences[-1].strip().endswith("?"):
285
- score += 1.0
286
-
287
- # Quality check via heuristic (or LLM judge, more expensive)
288
- score += heuristic_quality_score(prompt, completion) # 0-1
289
-
290
-     return score  # range: -1 to 5 (the lecturing penalty can push it below 0)
291
- ```
292
-
293
- ### 6. Logging & Tracing
294
-
295
- Every step logs:
296
- ```
297
- Step | Gen Time | Train Time | Sync Time | Loss | Grad Norm | Reward (mean/std) | Eval Score
298
- 0 | 28m | 45s | 2s | 2.31 | 0.42 | 2.1 / 0.8 | 4.35
299
- 1 | 28m | 45s | 2s | 2.15 | 0.38 | 2.4 / 0.7 | --
300
- ...
301
- ```
302
-
303
- W&B integration for:
304
- - Reward distribution per step
305
- - Loss curves
306
- - Generation samples (text)
307
- - Log-prob divergence (llama.cpp vs PyTorch)
308
- - LoRA weight norms per layer
309
- - Eval score trend
310
-
311
- ## Gaps to Fill Before Implementation
312
-
313
- 1. **Log-prob consistency test:** Generate with llama.cpp BF16 GGUF, compute log-probs in both llama.cpp and PyTorch BF16. Measure divergence. Both use identical precision — divergence should be minimal.
314
- 2. **LoRA ↔ GGUF conversion:** Write `write_lora_gguf()` function. Verify llama.cpp loads the adapter and output changes.
315
- 3. **HF model loading with torch_forward:** Confirm model loads and trains without causal_conv1d (it should fall back, but need to verify loss is reasonable and gradients flow through all 42 layers).
316
- 4. **Reward function tuning:** The heuristic reward above is a starting point. May need LLM-as-judge for quality scoring.
317
-
318
- ## Comparison with Previous Attempts
319
-
320
- | | GRPO v3 (failed) | NeMo RL (blocked) | GRPO v4 (this) |
321
- |---|---|---|---|
322
- | On-policy | ❌ llama.cpp ≠ HF | ✅ vLLM = HF | ✅ llama.cpp + LoRA = HF + LoRA |
323
- | KL reference | ❌ None | ✅ Built-in | ✅ Base model via llama.cpp |
324
- | LoRA coverage | All layers targeted; only 4/42 (attention) effective | All layers | All layers (torch_forward) |
325
- | SM 12.1 | Partial | ❌ Blocked | ✅ All components work |
326
- | Gen speed | ~60 tok/s (Q8) | ~3.5 tok/s (vLLM) | ~30 tok/s (BF16 GGUF) |
327
- | Complexity | Custom script | Full framework | Custom script (simpler) |
328
-
329
- ## Design Q&A
330
-
331
- ### Q: Can llama.cpp read HF model + LoRA directly without GGUF conversion?
332
-
333
- **No.** llama.cpp only reads GGUF format. But the base model is already converted (once, never changes). Only the LoRA adapter needs converting each step — ~100 MB of tensors, takes seconds. No restart needed (hot-swap via `/lora-adapters` endpoint or file overwrite).
334
-
335
- ### Q: llama.cpp uses Q8 — should PyTorch training use Q8 too?
336
-
337
- **No — and this is why we use BF16 GGUF for generation.** Training requires full precision (BF16) for gradient computation. You can't backprop through quantized weights. If we generated with Q8 but trained with BF16, we'd have a base model precision mismatch: the Q8 and BF16 models produce different log-probs for the same input, making the policy ratio `π_new/π_old` noisy. This is a softer version of the same off-policy problem that killed v3.
338
-
339
- **Fix:** Use **BF16 GGUF** (`nemotron-f16.gguf`, ~8 GB) for generation instead of Q8 (~4.5 GB). We have 130 GB unified memory — 8 GB is nothing. Generation is slower (~30 tok/s vs 60 tok/s) but eliminates the precision mismatch entirely. Both llama.cpp generation and PyTorch training see identical BF16 weights + LoRA.
340
-
341
- ### Q: llama.cpp supports LoRA fine-tuning (`llama-finetune`). Why do we need PyTorch?
342
-
343
- Two reasons:
344
- 1. **llama-finetune crashes on NemotronH** — buffer size computation bug for Mamba-2 architecture (tested, produces near-max uint64 allocation). This is a llama.cpp bug, not fundamental.
345
- 2. **llama-finetune only does SFT, not GRPO.** There's no RL algorithm in llama.cpp. GRPO requires computing advantages, policy ratios, KL penalties — that's non-trivial math that PyTorch autograd gives us for free.
346
-
347
- If someone fixed llama-finetune for NemotronH AND implemented GRPO in C++, we wouldn't need PyTorch at all. But that's significant development effort vs using PyTorch's existing autograd.
348
-
349
- ## Risk Assessment
350
-
351
- | Risk | Severity | Mitigation |
352
- |------|----------|------------|
353
- | Log-prob divergence (llama.cpp vs PyTorch) | Low | Both BF16; test measures actual gap |
354
- | torch_forward training is slow | Low | Only for backward pass, not generation |
355
- | Reward function too noisy | Medium | Start with simple heuristics, iterate |
356
- | LoRA GGUF hot-swap has bugs | Low | Test with llama.cpp first |
357
- | 28 min/step too slow for iteration | Low | Reduce batch size for early experiments |
358
-
359
- ---
360
-
361
- ## Gap Test Results (2026-03-24)
362
-
363
- **All 3 tests pass ✅**
364
-
365
- ### Critical Fix: mamba_ssm Mock
366
- NVIDIA's custom `modeling_nemotron_h.py` hard-requires `mamba_ssm` at import time. Since the CUDA kernels don't build on SM 12.1, we mock `mamba_ssm` at import time with:
367
- 1. Fake module hierarchy (`mamba_ssm.ops.triton.layernorm_gated`, etc.)
368
- 2. Real `rmsnorm_fn` implementation copied from [mamba_ssm reference code](https://github.com/state-spaces/mamba/blob/main/mamba_ssm/ops/triton/layernorm_gated.py) — pure PyTorch, uses `einops`
369
- 3. `selective_state_update = None` → model falls back to `torch_forward`
370
-
371
- Also uninstalled broken `causal_conv1d` and `mamba_ssm` packages that had non-functional `.so` files.
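The import-time mocking mechanism can be sketched with the standard library alone. This shows only how a fake module hierarchy is registered, using the one module path named above; `install_fake_package` and `placeholder_rmsnorm_fn` are hypothetical names, and the placeholder deliberately raises instead of pretending to be a correct gated RMSNorm. Where `selective_state_update` should live is not stated here, so it is left out.

```python
import sys
import types

def install_fake_package(dotted_name, attrs=None):
    """Register dotted_name and every parent package in sys.modules."""
    parts = dotted_name.split(".")
    parent = None
    for i in range(1, len(parts) + 1):
        name = ".".join(parts[:i])
        module = sys.modules.get(name)
        if module is None:
            module = types.ModuleType(name)
            module.__path__ = []          # mark as a package
            sys.modules[name] = module
        if parent is not None:
            setattr(parent, parts[i - 1], module)   # allow attribute access a.b.c
        parent = module
    for key, value in (attrs or {}).items():
        setattr(parent, key, value)
    return parent

def placeholder_rmsnorm_fn(*args, **kwargs):
    # Fail loudly: a silently wrong norm is worse than a crash.
    raise NotImplementedError("supply a verified pure-PyTorch gated RMSNorm here")

install_fake_package(
    "mamba_ssm.ops.triton.layernorm_gated",
    {"rmsnorm_fn": placeholder_rmsnorm_fn},
)

# The import that the modeling file performs now resolves to the fake module.
from mamba_ssm.ops.triton.layernorm_gated import rmsnorm_fn
print(rmsnorm_fn is placeholder_rmsnorm_fn)
```

The registration must run before the modeling file is imported, because that file resolves these names at module level.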
372
-
373
- ### Test Results
374
- | Test | Result | Key Metric |
375
- |------|--------|------------|
376
- | Log-prob consistency | ✅ PASS | Perplexity = 4,595 (realistic) |
377
- | torch_forward training | ✅ PASS | Loss = 8.45, 92 LoRA layers with gradients |
378
- | LoRA GGUF tooling | ✅ PASS | llama.cpp `--lora` + `convert_lora_to_gguf.py` ready |
379
-
380
- ### Memory Usage (Test 2)
381
- - Model load: 15.08 GB (BF16, all 42 layers)
382
- - After LoRA: 8.29 GB (PEFT wraps efficiently)
383
- - Peak (forward): 11.69 GB
384
- - After optimizer step: 8.98 GB
385
- - LoRA params: 82.7M / 4.06B total (2.04%)
386
-
387
- ## Smoke Test — PASSED ✅ (2026-03-24 05:32 UTC)
388
-
389
- Full end-to-end pipeline validated with minimal parameters:
390
-
391
- ```
392
- Config: 1 step, 1 prompt, 2 generations, 32 max tokens
393
- ```
394
-
395
- ### Pipeline Execution
396
- | Phase | Time | Status |
397
- |-------|------|--------|
398
- | Model load (HF + LoRA) | 72s | ✅ 82.7M trainable params, 8.29 GB GPU |
399
- | llama.cpp servers start (policy + reference) | 12s | ✅ Both on ports 8090/8091 |
400
- | Generate 2 completions | 5.3s | ✅ avg 28 words |
401
- | Compute rewards | <1s | ✅ mean=2.0/5 |
402
- | Forward + backward | 21.4s | ✅ |
403
- | LoRA save + GGUF convert | 8.0s | ✅ adapter saved, GGUF written |
404
- | Server restart with new LoRA | 6s | ✅ restart-based swap works |
405
- | **Total step** | **41s** | ✅ |
406
-
407
- ### Sample Output
408
- ```
409
- Prompt: "I wouldn't say that I do know. In normal social circumstances,
410
- we have evolved mechanisms to keep pe..."
411
- Completion: "And also because the internet is an asynchronous medium, so
412
- there's no way to see what I'm going to do..."
413
- Reward: 2.0/5
414
- ```
415
-
416
- ### Notes
417
- - Loss = 0.0 because both completions got identical reward → GRPO advantage = 0 → no gradient. Expected with single prompt. Real training with diverse prompts will produce non-zero gradients.
418
- - LoRA GGUF conversion uses `convert_lora_to_gguf.py` from llama.cpp — works correctly.
419
- - Server restart cycle (stop → start with new LoRA) takes ~6s. Could optimize with hot-swap API later.
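Why identical rewards give a zero loss follows directly from group normalization. The sketch below reuses the `compute_grpo_advantages` name and argument order from the pseudocode, but the body is an assumption: it uses population standard deviation and a small `eps`, which may differ from the real script.

```python
import statistics

def compute_grpo_advantages(rewards, num_prompts, num_generations, eps=1e-6):
    """Normalize rewards within each prompt's group: (r - mean) / (std + eps)."""
    if len(rewards) != num_prompts * num_generations:
        raise ValueError("expected num_prompts * num_generations rewards")
    advantages = []
    for g in range(num_prompts):
        group = rewards[g * num_generations:(g + 1) * num_generations]
        mean = statistics.fmean(group)
        std = statistics.pstdev(group)   # population std within the group
        advantages.extend((r - mean) / (std + eps) for r in group)
    return advantages

# Smoke-test case: one prompt, two completions, same reward.
print(compute_grpo_advantages([2.0, 2.0], num_prompts=1, num_generations=2))
# Any spread in rewards within the group produces a signal.
print(compute_grpo_advantages([1.0, 3.0], num_prompts=1, num_generations=2))
```

With only two short generations and a coarse heuristic reward, ties are likely, so a zero-loss smoke test says little about whether gradients flow in a full run.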
420
-
421
- ---
422
-
423
- *Next: Real training run with proper batch sizes.*
 
docs/GRPO_V4_POSTMORTEM.md DELETED
@@ -1,129 +0,0 @@
1
- # GRPO v4 Postmortem — PyTorch Mock Training is Broken
2
-
3
- > Date: 2026-03-24
4
- > Status: FAILED — pivoting to Option 2 (SFT Distillation)
5
-
6
- ## What Happened
7
-
8
- GRPO v4 ran 24 steps over 6 hours. Loss oscillated wildly (-6.5 to +8.5), reward stayed flat (~1.5-2.0), grad norms exploded (up to 1692). Diagnostic testing revealed the root cause.
9
-
10
- ## Root Cause: The rmsnorm_fn Mock Produces a Broken Model
11
-
12
- ### The Dependency Chain
13
-
14
- NVIDIA's custom `modeling_nemotron_h.py` imports at module level:
15
-
16
- ```python
17
- from mamba_ssm.ops.triton.layernorm_gated import rmsnorm_fn # line 63
18
- from causal_conv1d import causal_conv1d_fn, causal_conv1d_update # line 68
19
- ```
20
-
21
- Both `mamba_ssm` and `causal_conv1d` contain CUDA/Triton kernels that don't compile on GB10 (SM 12.1). So we mocked them.
22
-
23
- ### What rmsnorm_fn Does
24
-
25
- `rmsnorm_fn` is a **gated RMSNorm** — not standard RMSNorm. It's called inside every Mamba-2 mixer block:
26
-
27
- ```python
28
- # Inside NemotronHMamba2Mixer.torch_forward():
29
- scan_output = self.norm(y, gate) # calls rmsnorm_fn(x=y, z=gate, ...)
30
- ```
31
-
32
- Parameters:
33
- - `x`: hidden states to normalize
34
- - `weight`: learnable scale
35
- - `z`: gate tensor (multiplied via SiLU activation)
36
- - `group_size`: sub-group normalization (key for Mamba-2)
37
- - `norm_before_gate`: whether to norm then gate, or gate then norm
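To make the role of these parameters concrete, here is a scalar sketch of a gated RMSNorm over a single vector. It follows the parameter descriptions above and is meant only to show why `group_size` and `norm_before_gate` change the output; it operates on Python lists, has not been checked against the Triton kernel, and `gated_rmsnorm` is a hypothetical name.

```python
import math

def silu(v):
    return v / (1.0 + math.exp(-v))

def gated_rmsnorm(x, weight, z=None, eps=1e-6, group_size=None, norm_before_gate=True):
    """Gated RMSNorm over one vector, given as Python lists of floats."""
    n = len(x)
    group_size = n if group_size is None else group_size
    if n % group_size != 0:
        raise ValueError("len(x) must be divisible by group_size")
    if z is not None and not norm_before_gate:
        x = [xi * silu(zi) for xi, zi in zip(x, z)]      # gate first, then normalize
    out = []
    for start in range(0, n, group_size):
        group = x[start:start + group_size]
        rstd = 1.0 / math.sqrt(sum(v * v for v in group) / group_size + eps)
        out.extend(v * rstd for v in group)              # each group has its own scale
    out = [o * w for o, w in zip(out, weight)]
    if z is not None and norm_before_gate:
        out = [o * silu(zi) for o, zi in zip(out, z)]    # normalize first, then gate
    return out

x = [1.0, 1.0, 10.0, 10.0]
w = [1.0] * 4
print(gated_rmsnorm(x, w))                 # one RMS over the whole vector
print(gated_rmsnorm(x, w, group_size=2))   # separate RMS per group of 2
```

In this example the first two outputs differ by roughly a factor of seven depending on the grouping, so a mock that handles `group_size` differently from the kernel would distort every Mamba-2 block's output.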
38
-
39
- ### Why the Mock Failed
40
-
41
- Our mock used the reference implementation from `mamba_ssm/ops/triton/layernorm_gated.py`:
42
-
43
- ```python
44
- def rmsnorm_fn(x, weight, z=None, eps=1e-6, group_size=None, norm_before_gate=True):
45
- # ... pure PyTorch reimplementation
46
- ```
47
-
48
- **Diagnostic results:**
49
- - Model generates **50 spaces** instead of text (greedy decoding)
50
- - Per-token log-probs: -4 to -12 (expected: -1 to -3)
51
- - Top-5 predictions at every position: punctuation (`?`, `,`, `:`)
52
- - "Consciousness" ranked **2833rd** when it should be top-10
53
- - CE loss: 8.51 (expected: ~3-4 for a working 4B model)
54
-
55
- The mock produces outputs that are **structurally different** from the real Triton kernel. Possible causes:
56
- 1. **Numerical precision**: Triton kernel uses fused FP32 accumulation; our mock does sequential PyTorch ops in BF16→FP32→BF16
57
- 2. **Group normalization details**: The `group_size` parameter interacts with head dimensions in Mamba-2; slight mishandling corrupts the hidden state
58
- 3. **Stateful Mamba dependencies**: The Mamba-2 scan operation accumulates state across positions. Small norm errors compound across the 38 Mamba layers
59
-
60
- ### Why This Matters
61
-
62
- The mock produced a model that **looks like it loads correctly** (all 263 weights, 42 layers, no errors) but **behaves like a random model**. This is worse than a crash — it silently trains on garbage signal for hours.
63
-
64
- The GRPO training loop was architecturally correct (on-policy generation via llama.cpp, LoRA update, GGUF sync). But the PyTorch training model couldn't produce meaningful log-probs, so the policy gradient signal was noise.
65
-
66
- ## What We Tried (Chronological)
67
-
68
- | Attempt | Blocker |
69
- |---------|---------|
70
- | NeMo RL bare metal | `deep_ep` won't compile on aarch64 |
71
- | NeMo RL Docker | Triton JIT fails on SM 12.1 |
72
- | PyTorch + trust_remote_code | `causal_conv1d` broken .so |
73
- | Uninstall broken packages | `mamba_ssm` hard-required at import |
74
- | Native transformers (no trust_remote_code) | `-` in pattern = `mlp`, but `mlp` not a valid block type |
75
- | Mock mamba_ssm | Model loads but outputs garbage |
76
- | GRPO v4 with mock | 24 steps, loss oscillates, no convergence |
77
-
78
- **Every path to "train Nemotron-4B in PyTorch on GB10" is blocked by the same root cause: the Mamba-2 CUDA kernels don't work on SM 12.1, and there's no correct pure-PyTorch fallback.**
79
-
80
- ## Options Going Forward
81
-
82
- ### Option 1: Different Model Entirely
83
- Pick a standard transformer that works on GB10 (Qwen3.5-4B, Llama, Gemma). No Mamba-2, no custom kernels. Full GRPO via PyTorch + vLLM (which works on the routangseng venv with torch 2.10 + vLLM 0.18.0).
84
-
85
- - **Pro**: Clean slate, everything works natively
86
- - **Con**: Lose the Nemotron-4B base quality (4.35/5 eval score)
87
- - **Effort**: Medium (rewrite reward function, set up training)
88
-
89
- ### Option 2: SFT Distillation via llama.cpp ⭐ RECOMMENDED
90
- Use Nemotron-4B as a **data generator** (via llama.cpp, which works perfectly). Generate thousands of high-quality interviewer completions. SFT a standard transformer on those completions.
91
-
92
- - **Pro**: Uses Nemotron's strength (generation) without fighting its training
93
- - **Pro**: No mocks, no CUDA kernel issues
94
- - **Pro**: SFT is proven on GB10 (we've done it 5 times before)
95
- - **Con**: Still need a trainable student model
96
- - **Effort**: Low-Medium
97
-
98
- Pipeline:
99
- ```
100
- Nemotron-4B (llama.cpp) → Generate 10K+ interview completions
101
- → Filter by reward function (keep score ≥ 4)
102
- → SFT train a standard model (Qwen3.5-4B or similar)
103
- → Evaluate → iterate
104
- ```
105
-
106
- ### Option 3: Cloud GPU for Training
107
- Rent A100/H100 ($1-3/hr), train with NeMo RL properly, deploy on GB10.
108
-
109
- - **Pro**: All CUDA kernels work on SM 8.x/9.x
110
- - **Pro**: NeMo RL designed for this exact use case
111
- - **Con**: Extra cost, setup overhead
112
- - **Effort**: Medium
113
-
114
- ### Option 4: Ship Base Model
115
- Nemotron-4B base already scores 4.35/5. Ship it.
116
-
117
- - **Pro**: Zero effort, already best-in-class
118
- - **Con**: No fine-tuning, no persona customization
119
- - **Effort**: None
120
-
121
- ## Key Lessons
122
-
123
- 1. **A model that loads without errors can still be completely broken.** Weight loading success ≠ correct inference. Always validate generation quality before training.
124
-
125
- 2. **Mocking CUDA kernels is dangerous.** The mock passed all unit tests (forward works, backward works, gradients flow) but produced garbage outputs. Numerical correctness requires exact implementation matching.
126
-
127
- 3. **Know when to stop fighting the hardware.** We spent 24+ hours across multiple approaches trying to make Nemotron-4B trainable on GB10. The hardware (SM 12.1) is simply too new for the Mamba-2 CUDA ecosystem. Use tools that work.
128
-
129
- 4. **Separate generation from training.** llama.cpp handles Nemotron-4B inference perfectly. PyTorch can't. Use each tool for what it's good at.
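Lesson 1 can be turned into a cheap pre-flight gate. The sketch below is a hypothetical helper, not part of the original scripts: the names `mean_cross_entropy` and `looks_healthy` and the 5.0 ceiling are assumptions chosen for illustration. It takes per-token log-probs scored on a known-good sentence and refuses to proceed when the mean cross-entropy sits far above the ~3-4 range quoted above for a working 4B model.

```python
from statistics import mean


def mean_cross_entropy(token_logprobs):
    """Mean negative log-likelihood (nats per token) of a scored sequence."""
    if not token_logprobs:
        raise ValueError("need at least one token log-prob")
    return -mean(token_logprobs)


def looks_healthy(token_logprobs, max_ce=5.0):
    """True when the CE is under a hand-picked ceiling (assumed value: 5.0)."""
    return mean_cross_entropy(token_logprobs) <= max_ce


# Illustrative values drawn from the ranges reported in the diagnostics above.
broken = [-4.0, -12.0, -9.5, -8.5]    # mock model: -4 to -12 per token
expected = [-1.0, -3.0, -2.0, -1.5]   # working model: -1 to -3 per token

print(looks_healthy(broken), looks_healthy(expected))
```

Running a gate like this once after model load, before any optimizer step, would likely have flagged the mock within seconds rather than after six hours of training.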
 
docs/GRPO_V7_DESIGN.md DELETED
@@ -1,164 +0,0 @@
1
- # GRPO v7 Design
2
-
3
- Created: 2026-03-26
4
- Status: Run10 live (W&B `wejcyyj5`)
5
-
6
- ---
7
-
8
- ## Problem Statement
9
-
10
- GRPO v3–v6 all failed due to the same root bug: **gen_max_tokens was far too small**.
11
-
12
- Nemotron 4B thinks before answering. A typical generation:
13
- - Thinking phase: 600–1100 tokens
14
- - Answer (interviewer question): 20–100 tokens
15
- - **Minimum needed: ~700–1200 tokens total**
16
-
17
- Previous runs used 300–800 tokens. Result: model was always cut off mid-think, `</think>` never appeared, `strip_thinking()` returned empty string, reward=0 on every completion. GRPO had nothing to learn from.
18
-
19
- ---
20
-
21
- ## v7 Changes vs v6
22
-
23
- | Aspect | v6 | v7 |
24
- |---|---|---|
25
- | Architecture | Full fine-tune (42 layers) | LoRA r=32 (4 attn layers, 1.03% params) |
26
- | gen_max_tokens | 300–800 | 4000 |
27
- | Generation | HF model (broken P(</think>)) | llama.cpp server |
28
- | strip_thinking | regex remove | extract after </think> |
29
- | LR schedule | flat 1e-5 | linear warmup 30 steps → cosine |
30
- | kl_coef | 0.02–0.1 | 0.05 |
31
- | generations | 4–8 | 4 |
32
-
33
- ---
34
-
35
- ## Generation: llama.cpp Server
36
-
37
- The HF model in Python has P(`</think>`) ≈ 0 from the first decode token. This is model behavior — NVIDIA trained it with `selective_state_update` CUDA kernel. Without the real kernel, the Python fallback produces wrong SSM states.
38
-
39
- llama.cpp with the GGUF model works correctly via `reasoning_format: "deepseek"` + `thinking_forced_open: true`. The server returns:
40
- - `content`: the visible answer (the interviewer question)
41
- - `reasoning_content`: the thinking chain
42
-
43
- No stripping needed — llama.cpp handles the `</think>` boundary.
44
-
45
- **Off-policy gap:** GGUF model generates, HF model trains. Same off-policy issue as v3. Acceptable for now; importance sampling correction is planned for v8.
46
-
47
- ---
48
-
49
- ## strip_thinking() — Correct Implementation
50
-
51
- ```python
52
- def strip_thinking(text: str) -> str:
53
- """Extract only the visible answer after </think>."""
54
- if '</think>' in text:
55
- idx = text.index('</think>') + len('</think>')
56
- return text[idx:].strip()
57
- elif '<think>' in text:
58
- # Truncated mid-think — discard entirely
59
- return ''
60
- else:
61
- # No thinking block — use as-is
62
- return text.strip()
63
- ```
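A quick check of the three cases. The function is restated so the snippet runs on its own, and the sample strings are made up for illustration:

```python
def strip_thinking(text: str) -> str:
    """Extract only the visible answer after </think>."""
    if '</think>' in text:
        idx = text.index('</think>') + len('</think>')
        return text[idx:].strip()
    if '<think>' in text:
        return ''            # truncated mid-think: discard entirely
    return text.strip()      # no thinking block: use as-is


finished = "<think>plan the follow-up</think>\nWhat surprised you most?"
truncated = "<think>the guest mentioned meditation, so"
plain = "  What surprised you most?  "

for sample in (finished, truncated, plain):
    print(repr(strip_thinking(sample)))
```

The truncated case is the one that matters for the token budget: it yields an empty string, so any reward that requires a trailing `?` scores it as zero.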
64
-
65
- ---
66
-
67
- ## Reward Function
68
-
69
- ```python
70
- def reward_fn(response: str, prompt_context: str) -> float:
71
- """
72
- 5-component heuristic matching the eval_v2 scorer.
73
- Scores 0–5, targeting Lex Fridman interviewer style.
74
- """
75
- score = 0.0
76
-
77
- # 1. Is it a question? (required)
78
- if not response.strip().endswith('?'):
79
- return 0.0
80
- score += 1.0
81
-
82
- # 2. Single question only (no multi-part)
83
- if response.count('?') > 2:
84
- score -= 0.5
85
- elif response.count('?') == 1:
86
- score += 0.5
87
-
88
- # 3. Length (20–60 words ideal)
89
- words = len(response.split())
90
- if 20 <= words <= 60:
91
- score += 1.5
92
- elif words < 20 or words > 120:
93
- score += 0.0
94
- else:
95
- score += 0.75
96
-
97
- # 4. Topical relevance (overlap with guest answer)
98
- guest_words = set(prompt_context.lower().split()) - STOPWORDS
99
- resp_words = set(response.lower().split())
100
- overlap = len(guest_words & resp_words) / max(len(guest_words), 1)
101
- score += min(overlap * 2.0, 1.5)
102
-
103
- # 5. No filler openers ("That's fascinating", "Great point")
104
- filler = ['that\'s fascinating', 'great point', 'interesting', 'that\'s a great',
105
- 'what a', 'absolutely', 'certainly', 'of course']
106
- if any(f in response.lower()[:50] for f in filler):
107
- score -= 0.5
108
-
109
- return max(0.0, min(score, 5.0))
110
- ```
111
-
112
- ---
113
-
114
- ## GRPO Loss
115
-
116
- ```python
117
- # Advantages: normalize within completion group
118
- rewards = torch.tensor([reward_fn(strip_thinking(g), context) for g in generations])
119
- mean_r, std_r = rewards.mean(), rewards.std()
120
- advantages = (rewards - mean_r) / (std_r + 1e-8)
121
-
122
- # Skip step if all rewards identical (zero advantage = zero gradient = mode drift)
123
- if std_r < 0.01:
124
- continue
125
-
126
- # Policy log-probs on visible tokens only (after </think>)
127
- log_probs = compute_log_probs(model, tokenized_completion, visible_mask)
128
- ref_log_probs = compute_log_probs(ref_model, tokenized_completion, visible_mask)
129
-
130
- # GRPO + KL
131
- policy_loss = -(advantages * log_probs).mean()
132
- kl_loss = (log_probs - ref_log_probs).mean()
133
- loss = policy_loss + kl_coef * kl_loss
134
- ```
135
-
136
- **Reference model:** LoRA disabled via `model.disable_adapter_layers()` — same memory footprint as policy, no second copy needed.
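The advantage normalization and the zero-variance skip can be checked without a GPU. This is a minimal stdlib sketch, not the training code: `group_advantages` is a hypothetical name, and it uses the sample standard deviation on the assumption that the original relies on the default behaviour of `torch.Tensor.std()`.

```python
from statistics import mean, stdev


def group_advantages(rewards, eps=1e-8, min_std=0.01):
    """Normalize rewards within one completion group.

    Returns None when the group carries no learning signal
    (fewer than two rewards, or near-identical rewards).
    """
    if len(rewards) < 2:
        return None
    mu = mean(rewards)
    sigma = stdev(rewards)          # sample std (n-1 denominator)
    if sigma < min_std:
        return None                 # caller should skip this step
    return [(r - mu) / (sigma + eps) for r in rewards]


mixed = group_advantages([4.0, 0.0, 3.5, 0.0])
flat = group_advantages([2.0, 2.0, 2.0, 2.0])
print(mixed, flat)
```

Returning `None` instead of a list of zeros makes the skip explicit at the call site, which mirrors the `continue` in the loop above.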
137
-
138
- ---
139
-
140
- ## run10 Results (first 2 steps)
141
-
142
- ```
143
- Step 0: reward mean=1.928 std=1.932
144
- [gen 1] "When you say the self disappears during meditation,
145
- how does that experience feel different from ordinary states of mind?" → 4.04/5
146
-
147
- Step 1: reward mean=0.980 std=1.697
148
- [gen 3] "What do you think the most profound consequence of unregulated genetic
149
- selection for intelligence might be, beyond the obvious?" → 3.92/5
150
- ```
151
-
152
- Step times: ~89s (loss) + ~73s (generation) = ~162s/step.
153
- ETA for 500 steps: ~22h from launch (completes ~20:00 UTC Mar 27).
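The ETA follows directly from the two phase timings. A small sketch of the arithmetic, where `eta_hours` is a hypothetical helper name:

```python
def eta_hours(seconds_per_step: float, steps: int) -> float:
    """Wall-clock estimate in hours for a fixed per-step cost."""
    return seconds_per_step * steps / 3600.0


step_seconds = 89 + 73                      # loss phase + generation phase
print(step_seconds, eta_hours(step_seconds, 500))
```

This gives 162 s/step and 22.5 hours for 500 steps, consistent with the ~22h figure above. It assumes the step time stays constant, which may not hold if completions grow longer during training.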
154
-
155
- ---
156
-
157
- ## Planned: GRPO v8 (on-policy)
158
-
159
- With `mamba-ssm` now installed (real CUDA kernel), the HF model should produce correct P(`</think>`). v8 will:
160
-
161
- 1. Generate with HF model directly (fully on-policy)
162
- 2. Drop llama.cpp server dependency
163
- 3. Add importance sampling correction for any remaining distribution gap
164
- 4. Mix 10% `enable_thinking=False` samples (NVIDIA recipe)
 
docs/GRPO_V8_CHANGES.md DELETED
@@ -1,55 +0,0 @@
1
- # GRPO v8 — Speed Optimizations (2026-03-27)
2
-
3
- ## Problem
4
- v8 train phase averaging **183s/step** due to 64 sequential forward passes.
5
-
6
- ## Root Cause
7
- Both sweep 1 (ref logprobs) and sweep 2 (policy logprobs) called
8
- `compute_logprobs_batched(model, tok, [single_item], ...)` in a loop —
9
- wasting 32 kernel launch cycles each sweep.
10
-
11
- ## Fixes Applied
12
-
13
- ### 1. Sweep 1 — micro-batched (was 1-at-a-time)
14
- ```python
15
- # Before: 32 sequential single-item calls
16
- for (pt, tids, _, _) in flat_items:
17
- rlp = compute_logprobs_batched(model, tok, [pt], [tids], ...)
18
-
19
- # After: 4 micro-batches of 8
20
- for mb_start in range(0, n_total, cfg.micro_batch_size):
21
- mb = flat_items[mb_start : mb_start + cfg.micro_batch_size]
22
- rlps = compute_logprobs_batched(model, tok, [it[0] for it in mb], ...)
23
- ```
24
- **Why not full batch=32?** Padding to max_seq_len=5000 would need 41GB activations
25
- (vs 81GB free). At typical lengths (~800 tokens), batch=8 → 1.7GB only.
26
-
27
- ### 2. Sweep 2 — micro-batched (was 1-at-a-time)
28
- Same pattern. Gradients accumulate correctly across micro-batches since
29
- each item's loss is divided by n_total before `.backward()`.
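Why dividing by `n_total` keeps accumulation correct can be shown with plain numbers standing in for per-item losses. This is an illustrative sketch only: `micro_batches` and `accumulated_loss` are hypothetical names, and scalars replace tensors. Because differentiation is linear, the same identity holds for the accumulated gradients.

```python
def micro_batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def accumulated_loss(per_item_losses, micro_batch_size):
    """Sum of (loss_i / n_total) over all micro-batches."""
    n_total = len(per_item_losses)
    total = 0.0
    for mb in micro_batches(per_item_losses, micro_batch_size):
        total += sum(loss / n_total for loss in mb)
    return total


losses = [0.1 * i for i in range(32)]       # stand-ins for 32 completions
full_batch_mean = sum(losses) / len(losses)

print(len(list(micro_batches(losses, 8))))  # number of micro-batches
print(abs(accumulated_loss(losses, 8) - full_batch_mean) < 1e-9)
```

The key detail is that the divisor is the total item count (32), not the micro-batch size (8); dividing by the micro-batch size would scale the effective gradient by the number of micro-batches.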
30
-
31
- ### 3. vLLM GPU utilization: 0.30 → 0.45
32
- Gives vLLM 58GB (up from 38GB) for better KV cache — faster long-seq generation.
33
-
34
- ### 4. LoRA targets: +in_proj, +out_proj
35
- Adds 38 Mamba SSM layers to the trainable set (was only 4 attention layers + MLP).
36
-
37
- ## Expected Impact
38
- | Phase | Before | After |
39
- |-------|--------|-------|
40
- | Train (sweep 1+2) | 183s | ~80s |
41
- | Generation | 152s | ~120s |
42
- | **Total/step** | **335s** | **~200s** |
43
-
44
- ## Validation Trend (v8 run5, before optimization)
45
- - Step 80: 3.42/10
46
- - Step 90: 3.48/10
47
- - Step 100: 3.56/10
48
- - Step 110: 4.02/10 ← accelerating
49
-
50
- ## Launch Command
51
- ```bash
52
- cd /home/bobber/lex-ft
53
- tmux new-session -d -s grpo_v8 './launch_grpo_v8.sh grpo-v8-run6 2>&1'
54
- tmux attach -t grpo_v8
55
- ```
 
docs/GRPO_V8_ONPOLICY_PLAN.md DELETED
@@ -1,320 +0,0 @@
1
- # GRPO v8 — On-Policy Training Plan with vLLM
2
-
3
- Created: 2026-03-27
4
- Reference: [Nemotron 3 Nano RL Recipe](https://github.com/NVIDIA-NeMo/Nemotron/tree/nano-3-training/docs/nemotron/nano3)
5
- Status: Design phase (v7 run10 terminated — was off-policy, blocked)
6
-
7
- ---
8
-
9
- ## Why v7 Failed (Off-Policy)
10
-
11
- GRPO v7 used llama.cpp GGUF model for generation and HF BF16 model for training.
12
- These are different model representations with different probability distributions.
13
- Result: training rewards looked positive but were measuring the wrong model's distribution.
14
-
15
- This is the same off-policy gap that broke GRPO v3 in March. We knew about it. v7 was a stepping stone.
16
-
17
- **v8 closes this gap completely**: vLLM generates from the exact same HF weights being trained.
18
-
19
- ---
20
-
21
- ## NVIDIA's Approach (Nemotron 3 Nano Cookbook)
22
-
23
- From the [NeMo-RL GRPO recipe](https://docs.nvidia.com/nemo/rl/latest/guides/grpo.html):
24
-
25
- ```
26
- Generate responses from the current policy using vLLM
27
- → Evaluate using NeMo-Gym reward environments
28
- → Compute group-relative advantages per prompt
29
- → Update policy with clipped gradients (PPO-style)
30
- ```
31
-
32
- **Key parameters from NVIDIA's production run:**
33
- - `num_prompts_per_step`: 128
34
- - `num_generations_per_prompt`: 16
35
- - `max_total_sequence_length`: 49152 (~49K)
36
- - `ratio_clip_min`: 0.2, `ratio_clip_max`: 0.28 (asymmetric)
37
- - `use_on_policy_kl_approximation`: true
38
- - `use_importance_sampling_correction`: true
39
- - `lr`: 3e-6 (lower than our lr=2e-5 — model is post-SFT when they run RL)
40
- - `normalize_rewards`: true
41
- - `use_leave_one_out_baseline`: true (variance reduction)
42
- - `token_level_loss`: true
43
- - `reference_policy_kl_penalty`: 0 (KL disabled — they use importance sampling instead)
44
- - Reasoning on/off: 10% of samples use `enable_thinking=False`
45
-
46
- **Our adaptations** (single GB10, LoRA, style task not verifiable):
47
- - 4–8 prompts/step (vs 128 — memory limited)
48
- - 4 generations/prompt (vs 16)
49
- - max_tokens: 4000 (vs 49K — our thinking chains are shorter)
50
- - lr: 3e-6 with 30-step warmup
51
- - KL penalty: 0.05 (light anchor — we don't have a GenRM)
52
- - Reference: disabled-LoRA (saves memory vs separate frozen model)
53
-
54
- ---
55
-
56
- ## Architecture
57
-
58
- ```
59
- ┌─────────────────────────────────────────────────────────────────┐
60
- │ GRPO v8 Per-Step Loop │
61
- │ │
62
- │ 1. GENERATE (vLLM, .venv-vllm) │
63
- │ vLLM loads current LoRA weights from shared path │
64
- │ generate(prompts × 4, max_tokens=4000, enable_thinking=T) │
65
- │ → returns: thinking_tokens, answer_tokens, token_ids │
66
- │ │
67
- │ 2. REWARD (.venv-train) │
68
- │ reward_fn(answer) → score 0–5 │
69
- │ advantages = (rewards - mean) / (std + 1e-8) │
70
- │ skip step if std < 0.01 (mode collapse detection) │
71
- │ │
72
- │ 3. IMPORTANCE WEIGHTS (.venv-train) │
73
- │ vllm_logprobs = token log-probs from vLLM output │
74
- │ policy_logprobs = HF model forward (current weights) │
75
- │ ratio = exp(policy_logprobs - vllm_logprobs) │
76
- │ (ratio ≈ 1.0 if vLLM and HF weights are in sync) │
77
- │ │
78
- │ 4. LOSS (.venv-train) │
79
- │ clipped_ratio = clip(ratio, 1-ε, 1+ε) [ε=0.2] │
80
- │ policy_loss = -mean(min(ratio × adv, clipped_ratio × adv)) │
81
- │ kl_loss = mean(policy_logprobs - ref_logprobs) │
82
- │ loss = policy_loss + 0.05 × kl_loss │
83
- │ │
84
- │ 5. UPDATE (.venv-train) │
85
- │ loss.backward(); optimizer.step() │
86
- │ save LoRA weights to shared path │
87
- └─────────────────────────────────────────────────────────────────┘
88
- ```
89
-
90
- ---
91
-
92
- ## Implementation Plan
93
-
94
- ### Phase 1: Shared Weight Protocol
95
-
96
- vLLM and HF model must share the same weights. Options:
97
-
98
- **Option A: Re-load vLLM each step** (simple, slow)
99
- - Save merged weights after each gradient update
100
- - Re-initialize vLLM `LLM()` from new checkpoint
101
- - Problem: 80s load time per step = impractical
102
-
103
- **Option B: vLLM weight sync API** (fast, complex)
104
- - vLLM 0.18 has `llm.llm_engine.model_executor.driver_worker.model_runner.model.load_weights()`
105
- - After each training step, sync LoRA deltas to vLLM in-place
106
- - No re-load needed — sub-second sync
107
-
108
- **Option C: Separate processes with weight file** (recommended)
109
- - Training process saves LoRA checkpoint to `/tmp/grpo_v8_lora/`
110
- - Generation process polls for new checkpoint, loads it before each generation
111
- - Clean separation: vLLM doesn't fight with training CUDA allocator
112
- - Step flow: train → save checkpoint → signal → generate → train...
113
-
114
- We'll use **Option C** for v8. Cleaner than fighting CUDA memory between vLLM and the optimizer.
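One way the Option C handoff could look, as a sketch under stated assumptions: the `step_N` directory naming comes from the plan above, while `publish_checkpoint`, `latest_checkpoint`, and the `adapter.bin` file name are hypothetical. The trainer writes into a temporary directory and renames it, so the generation process never picks up a half-written checkpoint.

```python
import os
import re
import tempfile

STEP_DIR = re.compile(r"^step_(\d+)$")


def publish_checkpoint(root, step, payload: bytes):
    """Trainer side: write to a temp dir, then rename it into place."""
    final = os.path.join(root, f"step_{step}")
    tmp = tempfile.mkdtemp(prefix=".tmp_", dir=root)
    with open(os.path.join(tmp, "adapter.bin"), "wb") as f:
        f.write(payload)
    os.rename(tmp, final)    # same filesystem, so readers see all or nothing
    return final


def latest_checkpoint(root):
    """Generator side: return (step, path) of the newest checkpoint, or None."""
    best = None
    for name in os.listdir(root):
        match = STEP_DIR.match(name)   # skips the in-progress .tmp_* dirs
        if match and (best is None or int(match.group(1)) > best[0]):
            best = (int(match.group(1)), os.path.join(root, name))
    return best


demo_root = tempfile.mkdtemp()
publish_checkpoint(demo_root, 2, b"weights-2")
publish_checkpoint(demo_root, 10, b"weights-10")
print(latest_checkpoint(demo_root)[0])
```

Comparing step numbers as integers rather than strings matters here: a lexical sort would rank `step_2` above `step_10`.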
115
-
116
- ### Phase 2: Importance Sampling
117
-
118
- When vLLM and HF model are synced, `ratio ≈ 1.0` for all tokens. But there will always be a small discrepancy because:
119
- - vLLM uses its own Mamba-2 kernel (not HF's `NemotronH.forward`)
120
- - Different chunking/precision in SSM computation
121
-
122
- Following NVIDIA's recipe: use `use_importance_sampling_correction=True`.
123
-
124
- ```python
125
- # vLLM returns token log-probs for each generated token
126
- vllm_logprobs = torch.tensor([lp[t].logprob for t, lp in zip(outputs[0].outputs[0].token_ids, outputs[0].outputs[0].logprobs)]) # logprobs is a list of {token_id: Logprob}
127
-
128
- # HF model recomputes log-probs on same token sequence
129
- hf_logprobs = compute_log_probs(model, token_ids)
130
-
131
- # Importance ratio
132
- ratio = torch.exp(hf_logprobs - vllm_logprobs.detach())
133
- # ratio = 1.0 for perfectly synced weights, slightly off otherwise
134
- ```
135
-
136
- ### Phase 3: Overlong Filtering
137
-
138
- From NVIDIA's recipe: *"Excludes sequences that hit max length without EOS from loss computation"*
139
-
140
- ```python
141
- # Filter out truncated completions from loss
142
- hit_max = [len(o.outputs[0].token_ids) >= max_tokens for o in outputs]
143
- valid_mask = [not h for h in hit_max]
144
- # Only compute loss on completions that actually finished
145
- ```
146
-
147
- ### Phase 4: Reasoning On/Off Mixing
148
-
149
- From NVIDIA's recipe: *"Strip reasoning from 10% of samples"*
150
-
151
- ```python
152
- # 10% of prompts use enable_thinking=False
153
- use_thinking = random.random() > 0.1
154
- prompt = tok.apply_chat_template(
155
- msgs, add_generation_prompt=True, enable_thinking=use_thinking)
156
- ```
157
-
158
- This teaches the model both modes simultaneously, preventing reasoning collapse.
159
-
160
- ---
161
-
162
- ## Training Script: grpo_v8_train.py
163
-
164
- ### Key differences from v7:
165
-
166
- | Aspect | v7 | v8 |
167
- |---|---|---|
168
- | Generation | llama.cpp server (GGUF) | vLLM (.venv-vllm, HF weights) |
169
- | On-policy | ❌ (different model) | ✅ (same weights) |
170
- | Importance sampling | ❌ | ✅ |
171
- | PPO clipping | ❌ (vanilla PG) | ✅ (ε=0.2) |
172
- | Overlong filtering | ❌ | ✅ |
173
- | Thinking mixing | ❌ | ✅ (10% no-think) |
174
- | Weight sync | N/A | checkpoint every step |
175
- | Reference model | disabled-LoRA | disabled-LoRA (unchanged) |
176
-
177
- ### Process architecture:
178
-
179
- ```python
180
- # Two-process design
181
- # Process A: Training (.venv-train)
182
- # - Loads HF model with LoRA
183
- # - Receives generated token IDs from process B
184
- # - Computes loss, updates weights
185
- # - Saves LoRA delta to /tmp/grpo_v8_lora/step_N/
186
-
187
- # Process B: Generation (.venv-vllm)
188
- # - Loads vLLM with current LoRA weights
189
- # - Waits for generation requests (ZMQ socket or file polling)
190
- # - Generates completions, returns token_ids + logprobs
191
- # - Polls /tmp/grpo_v8_lora/ for weight updates after each step
192
- ```
193
-
194
- Simpler alternative: **single-process sequential** (no parallelism, but correct):
195
- ```
196
- for each step:
197
- 1. vLLM generate (subprocess call, returns JSON results)
198
- 2. Free vLLM GPU memory
199
- 3. Load HF model
200
- 4. Compute loss + update
201
- 5. Save LoRA checkpoint
202
- 6. Repeat
203
- ```
204
- Time cost: the model must be loaded and unloaded every step (~15s each). For 500 steps = ~2.5h overhead.
205
-
206
- **Recommended: two processes with shared memory or file-based IPC**. One process holds vLLM, other holds HF model. They communicate via files. CUDA contexts don't conflict (CUDA supports multiple contexts per GPU).
207
-
208
- ---
209
-
210
- ## Reward Function — No Changes Needed
211
-
212
- The v7 reward function works. The key insight from NVIDIA's recipe: they use *verifiable* rewards (math correctness, code execution). We can't do that for style tasks.
213
-
214
- Our heuristic (5 components, 0–5 scale) is the best we can do without a learned reward model. The baseline eval shows 4.35/5 for the base model — our reward ceiling.
215
-
216
- **Possible enhancement: GenRM-style reward** (future work)
217
- - Use the base model to score completions against the prompt
218
- - "Does this sound like Lex Fridman?" → scored by the model itself
219
- - Circular comparison: generate N, score each against the others
220
- - Expensive but avoids heuristic gaming
221
-
222
- ---
223
-
224
- ## Hyperparameters
225
-
226
- ```yaml
227
- # grpo_v8_config.yaml — adapted from Nemotron 3 Nano cookbook
228
-
229
- generation:
230
- num_prompts_per_step: 4 # vs NVIDIA's 128 (memory limited)
231
- num_generations_per_prompt: 4 # vs NVIDIA's 16
232
- max_tokens: 4000 # full thinking chain budget
233
- temperature: 1.0
234
- top_p: 0.95
235
- enable_thinking_fraction: 0.9 # 10% no-think mixing
236
- overlong_filter: true # exclude truncated from loss
237
-
238
- loss:
239
- ratio_clip_min: 0.2 # from NVIDIA recipe
240
- ratio_clip_max: 0.28 # asymmetric clipping
241
- use_importance_sampling: true # correct vLLM/HF mismatch
242
- kl_coef: 0.05 # light KL anchor
243
- token_level_loss: true # per-token normalization
244
- use_leave_one_out_baseline: true # variance reduction
245
-
246
- optimizer:
247
- type: AdamW
248
- lr: 3e-6 # from NVIDIA recipe (post-SFT RL)
249
- min_lr: 3e-7
250
- weight_decay: 0.0
251
- clip_grad: 1.0
252
- warmup_steps: 30
253
-
254
- lora:
255
- r: 32
256
- alpha: 64
257
- targets: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
258
- # Note: targets only attention layers (4/25 layers in this model)
259
-
260
- training:
261
- max_steps: 500
262
- val_period: 10
263
- save_period: 50
264
- zero_std_skip: true # skip if all rewards identical
265
- ```
266
-
267
- ---
268
-
269
- ## Validation Plan
270
-
271
- Every 10 steps, run 5 fixed eval prompts and log to W&B:
272
-
273
- ```python
274
- EVAL_PROMPTS = [
275
- ("Andrej Karpathy", "Neural networks are simple."),
276
- ("Elon Musk", "I think the biggest risk is AI."),
277
- ("A physicist", "Time might not be fundamental."),
278
- ("A philosopher", "Free will is an illusion."),
279
- ("A jazz musician", "The best songs write themselves."),
280
- ]
281
- ```
282
-
283
- Track: `eval/mean_score`, `eval/pct_questions`, `eval/mean_length`, `eval/think_end_rate`.
284
-
285
- **Success criteria**: `eval/mean_score > 4.35/5` consistently across steps → better than base model baseline.
286
-
287
- ---
288
-
289
- ## Expected Timeline
290
-
291
- | Step | Action | ETA |
292
- |---|---|---|
293
- | 0 | Write `grpo_v8_train.py` | 1–2h |
294
- | 1 | Smoke test (5 steps, check losses/rewards are sane) | 30min |
295
- | 2 | Full run (500 steps) | ~18h at ~130s/step |
296
- | 3 | Eval vs base model | 30min |
297
- | 4 | If >4.35/5: merge LoRA → GGUF → push to HF | 1h |
298
-
299
- ---
300
-
301
- ## Previous GRPO Postmortems
302
-
303
- | Version | Status | Root Cause |
304
- |---|---|---|
305
- | v3 | gibberish | Off-policy (llama.cpp gen + HF train), no KL |
306
- | v4 | oscillating loss | rmsnorm mock silently broke forward pass |
307
- | v5 | not trained | batched gen had EOS detection bug |
308
- | v6 runs 1–10 | collapse | gen_max_tokens 300–800 (mid-think truncation → reward=0) |
309
- | v7 run10 | stuck/blocked | Off-policy (llama.cpp GGUF ≠ HF BF16), stalled at step 1 |
310
- | **v8** | **planned** | On-policy vLLM + importance sampling correction |
311
-
312
- ---
313
-
314
- ## References
315
-
316
- - [Nemotron 3 Nano RL Guide](https://github.com/NVIDIA-NeMo/Nemotron/blob/nano-3-training/docs/nemotron/nano3/rl.md)
317
- - [NeMo-RL GRPO Documentation](https://docs.nvidia.com/nemo/rl/latest/guides/grpo.html)
318
- - [Tech Report Section 3.2](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf)
319
- - [vLLM Setup Notes](./VLLM_SETUP_NOTES.md)
320
- - [GRPO v3 Postmortem](./GRPO_V3_POSTMORTEM.md) — the original off-policy diagnosis
 
docs/GRPO_V8_TRAINING_FLOW.md DELETED
@@ -1,184 +0,0 @@
1
- # GRPO v8 — End-to-End Training Flow
2
-
3
- > **Run6 validation scores**: Step 10 → 6.82/10, Step 20 → 7.64/10
4
- > (vs run5 reaching 4.02/10 at step 110 — Mamba LoRA is the difference)
5
-
6
- ---
7
-
8
- ## High-Level Architecture
9
-
10
- Two models coexist in GPU memory every step:
11
-
12
- ```
13
- ┌─────────────────────────────────────────────────────────┐
14
- │ vLLM (38GB) ──────→ fast completions │
15
- │ frozen base weights ──────→ + per-token log-probs │
16
- └─────────────────────────────────────────────────────────┘
17
- ↓ sync LoRA every step
18
- ┌─────────────────────────────────────────────────────────┐
19
- │ HF model (8GB) ──────→ ref logprobs (LoRA off) │
20
- │ + LoRA (41M params) ──────→ policy logprobs + grads │
21
- └─────────────────────────────────────────────────────────┘
22
- ```
23
-
24
- One training step = **8 prompts × 4 completions = 32 completions**.
25
-
26
- ---
27
-
28
- ## Phase 1: Generation (vLLM, ~80–140s)
29
-
30
- ```
31
- prompt = [system: "You are Lex Fridman..."]
32
- + [user: <guest's previous statement>]
33
-
34
- completions = vLLM.generate(prompt, n=4, max_tokens=4000, temp=0.9)
35
-
36
- # vLLM also returns per-token log-probs (used for IS correction):
37
- vllm_lp[t] = log P_vllm(token_t | token_1..t-1, prompt)
38
- ```
39
-
40
- vLLM uses CUDA graphs → 150–250 tok/s vs ~30–50 tok/s from a plain HF model.
41
-
42
- ---
43
-
44
- ## Phase 2: Reward (CPU, ~1s)
45
-
46
- Each completion gets a scalar reward from `reward_v8.reward_fn_group()`.
47
- **Structural scoring only — no LLM judge, ungameable by keywords.**
48
-
49
- ```
50
- r(completion) = base_score - penalties + diversity_bonus
51
-
52
- base_score:
53
- + ends_with_question? → +1.5 (Lex always asks questions)
54
- + is_open_question? → +0.5 (not yes/no binary)
55
- + sufficient_length? → +0.5 (≥8 content words)
56
- + has_pivot? → +0.5 (reframes guest's statement)
57
- + lexical_diversity? → +0.5 (unique content words)
58
-
59
- penalties:
60
- - filler_opener? → -2.0 ("that's fascinating", "great point"...)
61
- - collapse_template? → -3.0 ("as Lex Fridman", "the interview"...)
62
- - yes_no_question? → -1.5 (starts with "Is/Are/Was/Do...")
63
- - parrots_prompt? → -1.0 (>50% token overlap with prompt)
64
-
65
- diversity_bonus (group-level, across all 4 completions for same prompt):
66
- + unique angles explored → up to +0.5
67
- ```
68
-
69
- ---
70
-
71
- ## Phase 3: Sweep 1 — Reference Logprobs (HF, no grad)
72
-
73
- LoRA **disabled** → model acts as frozen reference policy:
74
-
75
- ```
76
- for each completion (micro-batched, 8 at a time):
77
- ref_lp[t] = log P_ref(token_t | context) # no_grad
78
- ```
79
-
80
- Memory: batch=8 × ~800 tokens × 42 layers × bf16 ≈ 1.7GB activations → safe.
81
-
82
- This is the **KL anchor** — prevents the policy from drifting too far from the base model.
83
-
84
- ---
85
-
86
- ## Phase 4: Sweep 2 — Policy Gradient (HF, with grad, 1-at-a-time)
87
-
88
- LoRA **enabled** → learner computes gradients:
89
-
90
- ```
91
- policy_lp[t] = log P_θ(token_t | context) # with grad
92
- ```
93
-
94
- **Step 1: Advantage (leave-one-out baseline)**
95
-
96
- ```
97
- A_i = (r_i - mean_{j≠i}(r_j)) / (std(r_1..N) + ε)
98
- ```
99
-
100
- Centering within the group of 32 completions normalizes away prompt difficulty.
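The leave-one-out formula from Step 1 can be sketched in a few lines. This is an illustration, not the training code: `loo_advantages` is a hypothetical name, and the use of the sample standard deviation is an assumption, since the formula above does not say which estimator is used.

```python
from statistics import stdev


def loo_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean of the other rewards) / (std of all rewards + eps)."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two rewards")
    total = sum(rewards)
    sigma = stdev(rewards)   # assumption: sample std over the whole group
    return [(r - (total - r) / (n - 1)) / (sigma + eps) for r in rewards]


advantages = loo_advantages([3.0, 1.0, 2.0, 2.0])
print([round(a, 3) for a in advantages])
```

Compared with subtracting the plain group mean, the leave-one-out baseline scales every centered reward by n/(n-1), so the advantages still sum to zero and their ordering is unchanged; the baseline for each completion simply no longer depends on that completion's own reward.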
101
-
102
- **Step 2: Importance Sampling (IS) ratio**
103
-
104
- Corrects for the fact that vLLM generated from step-N weights while we're now at step-N+ε:
105
-
106
- ```
107
- ρ_t = exp(policy_lp[t] - vllm_lp[t])
108
- ρ_t = clamp(ρ_t, 0, 10)
109
- ```
110
-
111
- - ρ ≈ 1.0 → policies agree, use gradient as-is
112
- - ρ >> 1 → current policy favors this token more → clip
113
- - ρ << 1 → policies drifted apart → downweight
114
-
115
- **Step 3: Asymmetric PPO clipping** (NVIDIA RL cookbook)
116
-
117
- ```
118
- ρ̃_t = clamp(ρ_t, 1 - ε_low, 1 + ε_high)
119
- where ε_low=0.2, ε_high=0.28 # asymmetric: ratio may rise to 1.28 but only fall to 0.8
120
-
121
- pg_loss_i = -mean_t[ min(ρ_t · A_i, ρ̃_t · A_i) ]
122
- ```
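Steps 2 and 3 combined, as a pure-Python sketch over per-token log-probs. `clipped_pg_loss` is a hypothetical name and lists stand in for tensors; the constants (cap of 10, ε_low=0.2, ε_high=0.28) are the ones quoted above.

```python
import math


def clipped_pg_loss(policy_lp, vllm_lp, advantage,
                    eps_low=0.2, eps_high=0.28, ratio_cap=10.0):
    """Clipped surrogate loss for one completion, averaged over its tokens."""
    if not policy_lp or len(policy_lp) != len(vllm_lp):
        raise ValueError("need two non-empty log-prob lists of equal length")
    terms = []
    for p, q in zip(policy_lp, vllm_lp):
        rho = min(math.exp(p - q), ratio_cap)                 # IS ratio, capped at 10
        rho_clipped = min(max(rho, 1.0 - eps_low), 1.0 + eps_high)
        # Pessimistic choice: take the smaller of the two surrogate objectives.
        terms.append(min(rho * advantage, rho_clipped * advantage))
    return -sum(terms) / len(terms)


synced = clipped_pg_loss([-1.0, -2.0], [-1.0, -2.0], advantage=1.5)
drifted_up = clipped_pg_loss([0.0], [-1.0], advantage=1.0)
print(synced, drifted_up)
```

When the two log-prob lists agree, every ratio is 1 and the loss reduces to the negated advantage. When the ratio drifts above 1.28 with a positive advantage, the clipped term wins, so the gradient through that token vanishes; with a negative advantage the unclipped term wins and the penalty is kept in full.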
123
-
124
- **Step 4: KL penalty**
125
-
126
- ```
127
- kl_loss_i = mean_t[ policy_lp[t] - ref_lp[t] ]
128
- ```
129
-
130
- **Step 5: Per-completion loss and accumulation**
131
-
132
- ```
133
- L_i = pg_loss_i + β · kl_loss_i # β = 0.05
134
-
135
- L_total = (1/N) · Σ_i L_i # N=32
136
-
137
- (L_i / N).backward() # accumulate gradient per completion
138
- optimizer.step() # AdamW 8-bit
139
- ```
140
-
141
- ---
142
-
143
- ## Why It's "On-Policy"
144
-
145
- After each optimizer step, the updated LoRA weights are synced into vLLM:
146
-
147
- ```
148
- vllm.load_weights(adapter_model.safetensors)
149
- ```
150
-
151
- So generation at step N+1 uses the policy from step N. The IS ratio `ρ_t` measures the drift — at run6 step 0 it showed `ratio_mean=0.9991, clipped=0.5%`, confirming near-perfect on-policy behavior.
152
-
153
- ---
154
-
155
- ## Comparison: lex-ft v8 vs routangseng phase8
156
-
157
- | | routangseng phase8 (TRL GRPOTrainer) | lex-ft v8 |
158
- |---|---|---|
159
- | **Generation model** | Same HF model (all weights) | vLLM (frozen base, no LoRA) |
160
- | **Generation speed** | ~30–50 tok/s | **150–250 tok/s** (CUDA graphs) |
161
- | **Training model** | Same model (LoRA on) | Separate HF model (LoRA on) |
162
- | **LoRA targets** | q/k/v/o/gate/up/down (attn+MLP) | + **in_proj/out_proj (38 Mamba layers)** |
163
- | **Trainable params** | ~0.4% | **1.03%** (41M params) |
164
- | **Reference policy** | Implicit in TRL | Explicit: LoRA disabled on same model |
165
- | **IS correction** | None (always on-policy) | `ρ_t = exp(policy_lp − vllm_lp)` |
166
- | **Architecture** | Transformer (standard) | Hybrid Mamba-2 + 4 Attention layers |
167
-
168
- **The key difference in practice:** routangseng's LoRA only trained the 4 attention layers and MLP projections — the 38 Mamba SSM layers (which handle ~90% of the sequence processing) were frozen. lex-ft v8 adds `in_proj`/`out_proj` to reach those layers, giving the recurrent state space model a chance to actually learn the interviewer style. This is why run6 reached **7.64/10 at step 20** vs run5 reaching **4.02/10 at step 110**.
169
-
170
- ---
171
-
172
- ## What IS (Importance Sampling) Means
173
-
174
- IS bridges the gap between the "generator" (vLLM, slightly stale) and "learner" (HF+LoRA, current):
175
-
176
- ```
177
- ρ_t = P_current(token_t) / P_vllm(token_t)
178
- ```
179
-
180
- Without IS, you'd be treating stale completions as if they were generated by the current policy — which can cause instability when LoRA updates are large. The PPO clipping then ensures we don't take gradient steps larger than the data supports.
181
-
182
- ---
183
-
184
- *Last updated: 2026-03-28 | Run6 step 27 ongoing*
 
docs/KAGGLE_VS_OURS_COMPARISON.md DELETED
@@ -1,106 +0,0 @@
1
- # Kaggle Notebook vs Our Approach — Deep Comparison
2
-
3
- ## Key Difference: 30B vs 4B Model
4
-
5
- The Kaggle competition uses **30B-A3B** (`nemotron-3-nano-30b-a3b-bf16`).
6
- We use **4B** (`NVIDIA-Nemotron-3-Nano-4B-BF16`).
7
-
8
- This matters because:
9
-
10
- ```
11
- 30B pattern: MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME
12
- 4B pattern: M-M-M-MM-M-M*-M-M*-M-M-M*-M-M-MM*-MMM-M-M-
13
- ```
14
-
15
- - 30B uses `M` (mamba), `E` (moe), `*` (attention) — all recognized by both transformers 4.x custom code AND transformers 5.x native code
16
- - 4B uses `-` (mlp) — ONLY recognized by transformers 4.x custom code. Native transformers 5.x chokes on it.
17
-
18
- **This is why the Kaggle code works with transformers 5.x and ours doesn't.** It's not a code difference — it's a model architecture difference.
19
-
20
- ## Detailed Comparison
21
-
22
- | Aspect | Kaggle (dennisfong) | Our Approach |
23
- |--------|-------------------|--------------|
24
- | **Model** | 30B-A3B (52 layers) | 4B (42 layers) |
25
- | **transformers** | 5.x (Kaggle env) | 4.48.3 (pinned) |
26
- | **torch** | 2.10.0 | 2.11.0+cu130 |
27
- | **GPU** | Kaggle (Blackwell) | GB10 (Blackwell SM 12.1) |
28
- | **rmsnorm_fn** | Pure PyTorch mock | Pure PyTorch mock (same approach) |
29
- | **is_fast_path_available** | Forced False after load | Mock makes it False at import |
30
- | **causal_conv1d** | Not explicitly mocked | Mocked |
31
- | **Training method** | SFTTrainer (trl 0.24) | Custom GRPO loop |
32
- | **LoRA rank** | 32 | 64 |
33
- | **LoRA alpha** | 16 | 256 |
34
- | **Learning rate** | 2e-4 | 5e-6 |
35
- | **Max seq len** | 1024 | 200 (max_new_tokens) |
36
- | **Gradient checkpointing** | Yes (use_reentrant=True) | No |
37
- | **trust_remote_code** | True | True |
38
-
39
- ## Kaggle's rmsnorm_fn — Simpler Than Ours
40
-
41
- ```python
42
- # Kaggle version — does NOT handle group_size or norm_before_gate correctly
43
- def _pure_rmsnorm_fn(x, weight, bias=None, z=None, eps=1e-5,
44
- group_size=None, norm_before_gate=True, upcast=True):
45
- dtype = x.dtype
46
- if upcast: x = x.float()
47
- variance = x.pow(2).mean(-1, keepdim=True)
48
- x_normed = x * torch.rsqrt(variance + eps)
49
- out = x_normed * weight.float()
50
- if bias is not None: out = out + bias.float()
51
- if z is not None: out = out * F.silu(z.float()) # always gate AFTER norm
52
- return out.to(dtype)
53
- ```
54
-
55
- Problems with Kaggle's version:
56
- 1. **Ignores `group_size`** — does full-dimension RMSNorm even when group normalization is requested
57
- 2. **Ignores `norm_before_gate` parameter** — always applies gate after norm, ignoring the flag
58
- 3. Works for the 30B because the 30B may not use group_size, or the error is small enough for SFT
59
-
60
- Our version handles both correctly (from the mamba_ssm reference implementation).
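To make the two flags concrete, here is a pure-Python sketch of a gated RMSNorm over a single vector that honors both `group_size` and `norm_before_gate`. It is written from the description above, not copied from the project's mock or from `mamba_ssm`; `gated_rmsnorm` is a hypothetical name and the bias term is omitted for brevity.

```python
import math


def silu(v):
    return v / (1.0 + math.exp(-v))


def gated_rmsnorm(x, weight, z=None, eps=1e-5, group_size=None, norm_before_gate=True):
    """RMSNorm over groups of `group_size` features, with an optional SiLU gate z."""
    n = len(x)
    group_size = n if group_size is None else group_size
    if n % group_size != 0:
        raise ValueError("feature dimension must be divisible by group_size")
    if z is not None and not norm_before_gate:
        # Gate first, then normalize the gated values.
        x = [xi * silu(zi) for xi, zi in zip(x, z)]
    out = []
    for start in range(0, n, group_size):
        group = x[start:start + group_size]
        rstd = 1.0 / math.sqrt(sum(v * v for v in group) / group_size + eps)
        out.extend(v * rstd * w for v, w in zip(group, weight[start:start + group_size]))
    if z is not None and norm_before_gate:
        # Normalize first, then gate the normalized values.
        out = [o * silu(zi) for o, zi in zip(out, z)]
    return out


ones = [1.0, 1.0, 1.0, 1.0]
full = gated_rmsnorm([1.0, 1.0, 10.0, 10.0], ones, eps=0.0)
grouped = gated_rmsnorm([1.0, 1.0, 10.0, 10.0], ones, eps=0.0, group_size=2)
gate_after = gated_rmsnorm([1.0, 2.0], [1.0, 1.0], z=[0.5, -1.0], norm_before_gate=True)
gate_before = gated_rmsnorm([1.0, 2.0], [1.0, 1.0], z=[0.5, -1.0], norm_before_gate=False)
```

In this toy input, full-dimension normalization shrinks the small features far more than group normalization does, which is the kind of discrepancy a mock that ignores `group_size` would introduce.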
61
-
62
- ## Kaggle's Key Trick: Post-Load Patch
63
-
64
- ```python
65
- # Force slow path AFTER model is loaded
66
- for name, mod in sys.modules.items():
67
- if "modeling_nemotron_h" in name:
68
- mod.is_fast_path_available = False
69
- ```
70
-
71
- This is important because with `trust_remote_code=True`, the model code imports `mamba_ssm` at module level. If the import succeeds (even with a mock), `is_fast_path_available` may be set to `True` if all the functions are non-None. The Kaggle code forces it False AFTER loading to ensure `torch_forward` is used.
72
-
73
- ## Kaggle's Environment
74
-
75
- Key detail: Kaggle provides a **custom Python environment** with specific Blackwell GPU patches:
76
- - Custom ptxas-blackwell binary
77
- - Triton backend patches
78
- - Pre-installed torch 2.10.0 with cu128
79
- - The Kaggle environment has `mamba_ssm` and `causal_conv1d` pre-installed (they just bypass the Triton kernels)
80
-
81
- This is NOT a vanilla transformers 5.x setup — it's a Kaggle-specific environment with hardware-specific patches.
82
-
83
- ## What We Should Take From Kaggle
84
-
85
- 1. ✅ **Pure PyTorch rmsnorm_fn** — already doing this
86
- 2. ✅ **Force `is_fast_path_available = False`** — should add explicitly
87
- 3. 🔄 **Gradient checkpointing with `use_reentrant=True`** — would save memory
88
- 4. 🔄 **SFTTrainer from trl** — cleaner than our custom loop for SFT tasks
89
- 5. ❌ **Their rmsnorm ignoring group_size** — we should NOT copy this (ours is more correct)
90
-
91
- ## Recommendation
92
-
93
- **Stick with transformers 4.48.3 for the 4B model.** Here's why:
94
-
95
- 1. The 4B model's `-` pattern is only understood by the custom code in transformers 4.x
96
- 2. Transformers 5.x native NemotronH doesn't support `mlp` as a block type
97
- 3. The Kaggle notebook works with 5.x because the 30B model has a different pattern
98
- 4. Our 4.48.3 setup produces correct outputs (CE loss 3.88, coherent generation)
99
- 5. Training is already running and producing results
100
-
101
- **If we wanted transformers 5.x**, we'd need to:
102
- - Modify the 4B model's `config.json` to change the pattern format
103
- - OR fix native transformers to support `mlp` block type
104
- - Both are more work than just using 4.48.3
105
-
106
- The Kaggle approach is valid for the 30B but doesn't apply to our 4B model.
 
docs/LEXFRIDMAN_INTERVIEWER_PLAN.md DELETED
@@ -1,183 +0,0 @@
1
- # Lex Fridman AI Interviewer — Project Plan
2
-
3
- ## Goal
4
-
5
- Fine-tune **NVIDIA Nemotron 3 Nano 4B** (a hybrid Mamba-2 + Attention architecture) into an AI interviewer that conducts conversations in the style of Lex Fridman. The model should:
6
-
7
- 1. **Ask thoughtful, concise follow-up questions** — not lecture, summarize, or monologue
8
- 2. **Reference the guest's expertise and prior statements** — show it's actually listening
9
- 3. **Stay brief** — Lex's questions are typically 30–80 words, not paragraphs
10
- 4. **Avoid filler** — no "Great question!", no generic transitions
11
- 5. **Match Lex's intellectual curiosity** — deep, sometimes philosophical, always genuine
12
-
13
- The target is a locally-deployable GGUF model served via llama.cpp that outperforms the base model on our interviewer eval benchmark.
14
-
15
- ## Eval Leaderboard
16
-
17
- ### v2 Eval (0-10 scale, 25 scenarios, 10 dimensions)
18
-
19
- | # | Model | Score | Notes |
20
- |---|-------|-------|-------|
21
- | 1 | **Nemotron 4B base** | **6.90/10** | No training — still the best |
22
- | 2 | Full SFT v4 (2ep, masked) | 5.54/10 | Best SFT — completion-only loss masking |
23
- | 3 | Full SFT v5 (3ep, packed) | 5.17/10 | 3rd epoch hurt (overfitting) |
24
- | 4 | Full SFT v3 (2ep, v4 data) | 3.96/10 | No loss masking → user prefix contamination |
25
-
26
- ### v1 Eval (0-5 scale, 20 scenarios, 5 dimensions)
27
-
28
- | # | Model | Score |
29
- |---|-------|-------|
30
- | 1 | Nemotron 4B base Q8 | 4.35/5 |
31
- | 2 | GPT-5.4 | 4.30/5 |
32
- | 3 | Full SFT v4 | 3.85/5 |
33
- | 4 | Full SFT v2 | 3.20/5 |
34
- | 5 | Full SFT v3 | 2.50/5 |
35
- | 6 | LoRA SFT v1 | 2.10/5 |
36
- | 7 | Full SFT v1 | 2.00/5 |
37
-
38
- ## SFT Verdict: Dead End
39
-
40
- **SFT causes catastrophic forgetting on this model.** It learns surface patterns (brevity +0.8, no filler +0.2) but destroys deeper capabilities:
41
- - guest_reference: 6.0 → 1.5 (-4.5)
42
- - specificity: 7.0 → 3.1 (-3.9)
43
- - interview_flow: 8.2 → 6.4 (-1.8)
44
- - depth: 6.1 → 4.7 (-1.4)
45
-
46
- 3rd epoch made it worse (5.54 → 5.17), confirming overfitting on 6,335 samples.
47
-
48
- ## Current Phase: GRPO (Reinforcement Learning)
49
-
50
- ### Strategy
51
- - Start from **base model** (6.90/10) — preserve all strengths
52
- - Use **LoRA targeting ALL 42 layers** (not just 4 attention layers)
53
- - Reward function targets weak dimensions without destroying strong ones
54
- - Anti-gaming measures prevent reward hacking
55
-
56
- ### LoRA Configuration (rank=32)
57
-
58
- **Key finding: LoRA can target ALL projections across all 42 layers, not just attention.**
59
-
60
- | Module | Layers | Shape | Dim Sum |
61
- |--------|--------|-------|---------|
62
- | in_proj (Mamba) | 21 | (17504, 3136) | 433,440 |
63
- | out_proj (Mamba) | 21 | (3136, 7680) | 227,136 |
64
- | up_proj (MLP) | 17 | (12544, 3136) | 266,560 |
65
- | down_proj (MLP) | 17 | (3136, 12544) | 266,560 |
66
- | q/k/v/o_proj (Attn) | 4 | various | 99,328 |
67
- | **Total dim_sum** | | | **1,293,024** |
68
-
69
- - **rank=32**: 41,376,768 trainable params (1.04% of model)
70
- - **alpha=64** (2× rank)
71
- - Optimizer memory: 0.33 GB (vs 31.8 GB for full fine-tune!)
72
- - Est. peak memory: ~58 GB (safe for 128 GB DGX Spark)
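The parameter count follows from `rank × Σ(shape[0] + shape[1])` applied to the table above. The short check below reproduces the arithmetic; the dictionary layout is only an illustration, while the shapes and layer counts are taken from this document.

```python
# (number of layers, [(dim_a, dim_b), ...]) per LoRA target, from the table above.
LORA_TARGETS = {
    "in_proj": (21, [(17504, 3136)]),
    "out_proj": (21, [(3136, 7680)]),
    "up_proj": (17, [(12544, 3136)]),
    "down_proj": (17, [(3136, 12544)]),
    "q/k/v/o_proj": (4, [(5120, 3136), (1024, 3136), (1024, 3136), (3136, 5120)]),
}
TOTAL_PARAMS = 3_973_556_832  # model size quoted in the architecture notes


def dim_sum(targets):
    """Sum of (shape[0] + shape[1]) over every adapted matrix in the model."""
    return sum(layers * sum(a + b for a, b in shapes)
               for layers, shapes in targets.values())


def lora_params(rank, targets=LORA_TARGETS):
    """Each adapted matrix adds rank * (shape[0] + shape[1]) trainable weights."""
    return rank * dim_sum(targets)


trainable = lora_params(32)
fraction_pct = 100.0 * trainable / TOTAL_PARAMS
```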
73
-
74
- Previous LoRA SFT failed because it only targeted q/k/v/o_proj = 4 layers. With in_proj/out_proj, LoRA now touches the Mamba layers too.
75
-
76
- ### Reward Function v3
77
-
78
- 4 components, validated against base model outputs (Pearson r=0.58 with eval v2):
79
-
80
- | Component | Weight | What it measures |
81
- |-----------|--------|-----------------|
82
- | R1: Gate | 30% | Clean turn (not meta, not user-prefix, not empty) |
83
- | R2: Question | 35% | Single focused question, ≤60 words, at end of response |
84
- | R3: Guest ref | 20% | Builds on guest content (word overlap, capped to prevent parrot gaming) |
85
- | R4: Penalties | 15% | Filler, generic, repetition, formulaic patterns |
86
-
87
- Anti-gaming measures:
88
- - Guest word overlap capped at 6 (prevents copy-paste)
89
- - Question must be in last sentence (prevents question-stuffing then rambling)
90
- - Parrot detection: penalizes if >50% response words are from guest
91
- - Long contiguous phrase detection (5+ gram copy)
92
- - Keyword stuffing penalty (3+ depth buzzwords in <30 words)
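Three of the anti-gaming checks listed above (overlap cap, parrot detection, 5-gram copy detection) could look roughly like this. Every function name and the tokenizer regex are hypothetical; the actual reward v3 code is not shown in this document and likely differs, for example by ignoring stopwords.

```python
import re


def _words(text):
    # Lowercased word tokens; a deliberately simple tokenizer (assumption).
    return re.findall(r"[a-z0-9']+", text.lower())


def capped_overlap(response, guest, cap=6):
    """Count distinct guest words reused in the response, capped to limit copy-paste gains."""
    return min(len(set(_words(response)) & set(_words(guest))), cap)


def parrot_fraction(response, guest):
    """Fraction of response words that also appear in the guest turn."""
    resp = _words(response)
    if not resp:
        return 0.0
    guest_vocab = set(_words(guest))
    return sum(w in guest_vocab for w in resp) / len(resp)


def has_copied_ngram(response, guest, n=5):
    """True if any n consecutive response words appear verbatim in the guest turn."""
    resp, src = _words(response), _words(guest)
    guest_ngrams = {tuple(src[i:i + n]) for i in range(len(src) - n + 1)}
    return any(tuple(resp[i:i + n]) in guest_ngrams for i in range(len(resp) - n + 1))


guest = "I think consciousness emerges from recurrent loops in the cortex"
parrot = "You said consciousness emerges from recurrent loops in the cortex, right?"
probe = "What would falsify that idea?"
```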
93
-
94
- Validation results:
95
- - Meta/CoT responses: -0.14 avg (correctly penalized)
96
- - Clean responses: +0.34 avg
97
- - Range: [-0.88, +0.68] (wide spread for GRPO)
98
- - High reward responses score 7.3/10 on eval vs 6.0/10 for low reward
99
-
100
- ### GRPO Training Config
101
-
102
- ```
103
- Model: Nemotron 3 Nano 4B (base, fresh)
104
- Method: GRPO with LoRA
105
- Reward: v3 (validated, anti-gaming)
106
- LoRA rank: 32 (1.04%, 41.4M params)
107
- LoRA alpha: 64
108
- LoRA targets: in_proj, out_proj, up_proj, down_proj, q_proj, k_proj, v_proj, o_proj
109
- LR: 5e-5 (10× full-FT per tinker-cookbook guidance, conservative for GRPO)
110
- Beta (KL): 0.04
111
- Batch size: 2
112
- Grad accum: 4 (eff batch = 8)
113
- Num generations: 2
114
- Max prompt: 384
115
- Max completion: 128
116
- Temperature: 0.9
117
- Prompts: 512
118
- Epochs: 1
119
- Est. peak mem: ~58 GB
120
- ```
121
-
122
- ## Data
123
-
124
- ### v4 Dataset (for SFT — completed)
125
- - 6,335 segments from 108 guests, 113 episodes
126
- - Quality filtered: min score 3, 100% ends-on-assistant, 100% last-asks-question
127
- - Source: `data/interview_segments_v4.jsonl` (pre-formatted), `data/interview_segments_v4_messages.jsonl` (raw messages)
128
-
129
- ### GRPO Prompts
130
- - Derived from v4 messages: system + last user message per segment
131
- - 512 prompts sampled for calibration run
132
- - Full 6,335 available for main training
133
-
134
- ## Architecture Notes
135
-
136
- ### Nemotron 3 Nano 4B Hybrid Architecture
137
- - 42 layers total: 21 Mamba-2 + 17 MLP + 4 Attention (layers 12, 17, 24, 32)
138
- - 3,973,556,832 parameters
139
- - Projection modules per type:
140
- - Mamba: `in_proj` (17504→3136), `out_proj` (3136→7680) — 21 layers each
141
- - MLP: `up_proj` (12544→3136), `down_proj` (3136→12544) — 17 layers each
142
- - Attention: `q_proj` (5120→3136), `k_proj` (1024→3136), `v_proj` (1024→3136), `o_proj` (3136→5120) — 4 layers each
143
-
144
- ### DGX Spark (GB10) Environment
145
- - 128 GB unified memory (CPU+GPU shared LPDDR5X)
146
- - ~273 GB/s memory bandwidth
147
- - CUDA 12.1, Compute 12.1 (Blackwell)
148
- - Full fine-tune GRPO OOMs (~112 GB peak) — must use LoRA
149
- - Unsloth quirks: forces `remove_unused_columns=True` (patched), no Flash Attention 2
150
-
151
- ## Technical Lessons
152
-
153
- ### SFT
154
- 1. **Completion-only loss masking is critical** for multi-turn chat — without it, model learns to predict user role tokens
155
- 2. **Chat template must match inference format** — `<think></think>` tags in training data
156
- 3. **Data quality > quantity** — filtering from 7,580 → 6,335 improved quality metrics
157
- 4. **3 epochs overfits** on 6,335 samples — 2 is the max for this dataset size
158
- 5. **Packing doesn't speed up on GB10** — compute-bound, not memory-bound
159
-
160
- ### LoRA
161
- 1. **Target ALL projection types** on hybrid architectures — not just attention
162
- 2. **Previous LoRA failures were from targeting only 4/42 layers**, not from LoRA being incompatible
163
- 3. **LoRA parameter count formula**: `rank × Σ(shape[0] + shape[1])` per tinker-cookbook
164
-
165
- ### GRPO
166
- 1. **Full fine-tune GRPO OOMs on DGX Spark** — policy + reference + optimizer + scoring = ~112 GB
167
- 2. **LoRA GRPO is ~58 GB** — optimizer drops from 31.8 GB to 0.33 GB
168
- 3. **Reward validation against eval is essential** before training
169
- 4. **Reward needs anti-gaming** — LLMs will exploit keyword matching, parroting, formulaic patterns
170
- 5. From routangseng: reward bounded [-1, 1], few strong signals > many weak ones, `prompt` column must be pre-formatted text
171
-
172
- ## Timeline
173
- - **Mar 18**: Project started, data crawl from 113 episodes
174
- - **Mar 19**: First LoRA SFT, environment debugging
175
- - **Mar 20**: 7-model eval comparison, full SFT v1-v2, v2-v4 datasets
176
- - **Mar 21 AM**: SFT v3-v5 experiments — SFT declared dead end
177
- - **Mar 21 PM**: Eval v2 built (10 dimensions, 25 scenarios), DGX Spark optimization investigation
178
- - **Mar 21 EVE**: GRPO setup — reward v3 validated, OOM diagnosed, LoRA solution confirmed
179
- - **Mar 22**: GRPO LoRA training with rank=32, all projections
180
-
181
- ---
182
-
183
- *Last updated: 2026-03-22*
 
docs/LLAMA_FINETUNE_INVESTIGATION.md DELETED
@@ -1,44 +0,0 @@
1
- # llama-finetune Investigation for Nemotron-3-Nano-4B
2
-
3
- > Date: 2026-03-24
4
-
5
- ## Two Bugs Found
6
-
7
- ### Bug 1: Buffer Underflow (small input files)
8
- - **Location**: `common/common.cpp:1696`
9
- - **Code**: `ndata = (tokens.size() - ne_datapoint - 1) / stride`
10
- - **Cause**: When `tokens.size() < ne_datapoint + 1` (input shorter than context), `ndata` underflows to near-max uint64 (~18.4 exabytes allocation)
11
- - **GitHub Issue**: [#15139](https://github.com/ggml-org/llama.cpp/issues/15139) (Aug 2025, open, no fix)
12
- - **Fix**: Trivial — add `if (tokens.size() <= ne_datapoint + 1) { error("input too short"); }`
13
- - **Workaround**: Use input file with more tokens than context length ✅ (we verified this works)
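The underflow in Bug 1 can be reproduced with modular arithmetic. This Python sketch emulates the unsigned 64-bit subtraction from the quoted C++ expression and shows the guarded variant suggested under **Fix**; the function names and the token counts in the example are illustrative assumptions.

```python
U64_MASK = (1 << 64) - 1


def ndata_unsigned(n_tokens, ne_datapoint, stride):
    """Emulates (tokens.size() - ne_datapoint - 1) / stride with uint64 wraparound."""
    return ((n_tokens - ne_datapoint - 1) & U64_MASK) // stride


def ndata_checked(n_tokens, ne_datapoint, stride):
    """Same computation, but rejects inputs that are too short instead of wrapping."""
    if n_tokens <= ne_datapoint + 1:
        raise ValueError("input too short for the requested context")
    return (n_tokens - ne_datapoint - 1) // stride


wrapped = ndata_unsigned(2, 64, 1)   # tiny input, context 64: wraps to ~2**64
normal = ndata_unsigned(200, 64, 1)  # enough tokens: ordinary result
```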
14
-
15
- ### Bug 2: Backward Pass Assert Failure (fundamental)
16
- - **Location**: `ggml/src/ggml.c:6998`
17
- - **Assert**: `!node->view_src || node->op == GGML_OP_CPY || node->op == GGML_OP_VIEW || node->op == GGML_OP_RESHAPE || node->op == GGML_OP_PERMUTE || node->op == GGML_OP_TRANSPOSE`
18
- - **Cause**: The NemotronH forward graph uses view tensors with operations not in the backward pass whitelist. Likely from the Mamba-2 SSM scan operation, which uses views for state manipulation.
19
- - **GitHub Issue**: [#15279](https://github.com/ggml-org/llama.cpp/issues/15279) (Aug 2025, open, no fix) — same assert for Saiga Nemo 12B
20
- - **Fix**: Non-trivial — requires adding backward pass support for SSM operations in GGML
21
- - **Workaround**: None. **llama-finetune cannot train Mamba/NemotronH models.**
22
-
23
- ## Reproduction
24
-
25
- ```bash
26
- # Bug 1: small input
27
- echo "Hello" > /tmp/small.txt
28
- llama-finetune -m nemotron-f16.gguf -f /tmp/small.txt -c 64
29
- # → ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 18446744073709547520
30
-
31
- # Bug 2: proper input (bypasses Bug 1)
32
- # Use 150K char training file
33
- llama-finetune -m nemotron-f16.gguf -f /tmp/train_proper.txt -c 128 -ngl 99
34
- # → GGML_ASSERT(!node->view_src || ...) at ggml.c:6998
35
- # → Crash in ggml_build_backward_expand → ggml_opt_build → llama_context::opt_epoch_iter
36
- ```
37
-
38
- ## Conclusion
39
-
40
- **llama-finetune does not support Mamba-2 / NemotronH architecture for training.**
41
-
42
- The backward pass graph builder in GGML cannot handle the SSM operations used by Mamba-2 layers. This is an upstream limitation, not a configuration issue. Both bugs are open on GitHub with no PRs or fixes as of 2026-03-24.
43
-
44
- llama.cpp supports NemotronH for **inference only** (merged Dec 2025, [PR #18058](https://github.com/ggml-org/llama.cpp/pull/18058)). Training support would require implementing backward passes for the SSM-specific GGML operations.
 
docs/LORA_V1_ANALYSIS.md DELETED
@@ -1,109 +0,0 @@
1
- # LoRA v1 vs SFT v5 — Deep Analysis
2
- *Generated: 2026-03-30*
3
-
4
- ## Results
5
-
6
- | Model | Score | 0/3 | 1/3 | 2/3 | 3/3 | on_topic | uses_guest | probing |
7
- |-------|-------|-----|-----|-----|-----|----------|------------|---------|
8
- | Base | 0.653 | 8% | 28% | 24% | 40% | 68% | 48% | 80% |
9
- | SFT-v5 (LoRA≈16, hidden) | 0.667 | 16% | 16% | 20% | 48% | 76% | 60% | **64%** ← damaged |
10
- | LoRA-v1 (r=64, explicit) | 0.733 | 4% | 20% | 28% | 48% | 72% | 56% | **92%** ← best |
11
-
12
- ## Finding 1: SFT v5 Was Not a True Full Fine-Tune
13
-
14
- `full_finetuning=True` was set, but Unsloth silently fell back to LoRA for NemotronH.
15
- Training log showed 10.1M/2.66B (0.38%) trainable — equivalent to r≈16 LoRA.
16
- LoRA v1: 40.5M/4.01B (1.01%) trainable — explicit r=64. **Both were LoRA.**
17
-
18
- ## Finding 2: Three Factors Drove the Performance Gap
19
-
20
- **A — Rank (r≈16 implicit → r=64 explicit)**
21
- 4x more trainable params → richer task subspace. r=64 can capture topic-tracking,
22
- guest-reference, and question depth simultaneously. r≈16 can only shift broad style.
23
-
24
- **B — Overfitting from 3 epochs at low capacity**
25
- SFT v5 ran 897 steps (3 epochs). At 0.38% capacity, it memorized surface question
26
- patterns ("How do you...") without preserving depth. Probing collapsed 80%→64%.
27
- LoRA v1 at 1 epoch (299 steps) reinforced the base instinct lightly: probing 80%→92%.
28
-
29
- **C — LR mismatch (1e-5 vs 2e-4)**
30
- LR=1e-5 is appropriate for full fine-tuning. For LoRA adapters starting from random,
31
- it's too low → slow, shallow adaptation. LR=2e-4 is the correct scale for LoRA r=64.
32
-
33
- ## Finding 3: The Probing Dimension Is the Key Discriminator
34
-
35
- - `on_topic`: mostly in the frozen 2.66B backbone, marginally trainable
36
- - `uses_guest`: strongly in training signal (96% of pairs have word overlap), both models learned it
37
- - `probing`: **the critical one** — base model is already good at it (80%); 3 epochs of SFT destroyed it; 1 epoch of LoRA improved it
38
-
39
- ## Finding 4: LoRA v1's 3 Failures Are All `uses_guest` Regressions
40
-
41
- Pattern: when guest uses domain-specific jargon (graph/constraint, meta-prompt, organisms),
42
- LoRA generalizes the concept but loses the exact vocabulary. Training data coverage for
43
- abstract/technical domains may be thinner. These are edge cases — 3/25 prompts.
44
-
45
- ## Finding 5: Training Data Is Well-Suited for LoRA
46
-
47
- - 4,772 pairs: 697 real Lex + 4,075 generated
48
- - 96% have ≥1 guest word in the question
49
- - Avg 2.4 word overlap → strong uses_guest signal
50
- - Mean question length: 16.1 words (models produce 14.8-15.2) — well matched
51
- - Signal is learnable in 1 epoch at r=64. More epochs = surface memorization.
52
-
53
- ## Config Comparison
54
-
55
- | | SFT v5 | LoRA v1 |
56
- |---|---|---|
57
- | Trainable params | 10.1M (0.38%) | 40.5M (1.01%) |
58
- | Effective rank | ~16 (implicit) | 64 (explicit) |
59
- | LR | 1e-5 | 2e-4 |
60
- | Epochs | 3 | 1 |
61
- | Steps | 897 | 299 |
62
- | Dropout | 0.05 | 0 (full Unsloth fusion) |
63
- | Batch (eff) | 24 | 16 |
64
-
65
- ## Next Step Recommendations
66
-
67
- 1. **uses_guest gap (56% vs 60% for SFT-v5)**: Try LoRA v2 with more aggressive
68
- vocabulary-echo examples in training data, or train on real-Lex-only (697 pairs)
69
- to see if generated pairs are diluting the exact-vocab signal.
70
-
71
- 2. **Probing is near ceiling (92%)**: The bottleneck is now uses_guest.
72
- Getting uses_guest from 56% to 70%+ with probing maintained would push score ~0.80.
73
-
74
- 3. **GRPO from LoRA v1**: Use reward_v11 with LoRA v1 as starting checkpoint.
75
- GRPO can directly optimize the uses_guest×probing joint objective.
76
-
77
- ---
78
-
79
- ## LoRA v2 Results (2026-03-30) — Filter + Upsample Experiment
80
-
81
- ### Config
82
- - Dataset: `data/sft_v6_train.jsonl` (6,933 pairs)
83
- - Removed 1,324 generic-opener generated pairs (32% of generated)
84
- - Upsampled real Lex 6× → 4,182 effective real examples (60% of training)
85
- - Generic opener rate: 0% (was 26% in v5)
86
- - Same LoRA: r=64, LR=2e-4, 1 epoch, gradient_checkpointing="unsloth"
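The dataset sizes in this config can be checked from the counts given earlier in this analysis (697 real Lex pairs, 4,075 generated pairs). The variable names are illustrative only.

```python
real_pairs = 697          # real Lex pairs in the v5 data
generated_pairs = 4075    # generated pairs in the v5 data
removed_generic = 1324    # generic-opener generated pairs dropped for v6
upsample_factor = 6

real_effective = real_pairs * upsample_factor
generated_kept = generated_pairs - removed_generic
total_pairs = real_effective + generated_kept

real_share = real_effective / total_pairs
removed_share = removed_generic / generated_pairs
```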
87
-
88
- ### Results
89
-
90
- | Model | Score | on_topic | uses_guest | probing |
91
- |-------|-------|----------|------------|---------|
92
- | Base | 0.653 | 68% | 48% | 80% |
93
- | **LoRA v1** (r=64, original data) | **0.733** | 72% | 56% | 92% |
94
- | LoRA v2 (filtered+upsampled) | 0.640 | 64% | **48%** | 80% |
95
-
96
- ### Finding: Filtering Didn't Help uses_guest
97
-
98
- uses_guest: 48% in LoRA v2, **identical to base**, i.e. unchanged by training.
99
- The filtering + upsampling hypothesis was wrong, or at least insufficient.
100
-
101
- **Why it failed:**
102
- 1. The template contamination theory predicted filtering generic openers would let the model learn vocab-echo. But the model's uses_guest behavior didn't change at all.
103
- 2. Upsampling real Lex 6× didn't help either — the 697 real pairs have mean overlap of only 1.76 (lower than generated 2.21), so more real Lex doesn't automatically mean more vocab-echo.
104
- 3. The fundamental issue: the model already "knows" how to reference guest vocabulary (it does so 48% of the time from base). The bottleneck is not training signal — it's something deeper in how the model decides when to echo vs generalize.
105
-
106
- **Conclusion:** The data-side interventions tried so far (filtering, upsampling, prompt engineering) did not push uses_guest beyond the 48-56% range. The mechanism appears to be encoded in model weights at a level that SFT can only partially reach via LoRA on the 4 attention layers.
107
-
108
- **LoRA v1 (0.733) remains best.** The path forward is reward_v12 GRPO from the LoRA v1 checkpoint — directly suppressing template probability via negative advantage signal.
109
-
 
docs/LORA_V2_NATIVE_RESULTS.md DELETED
@@ -1,66 +0,0 @@
1
- # LoRA v2 Native — Correct Kernel Forward Path (2026-04-02)
2
-
3
- ## Summary
4
-
5
- **Score: 0.760** — Best ever on functional eval. First fine-tune to clearly improve `uses_guest` without damaging `probing`.
6
-
7
- ## The Kernel Fix
8
-
9
- All previous NemotronH training on GB10 (Blackwell SM 12.1) used a broken forward path:
10
-
11
- - NVIDIA's custom `modeling_nemotron_h.py` (via `trust_remote_code=True`) hardcoded `is_fast_path_available = False`
12
- - This forced the naive `torch_forward` SSM scan which compounds numerical errors across 42 layers
13
- - Result: PPL ~2,126 vs vLLM's ~10-20 on the same weights
14
-
15
- **Fix:** Use the native NemotronH implementation built into transformers 5.3.0, with 3 code patches (items 1-3) and one config conversion (item 4):
16
-
17
- 1. **Config validator**: add `"mlp"` to valid block types (4B model has plain MLP layers, not MoE)
18
- 2. **MIXER_TYPES**: map `"mlp"` → `NemotronHMLP`
19
- 3. **block_type_to_mask**: add `"mlp": None`
20
- 4. **Config format**: convert `hybrid_override_pattern` string → `layers_block_type` list
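Item 4 amounts to expanding the pattern string into one block-type name per layer. A hedged sketch follows: the character meanings (`M`, `*`, `-`, `E`) come from the Kaggle comparison doc, while the exact strings the native config expects and the helper name `pattern_to_block_types` are assumptions.

```python
# Assumed mapping from pattern characters to block-type names.
BLOCK_TYPES = {"M": "mamba", "*": "attention", "-": "mlp", "E": "moe"}

# 4B pattern as quoted in the Kaggle comparison doc.
PATTERN_4B = "M-M-M-MM-M-M*-M-M*-M-M-M*-M-M-MM*-MMM-M-M-"


def pattern_to_block_types(pattern):
    """Expand a hybrid_override_pattern string into a per-layer list of block types."""
    try:
        return [BLOCK_TYPES[ch] for ch in pattern]
    except KeyError as exc:
        raise ValueError(f"unknown block character: {exc.args[0]!r}") from exc


layers = pattern_to_block_types(PATTERN_4B)
attention_layers = [i for i, kind in enumerate(layers) if kind == "attention"]
```

Counting the result gives 21 Mamba, 17 MLP and 4 attention blocks, with attention at zero-based indices 12, 17, 24 and 32.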
21
-
22
- The native implementation uses `cuda_kernels_forward` with `mamba_chunk_scan_combined` — production Triton kernels that work correctly on SM 12.1.
23
-
24
- ## Training Config
25
-
26
- | Parameter | Value |
27
- |-----------|-------|
28
- | Base model | NVIDIA-Nemotron-3-Nano-4B |
29
- | Forward path | Native transformers `cuda_kernels_forward` |
30
- | LoRA rank | 64 |
31
- | LoRA alpha | 128 |
32
- | Target modules | q/k/v/o/gate/up/down_proj |
33
- | Dataset | `sft_v5_train.jsonl` (4,772 pairs) |
34
- | Epochs | 1 |
35
- | LR | 2e-4 (cosine, 30-step warmup) |
36
- | Batch size | 2 × 8 = 16 |
37
- | Thinking | Disabled (`enable_thinking=False`) |
38
- | Runtime | 12 min 35 sec (299 steps) |
39
- | Avg train loss | **1.289** |
40
- | W&B | `lex-sft-lora-v2-native-r64` (run `qlt9jzdc`) |
41
-
42
- ## Eval Results (3-judge functional)
43
-
44
- | Model | Score | on_topic | uses_guest | probing |
45
- |-------|-------|----------|------------|---------|
46
- | Base | 0.753 | — | 58% | 80% |
47
- | LoRA v1 (broken path) | 0.733 | — | 56% | 92% |
48
- | LoRA v2 old (broken path) | 0.667 | — | 44% | 64% |
49
- | **LoRA v2 native** | **0.760** | **80%** | **68%** | **80%** |
50
-
51
- ## Key Findings
52
-
53
- - **uses_guest: 68%** — +10pp over base, +24pp over broken LoRA v2
54
- - **probing: 80%** — stable (broken LoRA v2 destroyed this to 64%)
55
- - Train loss 1.289 vs 21.86 on broken path — the model was actually learning meaningful patterns
56
- - Correct forward path is necessary for any training on GB10 Blackwell
57
-
58
- ## Files
59
-
60
- - Training script: `scripts/train_sft_lora_v2_native.py`
61
- - Launch script: `run_sft_lora_v2_native.sh`
62
- - Adapter: `lora/sft-lora-v2-native/`
63
- - Native config: `models/NVIDIA-Nemotron-3-Nano-4B/config_native.json`
64
- - Patched transformers files (in `.venv-train`):
65
- - `transformers/models/nemotron_h/configuration_nemotron_h.py`
66
- - `transformers/models/nemotron_h/modeling_nemotron_h.py`
 
docs/MAMBA_SSM_BUILD_NOTES.md DELETED
@@ -1,114 +0,0 @@
1
- # mamba-ssm Build Notes — DGX Spark (GB10, aarch64)
2
-
3
- Date: 2026-03-26
4
- Status: ✅ Successfully installed
5
-
6
- ---
7
-
8
- ## Problem
9
-
10
- On the DGX Spark, `pip install mamba-ssm` fails with:
11
-
12
- ```
13
- RuntimeError: The detected CUDA version (12.0) mismatches the version that was
14
- used to compile PyTorch (13.0). Please make sure to use the same CUDA versions.
15
- ```
16
-
17
- PyTorch in `.venv-train` is compiled for CUDA 13.0, but `/usr/bin/nvcc` points to a CUDA 12.0 toolchain:
18
-
19
- ```bash
20
- $ which nvcc
21
- /usr/bin/nvcc
22
- $ nvcc --version
23
- Cuda compilation tools, release 12.0, V12.0.140
24
- ```
25
-
26
- The DGX Spark actually has CUDA 13.0 at `/usr/local/cuda-13.0/`. The default symlink `/usr/local/cuda` → `/etc/alternatives/cuda` resolves to the wrong version.
27
-
28
- ---
29
-
30
- ## Fix
31
-
32
- Override `CUDA_HOME` and `PATH` to point at the correct CUDA version:
33
-
34
- ```bash
35
- cd /home/bobber/lex-ft
36
- source .venv-train/bin/activate
37
-
38
- CUDA_HOME=/usr/local/cuda-13.0 \
39
- PATH=/usr/local/cuda-13.0/bin:$PATH \
40
- TORCH_CUDA_ARCH_LIST="12.0" \
41
- pip install mamba-ssm causal-conv1d --no-build-isolation
42
- ```
43
-
44
- Notes:
45
- - `TORCH_CUDA_ARCH_LIST="12.0"`: compiles for SM 12.0, the architecture targeted here for the GB10 (Blackwell, reported as SM 12.1)
46
- - `--no-build-isolation` — uses the venv's torch for cpp_extension compatibility
47
- - Build takes ~45 minutes (9 CUDA kernel files for mamba-ssm, aarch64 compilation is slower)
48
-
49
- ---
50
-
51
- ## Installed Versions
52
-
53
- ```
54
- mamba_ssm-2.3.1-cp312-cp312-linux_aarch64.whl (351 MB)
55
- causal_conv1d-1.6.1
56
- ```
57
-
58
- Cached at: `~/.cache/pip/wheels/28/83/54/d45107838fec575b93f5d723f56351cee19a1b13bcd4ec9f3f`
59
-
60
- Future reinstalls in the same venv will use the cached wheel (no recompile).
61
-
62
- ---
63
-
64
- ## Verification
65
-
66
- ```python
67
- import causal_conv1d
68
- print(causal_conv1d.__version__) # 1.6.1
69
- print(causal_conv1d.causal_conv1d_fn) # <function causal_conv1d_fn at 0x...>
70
-
71
- import mamba_ssm
72
- print(mamba_ssm.__version__) # 2.3.1
73
- from mamba_ssm.ops.triton.selective_state_update import selective_state_update
74
- print(selective_state_update) # <function selective_state_update at 0x...>
75
- ```
76
-
77
- Both are non-None — the fast CUDA path is active.
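The same check can be scripted so it reports a status for all three symbols named in the fallback warning instead of raising on a missing package. This is a sketch; `fast_path_status` is a hypothetical helper, and the module paths are the ones used in the verification snippet above plus the top-level `causal_conv1d` exports.

```python
import importlib

# symbol name -> (module path, attribute)
FAST_PATH_SYMBOLS = {
    "selective_state_update": (
        "mamba_ssm.ops.triton.selective_state_update",
        "selective_state_update",
    ),
    "causal_conv1d_fn": ("causal_conv1d", "causal_conv1d_fn"),
    "causal_conv1d_update": ("causal_conv1d", "causal_conv1d_update"),
}


def fast_path_status():
    """Return {symbol: True/False} depending on whether each kernel entry point imports."""
    status = {}
    for name, (module_name, attr) in FAST_PATH_SYMBOLS.items():
        try:
            module = importlib.import_module(module_name)
            status[name] = getattr(module, attr, None) is not None
        except Exception:
            # Missing package or a failed CUDA/Triton import both count as unavailable.
            status[name] = False
    return status


status = fast_path_status()
fast_path_ok = all(status.values())
```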
78
-
79
- After import, the Nemotron model will no longer print:
80
- ```
81
- WARNING: The fast path is not available because one of
82
- (selective_state_update, causal_conv1d_fn, causal_conv1d_update) is None.
83
- Falling back to the naive implementation.
84
- ```
85
-
86
- ---
87
-
88
- ## What This Fixes
89
-
90
- Without `selective_state_update`, the decode step falls back to Python with BF16 arithmetic. This produces wrong SSM states vs training conditions, causing P(`</think>`) ≈ 0 — the model never closes its thinking block.
91
-
92
- With the real CUDA kernel:
93
- - Decode runs in float32 (matches llama.cpp behavior)
94
- - SSM state matches training distribution
95
- - P(`</think>`) should be non-trivial
96
- - Enables fully on-policy GRPO without llama.cpp server
97
-
98
- ---
99
-
100
- ## Files Affected
101
-
102
- When rebuilding a fresh venv on this machine, always use the CUDA 13.0 path:
103
-
104
- ```bash
- export CUDA_HOME=/usr/local/cuda-13.0
- export PATH=/usr/local/cuda-13.0/bin:$PATH
- ```
108
-
109
- Add to `.venv-train/bin/activate` if you want it persistent:
110
-
111
- ```bash
- echo 'export CUDA_HOME=/usr/local/cuda-13.0' >> .venv-train/bin/activate
- echo 'export PATH=/usr/local/cuda-13.0/bin:$PATH' >> .venv-train/bin/activate
- ```

docs/NEMOTRON_GB10_DEEP_DIVE.md DELETED
@@ -1,321 +0,0 @@
1
- # Nemotron-3-Nano-4B on GB10: Complete Deep Dive
2
-
3
- > Date: 2026-03-24
4
- > Purpose: Comprehensive record of all findings, blockers, and alternatives for training Nemotron-4B on NVIDIA DGX Spark (GB10, SM 12.1)
5
-
6
- ## Executive Summary
7
-
8
- **Nemotron-3-Nano-4B cannot be trained on GB10.** The model uses Mamba-2 (SSM) layers that require custom CUDA/Triton kernels (`mamba_ssm`, `causal_conv1d`). These kernels don't compile on SM 12.1 (too new), and no pure-PyTorch fallback produces correct outputs. The model works perfectly for **inference** via llama.cpp.
9
-
10
- ## Hardware Context
11
-
12
- - **GPU**: NVIDIA GB10 (Blackwell), SM 12.1, 128 GB unified memory
13
- - **SM 12.1** is the newest compute capability — released ahead of full software ecosystem support
14
- - **What works on GB10**: llama.cpp (hand-written CUDA, added SM 12.1 explicitly), standard PyTorch ops, vLLM 0.18.0 (with torch 2.10)
15
- - **What doesn't**: Any Python package with custom CUDA kernels compiled for SM ≤ 12.0
16
-
17
- ## The Mamba-2 Problem
18
-
19
- NemotronH architecture = 38 Mamba-2 layers + 4 Attention layers + MLP layers (42 total).
20
-
21
- The Mamba-2 layers require three custom kernel packages:
22
-
23
- ### 1. `mamba_ssm` (state-spaces/mamba)
24
- - **What**: Triton kernel for gated RMSNorm (`rmsnorm_fn`) and SSM scan (`mamba_chunk_scan_combined`)
25
- - **SM 12.1 status**: Won't compile (Triton JIT generates incompatible PTX)
26
- - **Why needed**: `rmsnorm_fn` is a gated RMSNorm with group normalization — used inside every Mamba-2 mixer. Not a standard operation.
27
-
28
- ### 2. `causal_conv1d` (Dao-AILab)
29
- - **What**: Fused causal 1D convolution CUDA kernel
30
- - **SM 12.1 status**: Binary `.so` has undefined symbols (`_ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib`)
31
- - **Why needed**: Applied before SSM scan in every Mamba layer
32
-
33
- ### 3. `deep_ep` (DeepSeek)
34
- - **What**: Expert parallelism for MoE models
35
- - **SM 12.1 status**: Only targets `sm_90`, aarch64 glibc headers incompatible with nvcc
36
- - **Why needed**: Hard dependency of NeMo RL's `[vllm]` extra (not actually needed for 4B dense model, but can't be skipped)
37
-
38
- ## All Training Paths Attempted
39
-
40
- ### Path 1: NeMo RL (bare metal)
41
- - **Result**: ❌ `deep_ep` compilation failure
42
- - **Detail**: Ray worker isolation creates fresh venvs, re-triggers the build
43
-
44
- ### Path 2: NeMo RL (Docker)
45
- - **Result**: ❌ Triton JIT failure in vLLM worker
46
- - **Detail**: Container has CUDA 12.9, but Triton generates incompatible PTX for SM 12.1
47
- - **Note**: Container warns "WARNING: Detected NVIDIA GB10 GPU, which may not yet be supported"
48
-
49
- ### Path 3: PyTorch + `trust_remote_code=True`
50
- - **Result**: ❌ `causal_conv1d` ImportError (broken `.so`)
51
- - **Fix attempted**: Uninstalled broken packages
52
- - **New result**: ❌ `mamba_ssm` hard-required at import time, `raise ImportError("mamba-ssm is required")`
53
-
54
- ### Path 4: PyTorch + native transformers (no `trust_remote_code`)
55
- - **Result**: ❌ Config parser broken
56
- - **Detail**: `hybrid_override_pattern` uses `-` for MLP layers, but native transformers only recognizes `mamba`, `attention`, `moe` — not `mlp`
57
- - **Fix attempted**: Patched `_pattern_to_list` to map `-` → `mlp`
58
- - **New result**: ❌ `layers_block_type contains invalid types: {'mlp'}`
59
- - **Root cause**: transformers 5.3.0's NemotronH implementation doesn't have MLP as a block type. The model's architecture doesn't match what native transformers expects.
60
-
61
- ### Path 5: PyTorch + mocked `mamba_ssm`
62
- - **Result**: ❌ Model loads but produces garbage outputs
63
- - **Detail**: Mocked `rmsnorm_fn` with exact reference implementation from mamba_ssm source. Also mocked `selective_state_update = None`, `causal_conv1d_fn = None`.
64
- - **Diagnostics**:
65
- - All 263 weights load correctly
66
- - All 42 layers produce non-trivial activations (no NaN, no collapse)
67
- - BUT: model generates only spaces/punctuation (greedy decoding produces whitespace)
68
- - Per-token log-probs: -4 to -12 (expected: -1 to -3 for working model)
69
- - Perplexity: ~4,595 (expected: ~10-20)
70
- - Top predictions at every position: `' '`, `'2'`, `'\n'`, `','` — not word tokens
71
- - **Root cause**: NVIDIA's `torch_forward` is a "naive implementation" (their comment) — it was never intended to match the Triton kernel outputs exactly. The SSM scan, chunking, and numerical precision differ enough that 38 layers of accumulated error produce garbage.
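The perplexity and log-prob figures in the Path 5 diagnostics are two views of the same quantity. A minimal sketch of the conversion, assuming natural-log probabilities and a hypothetical helper name `perplexity` (neither is stated in the notes): a perplexity of ~4,595 corresponds to a mean log-prob of about -8.4, inside the reported -4 to -12 range.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-prob), assuming natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-prob")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Illustrative values only, not measurements from the runs above.
healthy = [-1.0, -2.0, -3.0]  # mean -2.0
broken = [-8.0, -9.0, -8.3]   # mean near -8.4
print(perplexity(healthy))
print(perplexity(broken))
```

Running the same check on a few short prompts before training is one cheap way to catch a model that loads cleanly but predicts garbage.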
72
-
73
- ### Path 6: GRPO v4 (llama.cpp generation + PyTorch mock training)
74
- - **Result**: ❌ Ran 24 steps, loss oscillated wildly (-6.5 to +8.5), reward flat
75
- - **Detail**: The training loop was architecturally correct (on-policy, LoRA, GGUF sync). But the PyTorch model's garbage log-probs meant the policy gradient signal was noise.
76
- - **Training data**: 24 steps × ~17 min = 6 hours wasted
77
-
78
- ### Path 7: `llama-finetune` (GGML native training)
79
- - **Result**: ❌ Two bugs
80
- - **Bug 1**: Buffer underflow when input shorter than context (trivial fix, worked around with larger input)
81
- - **Bug 2**: `GGML_ASSERT(!node->view_src || ...)` failure in `ggml_build_backward_expand`. GGML's backward pass doesn't support SSM operations.
82
- - **GitHub issues**: [#15139](https://github.com/ggml-org/llama.cpp/issues/15139), [#15279](https://github.com/ggml-org/llama.cpp/issues/15279) — both open since Aug 2025, no fix
83
- - **Root cause**: llama.cpp added Mamba-2 **inference** support (Dec 2025, PR #18058) but not backward pass. Implementing SSM gradients in GGML is weeks of work.
84
-
85
- ## Diagnostic Evidence
86
-
87
- ### Layer-by-layer activation analysis (mock model)
88
- ```
- Block            Mean      Std       Status
- embedding        -0.0001   0.0128    OK
- block_00_mamba   -0.0001   0.0137    OK
- ...
- block_20_mlp     -0.0325   0.9453    OK
- ...
- block_41_mlp     -0.6250   24.8750   OK
- lm_head          -2.1406   1.9297    OK
- ```
98
- Activations are fine — variance grows normally through layers. No collapse, no NaN. The model architecture is correct, but the probability distributions are wrong.
99
-
100
- ### Token prediction analysis (mock model)
101
- ```
- pos 0: actual=' is' rank=116 | top5: ' '(10.94), '2'(10.88), '3'(10.12)
- pos 2: actual=' meaning' rank=239 | top5: ' '(12.50), '\n'(11.12), ' ('(10.88)
- pos 4: actual=' life' rank=4632 | top5: ' '(11.69), ' of'(10.56), ','(10.19)
- ```
106
- The model ranks common words at positions 100-5000+, while spaces and numbers are top-ranked. This is consistent with the SSM scan producing wrong state transitions.
107
-
108
- ### Memory test (dual model coexistence)
109
- ```
- vLLM subprocess: ~39 GB (0.3 × 130 GB)
- HF training model: ~9 GB (BF16 + LoRA)
- Total: ~55 GB | Free: ~76 GB
- ```
114
- Memory is not a bottleneck. Both models fit comfortably.
115
-
116
- ## What Works on GB10
117
-
118
- | Capability | Tool | Status |
- |-----------|------|--------|
- | Nemotron-4B inference | llama.cpp | ✅ Perfect, 30-60 tok/s |
- | Nemotron-4B inference | vLLM 0.18.0 (routangseng venv) | ✅ 3.5 tok/s |
- | Standard transformer training | PyTorch | ✅ No custom kernels needed |
- | Standard transformer inference | llama.cpp, vLLM | ✅ |
- | LoRA training (standard models) | PyTorch + PEFT | ✅ Tested with Qwen, Llama |
125
-
126
- ## What Would Fix This
127
-
128
- ### Short-term fixes (someone else needs to do the work)
129
- 1. **PyTorch 2.11+** with SM 12.1 in supported range → `mamba_ssm` recompile
130
- 2. **Triton update** with SM 12.1 PTX codegen → NeMo RL Docker works
131
- 3. **llama.cpp backward pass** for Mamba-2 → `llama-finetune` works
132
- 4. **NVIDIA's `torch_forward` fixed** to match Triton kernel → PyTorch mock works
133
-
134
- ### What we can do now
135
- 1. **Option 2**: SFT distillation — Nemotron as teacher (llama.cpp), standard transformer as student
136
- 2. **Option 3**: Rent A100/H100 ($1-3/hr), train with NeMo RL directly
137
- 3. **Ship base model** (4.35/5 eval score, already best-in-class)
138
-
139
- ## On A100/H100
140
-
141
- Every blocker is SM 12.1 specific. On A100 (SM 8.0) or H100 (SM 9.0):
142
- - `mamba_ssm`, `causal_conv1d`: ✅ Primary build targets
143
- - NeMo RL: ✅ NVIDIA trains their own models on H100 clusters
144
- - vLLM: ✅
145
- - flash-attn: ✅
146
- - Estimated cost: $10-25 for a full training run (4-8 hours on A100)
147
-
148
- ## Key Lessons
149
-
150
- 1. **SM 12.1 is ahead of the software ecosystem.** The DGX Spark hardware works, but the Python ML toolchain hasn't caught up. Every custom CUDA kernel package needs to add SM 12.1 support independently.
151
-
152
- 2. **llama.cpp works because it controls the whole stack.** One team, one codebase, added SM 12.1 to CMakeLists.txt, done. The Python ecosystem is a chain of 6+ independent projects that all need to update.
153
-
154
- 3. **A model that loads correctly can still be completely broken.** All 263 weights loaded, all 42 layers had non-trivial activations, but the outputs were garbage. Always validate generation quality before training.
155
-
156
- 4. **NVIDIA's `torch_forward` is not a production fallback.** It's labeled "naive implementation" for a reason. It produces directionally correct activations but wrong probability distributions.
157
-
158
- 5. **Mocking CUDA kernels is dangerous.** Even with the exact reference implementation from the same author, the mock didn't match. Numerical precision differences compound across 38 Mamba layers.
159
-
160
- ## A100 Colab Reference Test (2026-03-24 22:00-23:00 UTC)
161
-
162
- ### Setup
163
- - Google Colab A100-SXM4-40GB
164
- - SSH tunnel via bore (bore.vexorium.net)
165
- - Installed real `mamba_ssm` 2.3.1 + `causal_conv1d` 1.6.1 (pre-built wheels saved to `lex-ft/wheels/`)
166
- - transformers 5.0.0, torch 2.10.0+cu128
167
-
168
- ### Critical Finding: Model is Broken in HuggingFace on ALL GPUs
169
-
170
- **Even on A100 with real CUDA kernels, the model generates garbage:**
171
-
172
- ```
- RAW TEXT: "What is the meaning of life? 22,,. 22, 1 "
- CHAT TEMPLATE: "????????????????????????????????????????"
- ```
176
-
177
- Top-10 predictions at last position (A100, real CUDA kernels):
178
- ```
- '\n' logit=14.12
- ' ' logit=11.06
- '2' logit=10.94
- '\n\n' logit=10.88
- '1' logit=10.69
- '3' logit=10.56
- ```
186
-
187
- The model predicts newlines, spaces, and numbers — not words. **This happens on A100 with `is_fast_path_available=True` and `cuda_kernels_forward` active.**
188
-
189
- ### What This Means
190
-
191
- 1. **The mock was NOT the problem.** We spent 30+ hours blaming the `rmsnorm_fn` mock and SM 12.1 toolchain, but the HuggingFace loading itself is broken.
192
- 2. **llama.cpp works perfectly** with the same model weights (BF16 GGUF), proving the weights are correct.
193
- 3. **The bug is in HuggingFace `from_pretrained` + `trust_remote_code=True`** for this specific model on transformers 5.0.0.
194
-
195
- ### Layer-by-Layer Comparison (A100 vs GB10)
196
-
197
- Reference tensors captured and compared:
198
- - Embedding: ✅ Perfect match (0.0 diff)
199
- - block_00_mamba: ❌ 27% relative difference (diverges immediately)
200
- - All subsequent layers: ❌ Increasing divergence
201
-
202
- But since the A100 model ALSO produces garbage, the divergence between A100 and GB10 is between **two broken implementations**, not "correct vs broken."
203
-
204
- ### CUDA vs torch_forward on A100
205
-
206
- Single Mamba layer comparison:
207
- ```
- cuda_kernels_forward: mean=-0.000183, std=0.039062
- torch_forward: mean=-0.000233, std=0.041504
- Max diff: 0.252930, Relative diff: 48.7%
- ```
212
-
213
- The two paths produce significantly different outputs, but **neither produces correct model behavior.**
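The notes do not say how "Max diff" and "Relative diff" were computed. A minimal sketch of one common convention, assuming flattened activations and the L2 norm of the difference divided by the L2 norm of the reference (the function names are hypothetical):

```python
import math


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


def relative_diff(candidate, reference):
    """L2 norm of (candidate - reference) divided by the L2 norm of the reference."""
    num = math.sqrt(sum((x - y) ** 2 for x, y in zip(candidate, reference)))
    den = math.sqrt(sum(y * y for y in reference))
    if den == 0.0:
        raise ValueError("reference vector has zero norm")
    return num / den


# Toy activations, not the captured reference tensors.
reference = [0.10, -0.20, 0.05, 0.00]
candidate = [0.12, -0.15, 0.05, -0.03]
print(max_abs_diff(candidate, reference), relative_diff(candidate, reference))
```

Whichever definition is used, a layer-level mismatch says nothing about which path is correct, which is why the comparison against llama.cpp output matters more here.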
214
-
215
- ### Possible Root Causes (to investigate)
216
-
217
- 1. **transformers version incompatibility**: The model's custom code may require a specific older transformers version
218
- 2. **Weight mapping bug**: The HF repo weights may be mapped to wrong layers by `from_pretrained`
219
- 3. **Config mismatch**: The `config.json` in the BF16 HF repo may differ from what the model code expects
220
- 4. **Missing post-processing**: The model may need specific initialization that `from_pretrained` skips
221
-
222
- ### Artifacts Saved
223
- - Pre-built wheels: `lex-ft/wheels/mamba_ssm-2.3.1-cp312-cp312-linux_x86_64.whl` (509 MB)
224
- - Pre-built wheels: `lex-ft/wheels/causal_conv1d-1.6.1-cp312-cp312-linux_x86_64.whl` (243 MB)
225
- - Reference tensors: `lex-ft/reference/reference_tensors.pt` (49 MB) — A100 activations for 3 test texts
226
- - Colab notebook: `lex-ft/notebooks/capture_reference_tensors.ipynb`
227
-
228
- ## Updated Conclusion (2026-03-24 23:00 UTC)
229
-
230
- **The root cause was misidentified.** We blamed SM 12.1 and the `rmsnorm_fn` mock for 30+ hours, but the A100 test proves:
231
-
232
- 1. The model generates garbage on A100 too (real CUDA kernels, `is_fast_path_available=True`)
233
- 2. The `torch_forward` and `cuda_kernels_forward` do diverge (49%), but neither is correct
234
- 3. llama.cpp generates perfect text with the same weights
235
-
236
- **The real bug is in HuggingFace transformers' loading/inference of this model**, not in GPU compatibility. This completely changes the path forward:
237
-
238
- - **If we fix the HF loading bug**: Training works on ANY GPU (including GB10 with mock)
239
- - **If we can't fix it**: Option 2 (SFT distillation) or Option 3 (cloud + NeMo RL) remain viable
240
-
241
- ### Next Steps
242
- 1. Investigate why HF `from_pretrained` produces garbage while llama.cpp works
243
- 2. Check if NVIDIA's NeMo toolkit loads this model correctly (it should — they train with it)
244
- 3. Check if the HF repo `nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16` has a known issue or requires specific transformers version
245
- 4. Compare weight names/shapes between HF and GGUF to find mapping errors
246
-
247
- ---
248
-
249
- ## Q8 vs BF16 Mismatch Quantification (2026-03-25 03:35 UTC)
250
-
251
- ### Test Setup
252
- - llama.cpp Q8 server: generates greedy next token at each position
253
- - PyTorch BF16 model (transformers 4.48.3 + mock `torch_forward`): computes log-probs
254
- - 3 test texts, 10 positions each
255
-
256
- ### Results
257
-
258
- | Metric | Value |
- |--------|-------|
- | Q8-BF16 top-1 agreement | **27%** |
- | Q8-BF16 top-5 agreement | **43%** |
- | PyTorch BF16 perplexity | **250-443** (expected: 10-50 for working model) |
263
-
264
- ### Position-Level Examples
265
- ```
- Text: "The meaning of life is to find purpose..."
- pos 3: actual=' is' PT=' of' Q8=' is' PT_lp=-3.91
- pos 5: actual=' find' PT=' the' Q8=' find' PT_lp=-5.07
- pos 6: actual=' purpose' PT=' a' Q8=' your' PT_lp=-10.59
- ```
271
-
272
- **llama.cpp Q8 predictions are much closer to ground truth than PyTorch BF16.** Q8 predicts "is", "find" correctly; PyTorch predicts "of", "the" — generic tokens, not context-appropriate ones.
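The agreement percentages can be reproduced with a small counting routine. A minimal sketch, assuming "top-k agreement" means the llama.cpp greedy token appears among the PyTorch model's k highest-ranked tokens at the same position (the notes do not define it); `topk_agreement` and the toy rankings are hypothetical:

```python
def topk_agreement(reference_tokens, candidate_rankings, k):
    """Fraction of positions whose reference token is in the candidate's top-k list."""
    if not reference_tokens or len(reference_tokens) != len(candidate_rankings):
        raise ValueError("inputs must be non-empty and the same length")
    hits = sum(
        1
        for ref, ranked in zip(reference_tokens, candidate_rankings)
        if ref in ranked[:k]
    )
    return hits / len(reference_tokens)


# Toy data: only the first entry of each ranking comes from the examples above;
# the remaining entries are invented for illustration.
q8_tokens = [" is", " find", " your"]
pt_rankings = [
    [" of", " is", " the"],
    [" the", " a", " find"],
    [" a", " the", " of"],
]
print(topk_agreement(q8_tokens, pt_rankings, k=1))
print(topk_agreement(q8_tokens, pt_rankings, k=3))
```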
273
-
274
- ### Root Cause: `torch_forward` SSM Scan is Numerically Wrong
275
-
276
- The PyTorch perplexity of 250-443 on simple English text (should be 10-50) confirms that the `torch_forward` naive SSM implementation produces **wrong probability distributions**, not just slightly different ones.
277
-
278
- This is NOT a quantization mismatch (Q8 vs BF16). Even comparing BF16 PyTorch against the same BF16 weights in llama.cpp would show the same divergence — the issue is `torch_forward` vs llama.cpp's C++/CUDA Mamba-2 kernels computing different results.
279
-
280
- Earlier evidence supports this:
281
- - A100 reference test: block_00_mamba already diverges 27% from torch_forward
282
- - `cuda_kernels_forward` vs `torch_forward` on A100: 49% relative difference on same layer
283
- - NVIDIA's code labels this path "naive implementation" — it was never intended for production
284
-
285
- ### Implications for Training
286
-
287
- **GRPO with Q8 generation + BF16 torch_forward training is fundamentally unstable** because:
288
- 1. The model that generates (llama.cpp) and the model that computes gradients (PyTorch) disagree on 73% of top-1 predictions
289
- 2. Policy ratios `π_new/π_old` are noise when the two probability distributions barely overlap
290
- 3. This explains the oscillating loss (-6.2 to +3.9) and exploding grad norms (644-1207) in all GRPO runs
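To see why mismatched log-probs turn the policy ratio into noise, here is a minimal sketch of the PPO-style clipped term that GRPO-like objectives use. The clipping range `eps=0.2` and the function names are assumptions, not values taken from the training scripts:

```python
import math


def policy_ratio(logp_new, logp_old):
    """Importance ratio pi_new / pi_old computed from per-token log-probs."""
    return math.exp(logp_new - logp_old)


def clipped_term(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate for a single token."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


# Illustrative log-probs: the generator assigned the sampled token a high
# probability, while the training-side forward pass scored it very low.
logp_generator = -0.5
logp_trainer = -10.59
ratio = policy_ratio(logp_trainer, logp_generator)
print(ratio, clipped_term(ratio, advantage=1.0))
```

When the generating model and the training model are the same network, the ratio starts near 1 and the clip rarely binds. When they disagree this much, nearly every token sits far outside the clip range from the first step.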
291
-
292
- **SFT training is also compromised** because the forward pass computes wrong CE loss:
293
- - CE loss 3.88 with 4.48.3 (looked reasonable but perplexity 250-443 is NOT reasonable)
294
- - The model "learns" but from wrong gradients
295
- - This is why the smoke test SFT produced "assistant assistant assistant" — the model was memorizing surface patterns, not learning from correct probability gradients
296
-
297
- ## Final Definitive Conclusion (2026-03-25)
298
-
299
- **Nemotron-3-Nano-4B cannot be correctly trained on GB10 with ANY approach** because the `torch_forward` Mamba-2 SSM scan produces numerically incorrect results. This is true regardless of:
300
- - transformers version (4.48.3 or 5.x)
301
- - GPU (GB10 or A100 — A100 with `torch_forward` has the same issue)
302
- - Training method (GRPO, SFT, custom loop)
303
- - Mock implementation (Kaggle-style, our version, any pure PyTorch)
304
-
305
- The ONLY correct training paths require the real `mamba_ssm` CUDA/Triton kernels. Those kernels build on established targets such as SM 8.0 (A100) and SM 9.0 (H100), but NOT on GB10 (SM 12.1).
306
-
307
- ### Viable Paths Forward
308
-
309
- 1. **Cloud A100 training** ($10-25): Install transformers 4.48.3 + real `mamba_ssm` kernels on A100. Train with NeMo RL or SFTTrainer. Deploy result on GB10 via llama.cpp.
310
-
311
- 2. **SFT distillation to standard transformer**: Use llama.cpp Nemotron for generation (works perfectly), train a Qwen/Llama student model (standard transformers, no Mamba-2) on that data. No `torch_forward` needed.
312
-
313
- 3. **Ship base model**: Nemotron-4B base scores 4.35/5. Already best-in-class.
314
-
315
- ---
316
-
317
- *Total investigation time: ~40 hours across 2026-03-23 to 2026-03-25*
318
- *Approaches tried: 9*
319
- *Lines of test code written: ~4,000*
320
- *Training steps run: 33 (all from wrong gradients)*
321
- *Key finding: `torch_forward` SSM scan is numerically wrong, not a workaround-able issue*

docs/NEMO_RL_SETUP_NOTES.md DELETED
@@ -1,168 +0,0 @@
1
- # NeMo RL Setup Notes — GB10
2
-
3
- > Date: 2026-03-23
4
-
5
- ## Installation Status
6
-
7
- ### ✅ Base NeMo RL (v0.5.0rc0)
8
- - Cloned to `/home/bobber/nemo-rl` (with submodules)
9
- - Python 3.12, torch 2.9.0, uv 0.11.0
10
- - `uv venv` + `uv run python -c "import nemo_rl"` → works
11
- - 215 packages installed
12
-
13
- ### ✅ vLLM 0.11.2 (manually installed)
14
- - `uv pip install vllm` → installed vLLM 0.11.2
15
- - NemotronH model support: `nemotron_h.py` exists ✅
16
- - Mamba2 support: `mamba2.py` exists ✅
17
-
18
- ### ❌ `[vllm]` extra failed
19
- - `deep_ep` (DeepSeek Expert Parallelism) fails to compile on aarch64
20
- - Targets `sm_90` only, `__builtin_dynamic_object_size` not found in glibc headers with nvcc
21
- - This is a multi-node MoE optimization — **not needed** for single-GPU 4B training
22
- - Workaround: installed vLLM directly without deep_ep
23
-
24
- ### ❌ Not yet tested
25
- - `flash-attn`, `mamba-ssm`, `causal-conv1d` (the `[fsdp]` extra)
26
- - These had build issues in the routangseng venv too
27
- - NeMo RL's DTensor training backend might need them
28
- - vLLM has its own Mamba kernels that work on GB10
29
-
30
- ## Architecture
31
-
32
- NeMo RL's GRPO flow:
33
- 1. **Generation**: vLLM generates completions (subprocess, ~39 GB)
34
- 2. **Training**: DTensor or Megatron backend trains the model (main process)
35
- 3. **Weight sync**: After training step, weights are synced to vLLM
36
- 4. **Repeat**: On-policy — same model generates and trains
37
-
38
- Key config file: `examples/configs/grpo_math_1B.yaml`
39
- - Uses `Qwen/Qwen2.5-1.5B` by default
40
- - Single GPU mode: `cluster.gpus_per_node: 1`
41
- - DTensor backend (PyTorch native)
42
- - vLLM generation with `gpu_memory_utilization: 0.6`
43
-
44
- ## LoRA GRPO for Nemotron-3-Nano-4B — Feasibility Analysis
45
-
46
- ### ✅ Verdict: Feasible, with caveats
47
-
48
- NeMo RL has **first-class LoRA GRPO support for NemotronH** architecture. NVIDIA ships a recipe for the 30B-A3B variant. The 4B model should work with adaptations.
49
-
50
- ### How NeMo RL's LoRA GRPO Works
51
-
52
- 1. **Model loading**: `AutoModelForCausalLM.from_pretrained()` with `trust_remote_code=True`
53
- 2. **LoRA injection**: `apply_lora_to_linear_modules()` wraps selected `nn.Linear` layers with `LinearLoRA`
54
- 3. **Training**: DTensor backend (PyTorch FSDP2) trains only LoRA params
55
- 4. **Weight sync to vLLM**: LoRA weights are **merged back** into base weights before sending to vLLM
56
- - `_maybe_merge_lora_weight()` computes `W + B×A × (alpha/dim)`
57
- - vLLM always sees the full merged model — no separate LoRA loading needed
58
- 5. **On-policy**: Same (merged) model generates and trains each step
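The merge in step 4 is a single scaled matrix product added to the base weight. A minimal pure-Python sketch of `W + B×A × (alpha/dim)` on tiny matrices follows; `merge_lora` and `matmul` are hypothetical helpers written for illustration, not NeMo RL's implementation:

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(w, lora_b, lora_a, alpha, dim):
    """Return W + (B @ A) * (alpha / dim).

    Shapes: W is (out, in), B is (out, dim), A is (dim, in).
    """
    scale = alpha / dim
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


w = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]   # rank-1 adapter, so dim = 1
lora_a = [[3.0, 4.0]]
print(merge_lora(w, lora_b, lora_a, alpha=4, dim=1))
```

Because the generator only ever receives the merged matrix, it needs no LoRA-aware code path.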
59
-
60
- ### Reference: NVIDIA's 30B-A3B LoRA Recipe
61
-
62
- ```yaml
- # grpo-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
- policy:
-   model_name: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16
-   dtensor_cfg:
-     lora_cfg:
-       enabled: true
-       dim: 128 # rank
-       alpha: 512 # scaling = alpha/dim = 4x
-       exclude_modules: ['*out_proj*'] # ← KEY: Mamba2 out_proj has no gradient with CUDA kernels
-       match_all_linear: false
-       use_triton: false
- ```
75
-
76
- ### GB10-Specific Advantages for LoRA
77
-
78
- On GB10, `causal_conv1d` and `mamba_ssm` CUDA kernels **won't build** (SM 12.1 incompatibility). The model falls back to `torch_forward` (pure PyTorch). This actually **helps** us:
79
-
80
- - `torch_forward` path: all ops are standard PyTorch → **all layers have gradients**, including `out_proj`
81
- - 30B recipe excludes `out_proj` because `cuda_kernels_forward` doesn't backprop through it
82
- - On GB10 with `torch_forward`: we can **include** `out_proj` → train more of the model
83
-
84
- ### DTensor Parallelization for NemotronH
85
-
86
- NeMo RL has explicit NemotronH support in `nemo_rl/models/dtensor/parallelize.py`:
87
- - Custom `_parallelize_nm5_h()` function
88
- - Shards MLP layers: `mixer.up_proj` (Colwise), `mixer.down_proj` (Rowwise)
89
- - Mamba layers are NOT tensor-parallel sharded (they can't be easily split)
90
- - Activation checkpointing supported for both MLP and Mamba layers
91
- - For single GPU: no TP needed, just FSDP2
92
-
93
- ### Proposed 4B Single-GPU LoRA Config
94
-
95
- ```yaml
- defaults: grpo_math_1B.yaml
-
- policy:
-   model_name: /home/bobber/lex-ft/models/NVIDIA-Nemotron-3-Nano-4B
-   tokenizer:
-     name: /home/bobber/lex-ft/models/NVIDIA-Nemotron-3-Nano-4B
-   train_global_batch_size: 16
-   train_micro_batch_size: 1
-   logprob_batch_size: 1
-   max_total_sequence_length: 800 # based on interview segment length data
-   dtensor_cfg:
-     lora_cfg:
-       enabled: true
-       dim: 64 # smaller rank than 30B (4B model needs less capacity)
-       alpha: 256 # scaling = 4x
-       # Do NOT exclude out_proj on GB10: torch_forward gives us gradients everywhere
-       exclude_modules: []
-       match_all_linear: true # apply LoRA to ALL linear layers (Mamba + Attention + MLP)
-       use_triton: false # no flash-attn on GB10
-     activation_checkpointing: true # save memory
-     cpu_offload: false
-   sequence_packing:
-     enabled: false # start simple
-   generation:
-     max_new_tokens: 800
-     vllm_cfg:
-       gpu_memory_utilization: 0.3 # tested safe value for GB10
-       enforce_eager: true # skip CUDA graphs to save memory
-
- cluster:
-   gpus_per_node: 1
- ```
128
-
129
- ### Memory Estimate (LoRA)
130
-
131
- From dual memory test:
132
- - vLLM subprocess: ~39 GB (0.3 utilization)
133
- - Base model (BF16): ~8 GB
134
- - LoRA params (rank 64, all linear): ~0.2 GB
135
- - LoRA optimizer states: ~0.4 GB
136
- - Activations/gradients: ~5-10 GB (with activation checkpointing)
137
- - **Total: ~53-58 GB | Free: ~73-78 GB** ← comfortable
138
-
139
- ### Known Risks
140
-
141
- 1. **causal_conv1d import error**: The model's `modeling_nemotron_h.py` does `from causal_conv1d import ...` at import time. It has a fallback (`causal_conv1d_fn = None`), but the import itself may fail with `ImportError` if a broken .so exists. May need to patch or mock.
142
-
143
- 2. **transformers version**: NeMo RL pins `transformers==4.57.1` (no native NemotronH). Uses `trust_remote_code=True` which loads NVIDIA's custom code. This is the intended path for NemotronH.
144
-
145
- 3. **vLLM 0.11.2 vs 0.18.0**: NeMo RL pins vLLM 0.11.2. Our dual memory test used 0.18.0 from the routangseng venv. Need to verify NemotronH works in 0.11.2 (it has `nemotron_h.py`, should be fine).
146
-
147
- 4. **torch_forward speed**: Pure PyTorch Mamba is slower than CUDA kernels. Training will be slower than on an A100/H100 with full kernel support. But it will be **correct**.
148
-
149
- 5. **Weight sync overhead**: Each GRPO step merges LoRA → syncs to vLLM → generates → trains. The merge is cheap (matrix multiply), but vLLM restart may not be needed if using colocated mode.
150
-
151
- ## Next Steps
152
-
153
- 1. **Create the GRPO config** (yaml above as starting point)
154
- 2. **Write custom reward function** for interviewer quality
155
- 3. **Test with a simple SFT run first** to validate the pipeline
156
- 4. **Then run LoRA GRPO** with the interviewer reward
157
-
158
- ## Key Advantages of NeMo RL over Custom GRPO
159
-
160
- | Feature | Custom GRPO v3 | NeMo RL |
- |---------|----------------|---------|
- | On-policy | ❌ llama.cpp ≠ HF | ✅ vLLM generates, same model trains |
- | KL reference | ❌ Missing | ✅ Built-in reference policy |
- | Architecture | ❌ LoRA only touched 4/42 layers | ✅ LoRA on ALL linear layers (torch_forward) |
- | Weight sync | ❌ Manual merge every N steps | ✅ Automatic merge+sync per step |
- | LoRA GRPO | ❌ Not supported | ✅ DTensor LoRA GRPO with merge-to-vLLM |
- | Tested on Nemotron | ❌ No | ✅ NVIDIA ships 30B-A3B recipe |
- | Mamba gradients | ❌ Only 4 attention layers | ✅ All 42 layers via torch_forward |

docs/ONNX_RETROSPECTIVE.md DELETED
@@ -1,404 +0,0 @@
1
- # ONNX Model Export & WebGPU Deployment — Retrospective
2
-
3
- ## Project
4
- Deploy Nemotron-3-Nano-4B (GRPO v12 fine-tuned LoRA) as a browser-based WebGPU chat app via HuggingFace Spaces using transformers.js.
5
-
6
- - **Space**: `bobber/lex-interviewer-chat` (static HF Space)
7
- - **Model**: `bobber/lex-interviewer-nemotron-4b-grpo-v12`
8
- - **Reference**: `onnx-community/NVIDIA-Nemotron-3-Nano-4B-BF16-ONNX` + `webml-community/Nemotron-3-Nano-WebGPU`
9
- - **Date**: 2026-03-31
10
-
11
- ---
12
-
13
- ## Timeline of Issues
14
-
15
- ### 1. WASM 404 / asyncify.mjs (ort-web version conflict)
16
- - **Symptom**: 404 on `ort-wasm-simd-threaded.asyncify.mjs`
17
- - **Cause**: `package.json` had ort-web pinned to 1.16.3, which doesn't ship `asyncify.mjs`
18
- - **Fix**: Remove ort-web override; let transformers.js 4.0.0-next.8 use its bundled ort-web 1.25.0-dev
19
-
20
- ### 2. Module.MountedFiles not available
21
- - **Symptom**: `Failed to load external data file "model_q4.onnx_data", error: Module.MountedFiles is not available`
22
- - **Cause**: Missing `transformers.js_config` in model's `config.json`. Without `use_external_data_format`, transformers.js falls back to the old Emscripten `Module.MountedFiles` API (removed in ort-web 1.25+)
23
- - **Fix**: Add `transformers.js_config: { use_external_data_format: { "model_q4.onnx": 2 } }` to config.json
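The fix for issue 2 is a small edit to `config.json`. A minimal sketch of that edit on an in-memory dict; the helper name `add_external_data_config` and the stand-in config contents are hypothetical, while the key names and the value `2` come from the fix described above:

```python
import json


def add_external_data_config(config, onnx_file="model_q4.onnx", num_chunks=2):
    """Return a copy of config with the transformers.js external-data entry set."""
    updated = json.loads(json.dumps(config))  # deep copy via a JSON round trip
    js_cfg = updated.setdefault("transformers.js_config", {})
    js_cfg.setdefault("use_external_data_format", {})[onnx_file] = num_chunks
    return updated


# Stand-in for the parsed contents of the model repo's config.json.
config = {"model_type": "nemotron_h"}
patched = add_external_data_config(config)
print(json.dumps(patched, indent=2))
```

In practice the dict would be loaded from and written back to the repo's `config.json` with `json.load` and `json.dump`.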
24
-
25
- ### 3. ShapeInferenceError on INT64 constants
26
- - **Symptom**: `Cannot parse data from external tensors` for INT64 constants
27
- - **Cause**: ONNX repacking with `size_threshold=0` moved ALL tensors external, including small constants that ORT needs inline
28
- - **Fix**: Repack with `size_threshold=1024` (tensors < 1KB stay inline)
29
-
30
- ### 4. ArrayBuffer allocation failed
31
- - **Symptom**: `RangeError: Array buffer allocation failed`
32
- - **Cause**: Merging split data files into a single 2.55 GB blob exceeded browser's ~2 GB ArrayBuffer limit
33
- - **Fix**: Keep original 2-file split (~2.09 GB + ~465 MB), each below 2 GiB (about 2.15 GB)
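A quick size check makes the limit in issue 4 concrete. This sketch assumes the browser limit is 2 GiB (2^31 bytes) and that the sizes quoted above are decimal gigabytes; both are assumptions, and the second shard name is hypothetical:

```python
ARRAY_BUFFER_LIMIT = 2**31  # assumed limit in bytes (2 GiB); actual browsers may differ


def oversized(files, limit=ARRAY_BUFFER_LIMIT):
    """Return the names of files whose size in bytes exceeds the limit."""
    return [name for name, size in files.items() if size > limit]


merged = {"model_q4.onnx_data": 2_550_000_000}
split = {
    "model_q4.onnx_data": 2_090_000_000,
    "model_q4.onnx_data_1": 465_000_000,  # hypothetical name for the second shard
}
print(oversized(merged))
print(oversized(split))
```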
34
-
35
- ### 5. Missing chat_template
36
- - **Symptom**: `Cannot use apply_chat_template() because tokenizer.chat_template is not set`
37
- - **Cause**: Our model repo lacked `chat_template` in `tokenizer_config.json`
38
- - **Fix**: Copy chat_template from reference model's tokenizer_config.json
39
-
40
- ### 6. Numeric gibberish output
41
- - **Symptom**: Model generates random numbers and symbols
42
- - **Cause**: Wrong tokenizer — our repo had a different `tokenizer.json` (17 MB vs reference 12.6 MB). Same vocab size but different byte-pair encoding → token IDs decoded to wrong text
43
- - **Fix**: Use reference model's `tokenizer.json` and `tokenizer_config.json` (same base model, same vocab)
44
-
45
- ### 7. Trailing newlines / infinite generation
46
- - **Symptom**: Model generates answer then infinite `\n` and `<|im_end|>` tokens
47
- - **Cause**: `generation_config.json` had `eos_token_id: 2` but was missing token 11 (`<|im_end|>`). Model generated end-of-turn but didn't stop
48
- - **Fix**: Set `eos_token_id: [2, 11]` in both generation_config.json and as runtime override in Space code
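The stop condition behind issue 7 can be sketched in a few lines. The helper names are hypothetical; the token IDs 2 and 11 are the ones named in the fix:

```python
def with_extra_eos(generation_config, extra_ids):
    """Return a copy whose eos_token_id is a list that also contains extra_ids."""
    current = generation_config.get("eos_token_id", [])
    ids = list(current) if isinstance(current, list) else [current]
    for token_id in extra_ids:
        if token_id not in ids:
            ids.append(token_id)
    return {**generation_config, "eos_token_id": ids}


def should_stop(token_id, generation_config):
    """True when the generated token is one of the configured end-of-sequence IDs."""
    return token_id in generation_config["eos_token_id"]


original = {"eos_token_id": 2}
fixed = with_extra_eos(original, [11])
print(fixed["eos_token_id"], should_stop(11, fixed))
```

With only ID 2 configured, token 11 never triggers the stop check, which matches the endless trailing output described in the symptom.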
49
-
50
- ### 8. "No response" with Reasoning On (THE BIG ONE)
51
- - **Symptom**: With `enable_thinking: true`, model outputs only `<|im_end|>` immediately (1 chunk, zero content). With `enable_thinking: false`, model works fine
52
- - **Root cause**: Our custom `quantize_to_matmulnbits()` re-quantized ALL 94 layers. The re-quantized Mamba layers had tiny precision differences from the reference's quantization. On WebGPU (float16 compute), these differences caused the model to output immediate EOS after the `<think>` token
53
- - **Why CPU worked**: CPU uses float32 for dequantization, which is more tolerant of the precision differences
54
- - **Fix**: LoRA-only patching — keep reference Q4 weights for non-LoRA layers (Mamba, embedding, lm_head), only re-quantize the 50 layers that LoRA actually changed (attention q/k/v/o_proj + MLP up/down_proj)
55
-
56
- ---
57
-
58
- ## Root Cause Analysis
59
-
60
- ### Why re-quantizing non-LoRA layers broke WebGPU
61
-
62
- The reference ONNX model was quantized by `onnx-community` using their official tooling. Our custom `quantize_to_matmulnbits()` function uses asymmetric uint4 quantization:
63
-
64
- ```python
65
- scales = (block_max - block_min) / 15.0
66
- zp = round(-block_min / scales)
67
- q = round(w / scales + zp).clip(0, 15)
68
- ```
69
-
70
- While mathematically correct, different implementations produce slightly different rounding for edge cases. The reference's quantizer may use a different rounding strategy, tie-breaking, or block boundary handling.
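The tie-breaking point can be illustrated in isolation. This is a sketch only: which rounding rule the reference quantizer actually uses is not known here, so both functions below are illustrative stand-ins for "two implementations that disagree on ties".

```python
import numpy as np


def round_half_even(x):
    """NumPy's default rounding: ties go to the nearest even integer."""
    return np.round(x)


def round_half_up(x):
    """A common alternative: ties always round up."""
    return np.floor(x + 0.5)


# Scaled weights that land exactly on a rounding tie.
scaled = np.array([0.5, 1.5, 2.5, 3.5])
codes_a = round_half_even(scaled).astype(np.uint8)  # [0, 2, 2, 4]
codes_b = round_half_up(scaled).astype(np.uint8)    # [1, 2, 3, 4]
mismatch = int((codes_a != codes_b).sum())          # 2 of 4 codes differ
```

Each differing code shifts the dequantized weight by one full quantization step, which is the kind of per-element difference the text above refers to.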
71
-
72
- On CPU (float32), these differences are negligible — the dequantized values are close enough. On WebGPU (float16 compute), the accumulated precision loss across 42 layers is enough to cause the model's internal state to diverge, particularly for rarely-exercised code paths like the `<think>` token processing.
73
-
74
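- **Key insight**: Quantization is not reproducible across implementations. Dequantizing the reference Q4 weights and re-quantizing them with a different quantizer does not return the original codes, scales, and zero points: `requantize(dequantize(reference_q4))` ≠ `reference_q4`, even though the underlying weights are unchanged. At float16 precision, the model's behavior depends on the specific rounding patterns the reference quantizer produced.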
75
-
76
- ---
77
-
78
- ## Conversion Guide: Nemotron-3 LoRA to ONNX Q4 for WebGPU
79
-
80
- ### Prerequisites
81
- - Fine-tuned merged model in safetensors format
82
- - Reference ONNX Q4 model from `onnx-community/NVIDIA-Nemotron-3-Nano-4B-BF16-ONNX`
83
- - Python with `onnx`, `safetensors`, `numpy`, `huggingface_hub`
84
- - Know which layers your LoRA modified (check `adapter_config.json` → `target_modules`)
85
-
86
- ### Step 1: Identify LoRA Target Layers
87
-
88
- ```python
89
- import json
90
- from huggingface_hub import hf_hub_download
91
-
92
- cfg = json.load(open(hf_hub_download('your-repo', 'adapter/adapter_config.json')))
93
- print(f"LoRA targets: {cfg['target_modules']}")
94
- # e.g., ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'up_proj', 'down_proj', 'gate_proj']
95
- ```
96
-
97
- ### Step 2: Download Reference Q4 Model
98
-
99
- ```python
- from huggingface_hub import snapshot_download
-
- snap = snapshot_download(
-     'onnx-community/NVIDIA-Nemotron-3-Nano-4B-BF16-ONNX',
-     allow_patterns='onnx/model_q4*'
- )
- ```
107
-
108
- ### Step 3: Copy Reference Files as Base
109
-
110
- ```python
- import shutil
- from pathlib import Path
-
- ONNX_Q4_BASE = Path(snap) / 'onnx'
- OUT_DIR = Path('/tmp/onnx-output/onnx')
- OUT_DIR.mkdir(parents=True, exist_ok=True)
-
- for f in ONNX_Q4_BASE.iterdir():
-     if 'model_q4' in f.name:
-         shutil.copy2(f, OUT_DIR / f.name)
- ```
122
-
123
- ### Step 4: Load Merged Safetensors
124
-
125
- ```python
- import torch
- from safetensors import safe_open
-
- st_tensors = {}
- with safe_open('models/merged/model.safetensors', framework='pt', device='cpu') as f:
-     for k in f.keys():
-         st_tensors[k] = f.get_tensor(k).float().numpy()
- ```
134
-
135
- ### Step 5: Name Mapping (ONNX ↔ Safetensors)
136
-
137
- Nemotron-H has a non-standard naming convention:
138
-
139
- | ONNX Name | Safetensors Name |
140
- |-----------|-----------------|
141
- | `model.layers.N.input_layernorm.weight` | `backbone.layers.N.norm.weight` |
142
- | `model.layers.N.mamba.{component}` | `backbone.layers.N.mixer.{component}` |
143
- | `model.layers.N.attn.{component}` | `backbone.layers.N.mixer.{component}` |
144
- | `model.layers.N.mlp.{component}` | `backbone.layers.N.mixer.{component}` |
145
- | `model.embed_tokens.weight` | `backbone.embedding.weight` |
146
- | `model.norm.weight` | `backbone.norm_f.weight` |
147
-
148
- For Q4 tensors, the ONNX names use underscores:
149
- - `model_layers_0_mamba_in_proj_MatMul_weight_quant` → `backbone.layers.0.mixer.in_proj.weight`
150
-
151
- ```python
- import re
-
- def map_float_name(n, st_names):
-     """Map ONNX float tensor name to safetensors name."""
-     if n.startswith('/') or any(x in n for x in [
-             'INT64','FLOAT','constants','expanded','unsqueezed',
-             'squeezed','neg_exp','f32','split_sizes']):
-         return None
-     m = n
-     m = m.replace('model.embed_tokens.weight', 'backbone.embedding.weight')
-     m = re.sub(r'^model\.norm\.weight$', 'backbone.norm_f.weight', m)
-     m = re.sub(r'^model\.lm_head', 'lm_head', m)
-     m = re.sub(r'^model\.layers\.(\d+)\.input_layernorm\.weight',
-                r'backbone.layers.\1.norm.weight', m)
-     m = re.sub(r'^model\.layers\.(\d+)\.pre_ff_layernorm\.weight',
-                r'backbone.layers.\1.norm2.weight', m)
-     m = m.replace('model.layers.', 'backbone.layers.')
-     m = m.replace('.mamba.', '.mixer.').replace('.attn.', '.mixer.').replace('.mlp.', '.mixer.')
-     m = re.sub(r'\.MatMul\.weight$', '.weight', m)
-     return m if m in st_names else None
-
- def map_q4_base_to_st(base_name, st_names):
-     """Map Q4 initializer base name to safetensors weight name."""
-     m = re.sub(r'_weight$', '', base_name)
-     if m == 'lm_head_MatMul':
-         return 'lm_head.weight'
-     if m == 'model_embed_tokens':
-         return 'backbone.embedding.weight'
-     match = re.match(r'model_layers_(\d+)_(mamba|attn|mlp)_(.+)_MatMul$', m)
-     if not match:
-         return None
-     layer, sub, comp = match.groups()
-     st_name = f'backbone.layers.{layer}.mixer.{comp}.weight'
-     return st_name if st_name in st_names else None
- ```
187
-
188
- ### Step 6: Quantization Function
189
-
190
- ```python
- import numpy as np
-
- def quantize_to_matmulnbits(weight_f32, N, K, block_size=32):
-     """Asymmetric uint4 block quantization matching ORT MatMulNBits format."""
-     w = weight_f32.astype(np.float32)
-     assert w.shape == (N, K)
-
-     # Pad K up to a whole number of blocks
-     n_blocks = (K + block_size - 1) // block_size
-     K_pad = n_blocks * block_size
-     if K_pad > K:
-         w = np.pad(w, ((0, 0), (0, K_pad - K)))
-
-     w_blocks = w.reshape(N, n_blocks, block_size)
-     block_min = w_blocks.min(axis=-1)
-     block_max = w_blocks.max(axis=-1)
-
-     # One scale per block; constant blocks get scale 1.0 to avoid division by zero
-     scales = (block_max - block_min) / 15.0
-     scales = np.where(scales == 0, 1.0, scales).astype(np.float32)
-
-     zp_float = -block_min / scales
-     zp = np.round(zp_float).clip(0, 15).astype(np.uint8)
-
-     q = np.round(w_blocks / scales[:, :, np.newaxis] + zp[:, :, np.newaxis])
-     q = q.clip(0, 15).astype(np.uint8)
-
-     # Pack two nibbles per byte (low nibble first)
-     q_pairs = q.reshape(N, n_blocks, block_size // 2, 2)
-     packed = (q_pairs[..., 0] | (q_pairs[..., 1] << 4)).astype(np.uint8)
-
-     # Pack zero points as nibbles
-     n_zp_pairs = (n_blocks + 1) // 2
-     zp_packed = np.zeros((N, n_zp_pairs), dtype=np.uint8)
-     for i in range(n_blocks):
-         byte_idx = i // 2
-         if i % 2 == 0:
-             zp_packed[:, byte_idx] |= zp[:, i]
-         else:
-             zp_packed[:, byte_idx] |= (zp[:, i] << 4)
-
-     return packed, scales, zp_packed
- ```
230
-
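To check a patched tensor against the merged weights, it can help to invert the packing from Step 6. This is a minimal sketch, assuming the same layout as `quantize_to_matmulnbits` above (low nibble first, one zero point per block); `dequantize_matmulnbits` is a hypothetical helper, not an ONNX Runtime function.

```python
import numpy as np


def dequantize_matmulnbits(packed, scales, zp_packed, K):
    """Invert the Step 6 packing: returns float32 weights of shape (N, K)."""
    N, n_blocks, half_bs = packed.shape
    # Unpack codes: low nibble is the first element of each pair.
    q = np.stack([packed & 0x0F, packed >> 4], axis=-1)
    q = q.reshape(N, n_blocks, half_bs * 2)
    # Unpack zero points the same way and drop the padding nibble, if any.
    zp = np.stack([zp_packed & 0x0F, zp_packed >> 4], axis=-1)
    zp = zp.reshape(N, -1)[:, :n_blocks]
    w = (q.astype(np.float32) - zp[:, :, None].astype(np.float32)) * scales[:, :, None]
    # Remove the columns that were added to pad K to a whole block.
    return w.reshape(N, -1)[:, :K]


# Hand-packed example: one row, one block of four codes [0, 5, 10, 15].
packed = np.array([[[0 | (5 << 4), 10 | (15 << 4)]]], dtype=np.uint8)
scales = np.array([[0.5]], dtype=np.float32)
zp_packed = np.array([[3]], dtype=np.uint8)
weights = dequantize_matmulnbits(packed, scales, zp_packed, K=4)
```

Comparing `weights` with the original float tensor gives a per-layer error figure that can be tracked before uploading.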
231
- ### Step 7: Patch Weights (LoRA Targets Only!)
232
-
233
- **CRITICAL**: Only re-quantize layers that LoRA modified. Keep reference weights for everything else.
234
-
235
- ```python
- import re
-
- import numpy as np
- import onnx
- from onnx.external_data_helper import ExternalDataInfo
-
- LORA_TARGETS = {'q_proj', 'k_proj', 'v_proj', 'o_proj', 'up_proj', 'down_proj', 'gate_proj'}
-
- model = onnx.load(str(OUT_DIR / 'model_q4.onnx'), load_external_data=False)
-
- # Get MatMulNBits attributes, keyed by the quantized-weight input name
- matmul_attrs = {}
- for node in model.graph.node:
-     if node.op_type == 'MatMulNBits' and len(node.input) >= 2:
-         d = {a.name: a.i for a in node.attribute}
-         matmul_attrs[node.input[1]] = (d.get('N', 0), d.get('K', 0), d.get('block_size', 32))
-
- # Patch float tensors (layernorms, conv1d, biases: always safe to patch)
- for init in model.graph.initializer:
-     if any(init.name.endswith(s) for s in ['_quant', '_scales', '_zp']):
-         continue
-     st_name = map_float_name(init.name, set(st_tensors.keys()))
-     if st_name is None:
-         continue
-     ext = ExternalDataInfo(init)
-     if not ext.location:
-         continue
-     our_arr = st_tensors[st_name]
-     onnx_dtype = {1: np.float32, 10: np.float16}.get(init.data_type, np.float32)
-     our_bytes = our_arr.astype(onnx_dtype).tobytes()
-     # Skip on size mismatch so neighbouring tensors are never overwritten
-     if ext.length and len(our_bytes) != ext.length:
-         continue
-     with open(OUT_DIR / ext.location, 'r+b') as f:
-         f.seek(ext.offset or 0)
-         f.write(our_bytes)
-
- # Patch Q4 tensors: ONLY LoRA targets
- q4_groups = {}
- for init in model.graph.initializer:
-     if any(init.name.endswith(s) for s in ['_quant', '_scales', '_zp']):
-         base = re.sub(r'_(quant|scales|zp)$', '', init.name)
-         q4_groups.setdefault(base, {})[init.name[len(base)+1:]] = init
-
- for base_name, group in q4_groups.items():
-     if 'quant' not in group:
-         continue
-
-     # SKIP non-LoRA layers: keep reference weights!
-     if not any(target in base_name for target in LORA_TARGETS):
-         continue
-
-     st_name = map_q4_base_to_st(base_name, set(st_tensors.keys()))
-     if st_name is None:
-         continue
-
-     quant_init = group['quant']
-     N, K, bs = matmul_attrs.get(quant_init.name, (0, 0, 32))
-     if N == 0:
-         # Fall back to the initializer dims: (N, n_blocks, block_size / 2)
-         q_dims = list(quant_init.dims)
-         if len(q_dims) == 3:
-             N, n_b, half_bs = q_dims
-             bs = half_bs * 2
-             K = n_b * bs
-
-     weight = st_tensors[st_name]
-     if weight.shape == (K, N):
-         weight = weight.T
-
-     packed, scales, zp_packed = quantize_to_matmulnbits(weight, N, K, bs)
-
-     # Write quant, scales, zp in place (only when byte lengths match)
-     for suffix, data in [('quant', packed), ('scales', scales), ('zp', zp_packed)]:
-         if suffix not in group:
-             continue
-         ext = ExternalDataInfo(group[suffix])
-         data_bytes = data.tobytes()
-         if ext.length and len(data_bytes) == ext.length:
-             with open(OUT_DIR / ext.location, 'r+b') as f:
-                 f.seek(ext.offset or 0)
-                 f.write(data_bytes)
- ```
314
-
315
- ### Step 8: Upload to HuggingFace
316
-
317
- ```python
- from huggingface_hub import HfApi
-
- api = HfApi()
- api.upload_folder(
-     folder_path=str(OUT_DIR),
-     path_in_repo="onnx",
-     repo_id="your-repo",
-     repo_type="model",
-     commit_message="LoRA-only Q4 patch from reference base"
- )
- ```
329
-
330
- ### Step 9: Set Model Config
331
-
332
- Ensure these files exist in your HF repo:
333
-
334
- **config.json** — must include:
335
- ```json
336
- {
337
- "transformers.js_config": {
338
- "use_external_data_format": {
339
- "model_q4.onnx": 2
340
- }
341
- }
342
- }
343
- ```
344
-
345
- **generation_config.json** — must include:
346
- ```json
347
- {
348
- "eos_token_id": [2, 11]
349
- }
350
- ```
351
-
352
- **tokenizer.json** + **tokenizer_config.json** — use the reference model's tokenizer (same vocab, includes `chat_template`).
353
-
354
- ### Step 10: Test with WebGPU Test Page
355
-
356
- Create a standalone HTML test page (see `dist/test-webgpu.html` in the Space repo) that:
357
- 1. Loads the model with `pipeline('text-generation', modelId, { dtype: 'q4', device: 'webgpu' })`
358
- 2. Tests both `enable_thinking: true` and `false`
359
- 3. Checks for `</think>` in output
360
- 4. Reports chunk count and content
361
-
362
- ---
363
-
364
- ## Key Lessons
365
-
366
- ### 1. Never re-quantize unchanged layers
367
- If LoRA only touches attention/MLP projections, keep the reference's Q4 weights for Mamba, embedding, and other untouched layers. Re-quantization introduces precision differences that break WebGPU inference.
368
-
369
- ### 2. WebGPU ≠ CPU for quantized models
370
- Float16 compute on WebGPU amplifies tiny quantization differences. Always test on actual WebGPU hardware, not just CPU/WASM.
371
-
372
- ### 3. The reference model is your ground truth
373
- Start from the reference ONNX and make minimal changes. Compare behavior at every step.
374
-
375
- ### 4. Build a WebGPU test harness early
376
- A standalone HTML test page that runs `enable_thinking: true/false` with both models saves hours of manual testing.
377
-
378
- ### 5. Config files matter
379
- Missing `transformers.js_config`, wrong `eos_token_id`, wrong `tokenizer.json` — each caused distinct failures. Use the reference model's configs as a template and only change what's necessary.
380
-
381
- ### 6. Browser cache is sticky
382
- `env.useBrowserCache = true` caches model files aggressively. When debugging, clear Cache Storage (not just regular cache) or use incognito mode.
383
-
384
- ---
385
-
386
- ## File Inventory
387
-
388
- | File | Source | Notes |
389
- |------|--------|-------|
390
- | `onnx/model_q4.onnx` | Reference | Graph structure (unchanged) |
391
- | `onnx/model_q4.onnx_data` | Mixed | Reference base + LoRA patches |
392
- | `onnx/model_q4.onnx_data_1` | Mixed | Reference base + LoRA patches |
393
- | `config.json` | Modified | Added `transformers.js_config` |
394
- | `generation_config.json` | Modified | `eos_token_id: [2, 11]` |
395
- | `tokenizer.json` | Reference | Must match ONNX vocab |
396
- | `tokenizer_config.json` | Reference | Includes `chat_template` |
397
-
398
- ## Scripts
399
-
400
- | Script | Purpose |
401
- |--------|---------|
402
- | `scripts/patch_q4_inplace.py` | Original (BROKEN) — re-quantizes all layers |
403
- | `scripts/patch_q4_loraonly.py` | Fixed — only patches LoRA target layers |
404
- | `dist/test-webgpu.html` | WebGPU test harness for both models |
docs/OPTION2_SFT_DISTILLATION_PLAN.md DELETED
@@ -1,193 +0,0 @@
1
- # Option 2: SFT Distillation Plan
2
-
3
- > Date: 2026-03-24
4
- > Status: Planning
5
-
6
- ## Overview
7
-
8
- Use Nemotron-4B as a **teacher** (via llama.cpp) to generate high-quality interviewer data. Train a **student** model (standard transformer, no Mamba-2) on that data via SFT. Deploy the student on GB10.
9
-
10
- ## Why This Works
11
-
12
- | Component | Tool | SM 12.1 | Status |
13
- |-----------|------|---------|--------|
14
- | Teacher generation | llama.cpp | ✅ | Proven — 4.35/5 eval score |
15
- | Reward filtering | Python (CPU) | ✅ | No GPU needed |
16
- | Student training | PyTorch + HF | ✅ | Standard transformer, no custom kernels |
17
- | Student inference | llama.cpp or vLLM | ✅ | Both work on GB10 |
18
-
19
- No mocks. No CUDA kernel workarounds. Every component is production-ready on GB10.
20
-
21
- ## Pipeline
22
-
23
- ```
24
- Phase 1: Data Generation
25
- Nemotron-4B (llama.cpp, BF16 GGUF)
26
- + Interview prompts from dataset (7,580 segments)
27
- + System prompt: "You are Lex Fridman, an AI interviewer"
28
- → Generate 5-10 completions per prompt
29
- → ~50K raw completions
30
-
31
- Phase 2: Filtering
32
- Score each completion with reward function
33
- + Question quality (asks insightful questions?)
34
- + Brevity (concise, not lecturing?)
35
- + Relevance (follows conversation?)
36
- + Style (sounds like an interviewer?)
37
- → Keep completions scoring ≥ 4/5
38
- → Target: 5K-10K high-quality examples
39
-
40
- Phase 3: Student Training
41
- Pick student model (see model selection below)
42
- SFT on filtered dataset
43
- → Standard HF Trainer, no special kernels needed
44
-
45
- Phase 4: Evaluation
46
- Run eval suite against student model
47
- Compare with Nemotron-4B base (4.35/5 target)
48
- Iterate on data quality / student model
49
- ```
50
-
51
- ## Model Selection for Student
52
-
53
- ### Criteria
54
- - Standard transformer architecture (no Mamba, no custom CUDA)
55
- - ~4B parameters (fits on GB10 with room for training)
56
- - Good instruction-following base
57
- - Works with PyTorch on SM 12.1
58
-
59
- ### Candidates
60
-
61
- | Model | Params | Arch | GRPO-trainable on GB10? | Notes |
62
- |-------|--------|------|------------------------|-------|
63
- | Qwen3.5-4B | 4B | Transformer | ✅ Yes (tested vLLM + PyTorch) | MoE routing sensitive to quant |
64
- | Qwen3.5-3B-A0.6B | 3B (0.6B active) | MoE | ✅ Yes | Very fast inference |
65
- | Llama-3.2-3B | 3B | Transformer | ✅ Yes | Proven architecture |
66
- | Gemma-3-4B | 4B | Transformer | ✅ Yes | Strong multilingual |
67
- | Phi-4-mini (3.8B) | 3.8B | Transformer | ✅ Yes | Good reasoning |
68
-
69
- **Recommendation**: Start with **Qwen3.5-4B** — same size as Nemotron, standard transformer, already tested on our eval suite (scored 3.55/5 at Q8, potential to improve with targeted SFT).
70
-
71
- ### Why Not Just SFT Nemotron?
72
- We already tried SFT on Nemotron-4B directly (v1-v5). Best score: 3.20/5. The Mamba-2 architecture made LoRA ineffective (only 4 attention layers got meaningful updates). SFT distillation to a standard transformer lets LoRA/full-finetune work on ALL layers.
73
-
74
- ## Data Generation Details
75
-
76
- ### System Prompt
77
- ```
78
- You are an expert AI interviewer in the style of Lex Fridman. You ask
79
- thoughtful, probing questions that explore deep ideas. Your questions are:
80
- - Concise (under 50 words)
81
- - Open-ended (encourage the guest to think deeply)
82
- - Build on what the guest just said
83
- - Occasionally surprising or from unexpected angles
84
- Do not lecture. Do not summarize. Just ask the next question.
85
- ```
86
-
87
- ### Prompt Format
88
- Each prompt is a conversation turn from our interview dataset:
89
- ```
90
- Guest: [previous guest response from interview_segments_v2.jsonl]
91
- Interviewer:
92
- ```
93
-
94
- ### Generation Parameters
95
- ```yaml
96
- temperature: 0.8 # some diversity
97
- top_p: 0.95
98
- max_tokens: 200
99
- n_per_prompt: 8 # 8 completions per prompt for diversity
100
- ```
101
-
102
- ### Reward Function (for filtering)
103
- ```python
- def score_completion(prompt, completion):
-     score = 0
-
-     # Must ask a question
-     if "?" in completion: score += 1
-
-     # Brevity (under 30 words is best, under 50 is good)
-     words = len(completion.split())
-     if words < 30: score += 1.5
-     elif words < 50: score += 1.0
-     elif words < 80: score += 0.5
-     elif words > 150: score -= 1.0
-
-     # No template/meta patterns
-     if not any(p in completion.lower() for p in
-                ["用户问", "user asks", "the user", "question:", "as an ai"]):
-         score += 0.5
-
-     # Ends with a question
-     if completion.strip().endswith("?"):
-         score += 1.0
-
-     # At most two question marks (not a list of questions)
-     if completion.count("?") <= 2:
-         score += 0.5
-
-     # Doesn't start with a filler opener (trailing punctuation ignored, so "Sure," matches)
-     first_word = completion.strip().split()[0].lower().strip(",.!:") if completion.strip() else ""
-     if first_word not in ["sure", "great", "absolutely", "definitely", "certainly"]:
-         score += 0.5
-
-     return min(5.0, max(0.0, score))
- ```
137
-
138
- **Keep threshold: score ≥ 3.5** (top ~30-40% of completions)
139
-
140
- ## Training Details
141
-
142
- ### SFT Configuration
143
- ```yaml
144
- model: Qwen/Qwen3.5-4B # or chosen student
145
- epochs: 3
146
- batch_size: 4
147
- learning_rate: 2e-5
148
- max_length: 512
149
- warmup_steps: 100
150
- weight_decay: 0.01
151
- lora:
152
-   rank: 128
-   alpha: 256
-   target: all-linear
155
- ```
156
-
157
- ### Estimated Resources
158
- - Data generation: ~50K completions × 200 tokens ÷ ~30 tok/s ≈ 93 hours (can parallelize)
159
- - OR: reduce to 10K completions = ~18 hours
160
- - Filtering: minutes (CPU)
161
- - SFT training: ~2-4 hours on GB10 (proven from prior runs)
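The generation-time estimates follow from dividing total tokens by throughput. A small sketch of that arithmetic, assuming the 200-token and ~30 tok/s figures above hold for every completion (the function name is illustrative):

```python
def generation_hours(n_completions, tokens_per_completion=200, tokens_per_second=30.0):
    """Wall-clock hours for sequential generation on a single stream."""
    total_tokens = n_completions * tokens_per_completion
    return total_tokens / tokens_per_second / 3600.0


full = generation_hours(50_000)    # about 92.6 hours
reduced = generation_hours(10_000)  # about 18.5 hours
pilot = generation_hours(500 * 8)   # about 7.4 hours
```

These are upper-end figures, since `max_tokens: 200` is a cap and short interviewer questions will usually stop earlier.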
162
-
163
- ### Speedup: Reduce Generation Scope
164
- Instead of all 7,580 prompts × 8 generations = 60K:
165
- - **Phase 1**: 500 prompts × 8 = 4,000 completions (~7 hours)
166
- - Filter → ~1,500 high-quality examples
167
- - SFT → evaluate
168
- - **Phase 2**: If promising, scale to full dataset
169
-
170
- ## Success Criteria
171
-
172
- - Student model eval score ≥ 4.0/5 (approaching Nemotron base's 4.35)
173
- - Completion quality: concise questions, no lecturing
174
- - Consistent interviewer persona
175
-
176
- ## Timeline
177
-
178
- | Phase | Time | Output |
179
- |-------|------|--------|
180
- | Generate 4K completions | ~7 hours | raw completions |
181
- | Filter + format | ~30 min | SFT dataset |
182
- | SFT training | ~3 hours | trained student |
183
- | Evaluation | ~1 hour | eval scores |
184
- | **Total Phase 1** | **~12 hours** | **first result** |
185
-
186
- ## Risks
187
-
188
- | Risk | Mitigation |
189
- |------|------------|
190
- | Student can't match teacher quality | Try multiple student models; increase data |
191
- | Teacher generates repetitive data | High temperature + diverse prompts |
192
- | Reward function too noisy | Manual review of 100 samples first |
193
- | SFT overfits to teacher quirks | Early stopping, validation split |
docs/RETROSPECTIVE_2026-03-31.md DELETED
@@ -1,168 +0,0 @@
1
- # Lex Fridman Interviewer — Data-Driven Retrospective
2
- *2026-03-31 | All experiments evaluated on held-out eval (n=25 prompts)*
3
-
4
- ---
5
-
6
- ## Final Leaderboard (functional judge: mean of on_topic, uses_guest, probing)
7
-
8
- | Rank | Model | Score | Δbase | on_topic | uses_guest | probing | Words |
9
- |------|-------|-------|-------|----------|------------|---------|-------|
10
- | 🥇 | **GRPO v12** (LR=5e-6, cosine, 200 steps) | **0.760** | +0.107 | 72% | 60% | 96% | 15.2 |
11
- | 2 | GRPO v13 (LR=2e-5, constant, 300 steps) | 0.773 | +0.120 | 88% | 52% | 92% | 16.4 |
12
- | 3 | LoRA v1 (r=64, LR=2e-4, 1ep) | 0.733 | +0.080 | 72% | 56% | 92% | 14.8 |
13
- | 4 | GRPO v14 (LR=1e-5, constant, 300 steps) | 0.707 | +0.054 | 68% | 52% | 92% | 15.5 |
14
- | 5 | Base Nemotron 4B | 0.653 | — | 64% | 48% | 84% | 14.5 |
15
- | 6 | SFT v5 (LoRA r≈16, 3ep) | 0.667 | +0.014 | 76% | 60% | 64% | 15.2 |
16
-
17
- **Best balanced model: GRPO v12 (0.760)** — highest uses_guest (60%) + probing (96%).
18
- GRPO v13 has higher raw score (0.773) via on_topic (+16pp) but uses_guest regressed.
19
-
20
- ---
21
-
22
- ## Math: Score Decomposition
23
-
24
- The functional score is a per-prompt mean of three binary judgments:
25
- ```
26
- score_i = (on_topic_i + uses_guest_i + probing_i) / 3 ∈ {0, 1/3, 2/3, 1}
27
- Score = mean(score_i) ∈ [0, 1]
28
- ```
29
-
30
- By linearity, `Score = (P(OT) + P(UG) + P(PR)) / 3` whether or not the dimensions are independent (Base: (0.64 + 0.48 + 0.84) / 3 = 0.653).
31
- The product `P(OT) × P(UG) × P(PR)` is the all-three-correct rate expected under independence; the table compares the two:
32
-
33
- | Model | Observed | P(OT)×P(UG)×P(PR) | Ratio | Mean-to-product gap |
34
- |-------|----------|-------------------|-------|---------|
35
- | Base | 0.653 | 0.258 | **2.53** | Largest |
36
- | SFT v5 | 0.667 | 0.292 | **2.28** | Large |
37
- | LoRA v1 | 0.733 | 0.371 | **1.98** | Moderate |
38
- | GRPO v12 | 0.760 | 0.415 | **1.83** | Smallest |
39
- | GRPO v13 | 0.773 | 0.421 | **1.84** | Smallest |
40
- | GRPO v14 | 0.707 | 0.325 | **2.17** | Moderate |
41
-
42
- **Insight**: The ratio falls as the marginal rates rise, because the product is more
43
- sensitive to a weak dimension than the mean. It summarizes the marginals and does not
44
- measure per-prompt co-occurrence, which would require the joint counts. Since the score is a
45
- mean of the marginal rates, further gains must come from improving the marginal rates themselves.
46
-
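The leaderboard scores coincide with the plain mean of the three marginal rates, which the sketch below checks against the table. The rates are copied from the leaderboard; the function and variable names are illustrative.

```python
# (on_topic, uses_guest, probing) rates from the leaderboard
MARGINALS = {
    "Base": (0.64, 0.48, 0.84),
    "SFT v5": (0.76, 0.60, 0.64),
    "LoRA v1": (0.72, 0.56, 0.92),
    "GRPO v12": (0.72, 0.60, 0.96),
    "GRPO v13": (0.88, 0.52, 0.92),
    "GRPO v14": (0.68, 0.52, 0.92),
}


def mean_of_marginals(ot, ug, pr):
    """Expected per-prompt score when each dimension earns 1/3."""
    return (ot + ug + pr) / 3


def product_of_marginals(ot, ug, pr):
    """All-three-correct rate that independence would predict."""
    return ot * ug * pr


rows = {}
for name, rates in MARGINALS.items():
    mean = mean_of_marginals(*rates)
    prod = product_of_marginals(*rates)
    rows[name] = (round(mean, 3), round(prod, 3), round(mean / prod, 2))
```

Every `rows` entry reproduces the Observed, product, and Ratio columns, so the ratio is fully determined by the three marginal rates.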
47
- ---
48
-
49
- ## Math: Conditional Probabilities
50
-
51
- **Key structural finding:**
52
- ```
53
- P(UG=T | OT=T) >> P(UG=T | OT=F) for ALL models
54
- ```
55
-
56
- | Model | P(UG \| OT) | P(UG \| ¬OT) | Lift | UG baseline |
57
- |-------|-----------|------------|------|-------------|
58
- | Base | 68.75% | 11.11% | 1.43× | 48% |
59
- | LoRA v1 | 72.22% | 14.29% | 1.29× | 56% |
60
- | GRPO v12 | **77.78%** | 14.29% | 1.30× | 60% |
61
- | GRPO v13 | 59.09% | 0.00% | 1.14× | 52% |
62
- | GRPO v14 | 76.47% | 0.00% | 1.47× | 52% |
63
-
64
- **The OT→UG transition is effectively the only path.** UG=True with OT=False occurs on at most one prompt per model, and never for GRPO v13 or v14.
65
- This means: improving uses_guest requires first getting on-topic, then adding specificity.
66
- Training that gains OT (v13) but loses UG has taught the model to be on-topic generically.
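The conditional rates and lift in the table can be recomputed from prompt counts. In this sketch the Base counts (16 on-topic, 12 uses-guest, 11 both, out of 25) are reconstructed from the reported percentages rather than taken from raw judge output, so treat them as an assumption; the function name is illustrative.

```python
def conditional_rates(n_total, n_ot, n_ug, n_joint):
    """Return P(UG | OT), P(UG | not OT), and lift over the UG baseline."""
    p_ug_given_ot = n_joint / n_ot
    n_not_ot = n_total - n_ot
    # UG hits that happened on off-topic prompts
    p_ug_given_not_ot = (n_ug - n_joint) / n_not_ot if n_not_ot else float("nan")
    lift = p_ug_given_ot / (n_ug / n_total)
    return p_ug_given_ot, p_ug_given_not_ot, lift


# Base model: 64% on-topic, 48% uses-guest, 11 prompts with both
base = conditional_rates(25, n_ot=16, n_ug=12, n_joint=11)
```

With only 25 prompts, each conditional rate moves by several points per prompt, so small differences between models in this table are within noise.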
67
-
68
- ---
69
-
70
- ## GRPO Training Signal Analysis
71
-
72
- ### Formula: GRPO gradient signal strength
73
-
74
- ```
75
- signal_proxy = LR × mean(reward_std)
76
-
77
- GRPO advantage for completion i:
78
- A_i = (r_i - mean(r_j)) / std(r_j)
79
-
80
- Policy gradient update:
81
- ΔW ∝ LR × Σ A_i × ∇log P(completion_i | prompt)
82
- ```
83
-
84
- | Run | LR | mean(reward_std) | signal_proxy | reward Δ | eval Δ | efficiency |
85
- |-----|-----|-----------------|--------------|----------|--------|------------|
86
- | GRPO v12 | 5e-6 | 0.158 | **7.9e-7** | +0.113 | **+0.027** | 0.24 |
87
- | GRPO v13 | 2e-5 | 0.188 | **3.8e-6** | +0.088 | +0.013 | 0.15 |
88
- | GRPO v14 | 1e-5 | 0.173 | **1.7e-6** | +0.092 | −0.053 | **−0.58** |
89
-
90
- **Signal-to-eval efficiency (eval Δ / reward Δ)** reveals how well reward improvements
91
- translate to actual quality gains:
92
- - v12: 0.24 — best efficiency despite lowest signal strength
93
- - v13: 0.15 — ~5× stronger signal, but less efficient (LR overshot uses_guest)
94
- - v14: −0.58 — reward improved but eval degraded (reward_v13 misaligned)
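The quantities in this section are simple to recompute. A minimal sketch, assuming population standard deviation within a rollout group and a small `eps` to guard against identical rewards (both are assumptions, not details confirmed by the training code); function names are illustrative.

```python
import numpy as np


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: center and scale rewards within one prompt's group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)


def signal_proxy(lr, mean_reward_std):
    """Rough gradient-signal proxy used in the table: LR x mean(reward_std)."""
    return lr * mean_reward_std


def efficiency(eval_delta, reward_delta):
    """How much of the training-reward gain shows up on the held-out eval."""
    return eval_delta / reward_delta


adv = group_advantages([1.0, 2.0, 3.0])
v12 = (signal_proxy(5e-6, 0.158), efficiency(0.027, 0.113))
v14 = (signal_proxy(1e-5, 0.173), efficiency(-0.053, 0.092))
```

A negative efficiency, as for v14, means the training reward rose while the held-out score fell.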
95
-
96
- ### v14 paradox: entropy ↑ but eval ↓
97
-
98
- GRPO v14 showed the largest entropy increase (+0.72 nats, +46%) and KL decrease (−0.69).
99
- This combination means: model explored more diverse phrasings AND moved back toward the
100
- reference distribution. Yet eval dropped from 0.760 → 0.707.
101
-
102
- **Diagnosis**: reward_v13 (geometric mean OT×UG×PR) was harder to optimize than reward_v12.
103
- The geometric mean penalizes ANY weak dimension equally — if a rollout has OT=high but
104
- UG=medium and PR=high, the geometric mean pulls the reward down more than v12's
105
- asymmetric formula would. The model's entropy increased (exploring) but the exploration
106
- wasn't converting to better joint OT∧UG outcomes.
107
-
108
- ---
109
-
110
- ## Per-Prompt Analysis: Hard Ceiling
111
-
112
- Across all 6 models on 25 prompts:
113
- - **18/25 prompts (72%)** have been answered with OT∧UG=True by at least one model
114
- - **7/25 prompts (28%)** are structurally hard — no model has ever achieved OT∧UG=True
115
-
116
- Hard prompts share a pattern: vague/content-light guest statements where there's
117
- insufficient specific content to reference ("weasel and furry coats", "no wars when
118
- I was president", "that's right exactly right").
119
-
120
- **Theoretical ceiling**: If we achieve OT∧UG=True on all 18 reachable prompts with
121
- probing=92%, the max score is: `18/25 × 0.92 + 7/25 × 0 ≈ 0.662`. Yet GRPO v12
122
- achieves 0.760, which is above this. The discrepancy comes from partial credit: the score is a
122
- per-prompt mean, so probing-only wins (OT=F, UG=F, PR=T) still contribute 1/3 each.
124
-
125
- **Corrected ceiling** (with probing-only partial credit):
126
- ```
127
- max_score ≈ P(all 3) × 1.0 + P(exactly 2) × (2/3) + P(exactly 1) × (1/3)
128
- ```
129
- GRPO v12 is near-optimal for its current marginal rates.
130
-
131
- ---
132
-
133
- ## Why uses_guest Plateaus at 52-60%
134
-
135
- The OT∧UG joint rate has been stuck at 13-14/25 across all GRPO runs:
136
-
137
- | Model | OT∧UG joint | Gained (vs base) | Lost (vs base) |
138
- |-------|------------|-----------------|----------------|
139
- | LoRA v1 | 13/25 | +5 prompts | −3 prompts |
140
- | GRPO v12 | **14/25** | +5 prompts | −2 prompts |
141
- | GRPO v13 | 13/25 | +4 prompts | −2 prompts |
142
- | GRPO v14 | 13/25 | +5 prompts | −3 prompts |
143
-
144
- All four runs gain 4-5 prompts and lose 2-3 relative to base, so the **net is always +2 to +3**.
145
- This is a hard architectural constraint: the 4 attention layers (LoRA targets, 1% of params)
146
- can shift the marginal distribution by ~1 prompt worth of improvement per run.
147
-
148
- The 38 frozen Mamba layers hold the deeper "interview style" patterns. To break past
149
- 14/25 OT∧UG, we would need either:
150
- 1. Full fine-tuning (all 38 Mamba layers + 4 attention)
151
- 2. More LoRA capacity (r=256 or above)
152
- 3. A fundamentally different base model
153
-
154
- ---
155
-
156
- ## Conclusion
157
-
158
- **Best model for deployment: GRPO v12 (score=0.760)**
159
-
160
- The progression LoRA v1 → GRPO v12 → (plateau) tells a clear story:
161
- - LoRA v1 taught the style (SFT signal)
162
- - GRPO v12 refined the balance (RL signal, low LR, let reward accumulate)
163
- - Further GRPO iterations hit the 1% LoRA architectural ceiling
164
-
165
- The reward function is not the limiting factor. The training signal is not the limiting
166
- factor. The **LoRA rank and frozen Mamba architecture** are.
167
-
168
- To improve beyond 0.760 would require r=256+ LoRA or full fine-tuning — a different experiment.
docs/REWARD_V10_DESIGN.md DELETED
@@ -1,175 +0,0 @@
1
- # Reward v10 Design Doc — Log-Ratio + NLI Hybrid
2
-
3
- **Status:** Design phase — experiments pending
4
- **Goal:** Verifiable, ungameable reward for Lex Fridman interviewer style transfer
5
- **Decision threshold:** Signal strength test must pass before v9 launch
6
-
7
- ---
8
-
9
- ## Architecture
10
-
11
- ```
12
- reward(q, guest) = hard_gates(q)
13
- × (1.0 + r_depth + r_lex + r_brevity + r_specificity)
14
-
15
- Where:
16
- r_depth ← NLI entailment score (0–1.5)
17
- r_lex ← SFT log-ratio (0–1.5, clipped)
18
- r_brevity ← word count score (0–0.5)
19
- r_specificity ← entity overlap with guest (0–0.5)
20
- Total range ← 0.0 (gated) or 1.0–5.0
21
- ```
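The composition above can be written as a small function. This is a sketch under the stated ranges only; the clipping of each component to its range is an assumption added for safety, and the names are illustrative.

```python
def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def reward(gates_pass, r_depth, r_lex, r_brevity, r_specificity):
    """Gated additive reward: 0.0 if any hard gate fires, else 1.0 to 5.0."""
    if not gates_pass:
        return 0.0
    return (
        1.0
        + clip(r_depth, 0.0, 1.5)
        + clip(r_lex, 0.0, 1.5)
        + clip(r_brevity, 0.0, 0.5)
        + clip(r_specificity, 0.0, 0.5)
    )
```

The constant 1.0 keeps every ungated completion strictly above a gated one, so the gates dominate any combination of component scores.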
22
-
23
- ---
24
-
25
- ## Component 1: Hard Gates (binary, pre-filter)
26
-
27
- Fast regex checks. Instant 0.0 if ANY fires.
28
-
29
- | Gate | Pattern | Rationale |
30
- |------|---------|-----------|
31
- | Stage directions | `^\*\(`, `^\[Lex` | Model never generates these in real interviews |
32
- | Meta-commentary | `as lex fridman`, `as an interviewer` | Identity collapse |
33
- | Filler openers | `that's fascinating`, `great point` | Lex never starts with filler |
34
- | Ultra-generic | `what are your thoughts on that` | Not specific to guest |
35
- | Not a question | doesn't end with `?` | Lex always asks |
36
- | Too many `?` | count > 2 | Signals multiple weak questions |
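The gate table translates directly into a pre-filter. A minimal sketch, assuming case-insensitive substring matching for the phrase gates and prefix matching for filler openers; those matching rules and all names below are assumptions, since the table only lists the patterns.

```python
import re

START_PATTERNS = [r"^\*\(", r"^\[Lex"]           # stage directions
META_PHRASES = ["as lex fridman", "as an interviewer"]
FILLER_OPENERS = ["that's fascinating", "great point"]
GENERIC_PHRASES = ["what are your thoughts on that"]


def passes_hard_gates(question):
    """Return False if any gate fires; the caller then assigns reward 0.0."""
    text = question.strip()
    lowered = text.lower()
    if any(re.search(p, text) for p in START_PATTERNS):
        return False
    if any(p in lowered for p in META_PHRASES):
        return False
    if any(lowered.startswith(p) for p in FILLER_OPENERS):
        return False
    if any(p in lowered for p in GENERIC_PHRASES):
        return False
    if not text.endswith("?"):
        return False
    if text.count("?") > 2:
        return False
    return True
```

Running this over a sample of real interview questions gives the false positive rate that the test below the table asks for.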
37
-
38
- **Assumption:** These patterns are unambiguous and cannot be worked around while still producing good questions.
39
-
40
- **Test:** Run gates on 200 random Lex real questions → expect <1% false positive rate.
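The gate table above can be sketched as a single predicate. This is a minimal illustration, not the project's actual gate code: the patterns are copied from the table, while anchoring the filler openers to the start of the question and the case-insensitive matching are assumptions.

```python
import re

# Patterns taken from the gate table; anchoring and flags are assumptions.
_GATE_PATTERNS = [
    re.compile(r"^\*\("),                                     # stage direction, e.g. "*(leans in)"
    re.compile(r"^\[Lex"),                                    # stage direction, e.g. "[Lex nods]"
    re.compile(r"as lex fridman", re.I),                      # meta-commentary
    re.compile(r"as an interviewer", re.I),                   # meta-commentary
    re.compile(r"^(that's fascinating|great point)", re.I),   # filler openers
    re.compile(r"what are your thoughts on that", re.I),      # ultra-generic
]


def passes_hard_gates(q: str) -> bool:
    """Return False if any gate fires, True otherwise."""
    text = q.strip()
    if any(p.search(text) for p in _GATE_PATTERNS):
        return False
    if not text.endswith("?"):       # not a question
        return False
    if text.count("?") > 2:          # too many question marks
        return False
    return True
```

Running this predicate over the 200 real Lex questions and counting `False` results would give the false positive rate the test asks for.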
41
-
42
- ---
43
-
44
- ## Component 2: NLI Depth (r_depth, 0–1.5)
45
-
46
- Model: `cross-encoder/nli-deberta-v3-small` (frozen, 86MB)
47
-
48
- ```
49
- premise = guest_answer
50
- hypothesis = candidate_question
51
-
52
- entailment → r_depth = max(0, 0.5 - conf × 0.8) # shallow restate
53
- neutral → r_depth = 0.5 + conf × 0.5 # goes beyond
54
- contradiction → r_depth = 0.8 # strong contrast
55
- ```
56
-
57
- **Assumption:** A question that is NOT entailed by the guest statement requires genuinely new thinking.
58
-
59
- **Weakness:** Neutral class is broad — a random unrelated question also scores neutral.
60
- **Mitigation:** Specificity score (below) penalizes unrelated questions.
61
-
62
- **Test:** Score 50 real Lex questions vs 50 generic questions. Expected: real Lex mean > generic mean by >0.3.
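The label-to-score mapping above can be written as a pure function, kept separate from the NLI model call. A minimal sketch, assuming the classifier returns a label string and a confidence in [0, 1]; the function name is hypothetical.

```python
def depth_score(label: str, conf: float) -> float:
    """Map an NLI label and its confidence (0..1) to r_depth."""
    if label == "entailment":
        # Confident entailment means the question only restates the guest.
        return max(0.0, 0.5 - conf * 0.8)
    if label == "neutral":
        # Neutral means the question goes beyond the guest statement.
        return 0.5 + conf * 0.5
    if label == "contradiction":
        # Flat score for a strong contrast.
        return 0.8
    raise ValueError(f"unknown NLI label: {label!r}")
```

As written, the mapping tops out at 1.0 (neutral with full confidence), below the 1.5 upper bound in the section heading.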
63
-
64
- ---
65
-
66
- ## Component 3: SFT Log-Ratio (r_lex, 0–1.5)
67
-
68
- **What it is:**
69
- Train a small SFT model on (guest_context → Lex_question) pairs for 1 epoch.
70
- Then: `r_lex = clip(sft_lp(q|ctx) - base_lp(q|ctx), 0, 1.5)`
71
-
72
- **Why log-ratio vs raw SFT logprob:**
73
- Raw logprob favors short/common sequences regardless of quality.
74
- Log-ratio normalizes by the base model → measures "Lex-specific" signal.
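The clipping step is small enough to show directly. A minimal sketch, assuming the two log-probabilities have already been computed by the SFT and base models for the same question and context; whether they are sequence sums or per-token means is not specified above, so the function takes plain floats.

```python
def r_lex(sft_logprob: float, base_logprob: float, cap: float = 1.5) -> float:
    """Clipped log-ratio: positive only when the SFT model prefers the question."""
    return min(max(sft_logprob - base_logprob, 0.0), cap)
```

If sequence-level sums are used, a gap of more than 1.5 nats saturates the cap, so a per-token average may discriminate better. That choice is left open here.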
75
-
76
- **Assumptions:**
77
- 1. SFT model learns a distribution shift toward Lex style
78
- 2. `sft_lp - base_lp > 0` reliably for real Lex questions
79
- 3. The signal is large enough to be useful (>0.1 per sequence)
80
-
81
- **Tests to run:**
82
- - A. False positive rate: does log-ratio correctly score bad questions low?
83
- - B. Signal strength: mean(sft_lp - base_lp) for real Lex vs generic questions
84
- - C. Mode coverage: does the SFT model assign high prob to diverse good questions (not just mode)
85
-
86
- **Risk:** If SFT only shifts by 0.01 nats/token after 1 epoch, signal is below noise → useless.
87
- **Mitigation:** Test first; if weak, train 3 epochs or use a Qwen 0.5B for the SFT model.
88
-
89
- ---
90
-
91
- ## Component 4: Specificity (r_specificity, 0–0.5)
92
-
93
- Word overlap between question and guest statement using content words (>5 chars, non-stopwords).
94
-
95
- ```
96
- specific_overlap = |{w ∈ question: len(w)>5} ∩ guest_words| / |{w ∈ question: len(w)>5}|
97
- r_specificity = min(specific_overlap × 2.0, 0.5)
98
- ```
99
-
100
- **Assumption:** A question that references the guest's specific terminology is more contextually grounded.
101
-
102
- **Weakness:** Penalizes questions that pivot to a NEW topic not in guest's statement (sometimes Lex's best moves).
103
- **Mitigation:** Keep weight low (0.5 max) so pivot questions still score well on NLI.
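A sketch of the specificity score, assuming the intended ratio is the share of the question's content words that also appear in the guest statement. The tokenizer and the tiny stopword set are placeholders, not the project's real ones.

```python
import re

# Placeholder stopword list; the real list is not given in this doc.
STOPWORDS = frozenset({"because", "through", "between", "should", "before"})


def content_words(text: str) -> set:
    """Lowercased words longer than 5 characters that are not stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if len(w) > 5 and w not in STOPWORDS}


def specificity_score(question: str, guest: str) -> float:
    q_words = content_words(question)
    if not q_words:
        return 0.0  # no content words, so nothing can overlap
    overlap = len(q_words & content_words(guest)) / len(q_words)
    return min(overlap * 2.0, 0.5)
```

With the ×2.0 multiplier, any question sharing a quarter or more of its content words with the guest already reaches the 0.5 cap.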
104
-
105
- ---
106
-
107
- ## Component 5: Brevity (r_brevity, 0–0.5)
108
-
109
- ```
110
- 10–40 words → +0.5
111
- 8–60 words → +0.25
112
- >100 words → −0.3
113
- ```
114
-
115
- **Assumption:** Lex's best questions are concise. Verbosity ≠ quality.
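The brevity bands can be read as a cascade, checked from the tightest band outward. A minimal sketch: whitespace word counting and a score of 0.0 for lengths the table does not cover (under 8 words, or 61 to 100) are assumptions.

```python
def brevity_score(question: str) -> float:
    """Score a question by word count using the bands above."""
    n = len(question.split())
    if 10 <= n <= 40:
        return 0.5
    if 8 <= n <= 60:
        return 0.25
    if n > 100:
        return -0.3
    return 0.0  # lengths not covered by the table
```

The −0.3 penalty falls outside the 0–0.5 range in the heading, so the combined reward may need a floor if that matters downstream.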
116
-
117
- ---
118
-
119
- ## Experiment Plan (small scale)
120
-
121
- ### Exp 1: Hard Gate False Positive Rate
122
- - Input: 200 real Lex questions from transcripts
123
- - Expected: <2 false positives (>99% pass rate)
124
- - Failure: >5% false positive → gates need loosening
125
-
126
- ### Exp 2: NLI Discrimination
127
- - Input: 50 real Lex questions + 50 ChatGPT generic follow-ups, same guest statements
128
- - Metric: Mann-Whitney U test on NLI scores
129
- - Expected: p < 0.01, Lex mean 0.3+ higher
130
- - Failure: no significant separation → NLI not useful for this task
131
-
132
- ### Exp 3: SFT Model Signal Strength
133
- - Train: 1-epoch SFT on 10K (context, question) pairs from Lex transcripts
134
- - Input: 50 real Lex questions + 50 generic, per-token log-ratio
135
- - Metric: mean(log-ratio for Lex) vs mean(log-ratio for generic)
136
- - Expected: Lex mean > generic mean by >0.05 nats/token
137
- - Failure: <0.02 difference → signal too weak, train longer or use different base
138
-
139
- ### Exp 4: Gaming Resistance
140
- - After Exp 3: sample 32 completions from base model for 5 prompts
141
- - Run through reward_v10 (all components)
142
- - Check: does reward correctly rank the 32 completions?
143
- - Manually label top-5 and bottom-5 → does reward agree with human judgment?
144
-
145
- ### Exp 5: Mini GRPO (50 steps)
146
- - 10 prompts, 4 generations each, 50 steps
147
- - Reward: v10 (full combination)
148
- - Metric: reward trend + val_score at step 50
149
- - Expected: rewards climbing without obvious collapse patterns
150
- - Failure: collapse at step 20 → reward still gameable
151
-
152
- ---
153
-
154
- ## Tradeoff Summary
155
-
156
- | | Heuristic v8 | NLI v9 | Log-ratio | Combined v10 |
157
- |--|--|--|--|--|
158
- | **Ground truth** | No | Partially | Yes (real data) | Yes |
159
- | **Speed** | 0.0s | 0.5s | 0.1s | 0.6s |
160
- | **Gameable** | Yes (step 50) | Partially | Mode-seeking | Harder |
161
- | **Interpretable** | Yes | Partially | Yes | Yes |
162
- | **Circular** | No | No | No (frozen) | No |
163
- | **Effort to build** | Done | Done | 1h SFT + test | 2h |
164
-
165
- ---
166
-
167
- ## Go/No-Go Decision
168
-
169
- Proceed with v10 if Exp 2 + Exp 3 both pass.
170
- If Exp 3 fails (weak signal): drop log-ratio, proceed with NLI-only v9.
171
- If Exp 2 fails: reconsider NLI model or switch to sentence embedding distance.
172
-
173
- ---
174
-
175
- *Author: vexorium | Date: 2026-03-29*
 
docs/REWARD_V11_DESIGN.md DELETED
@@ -1,187 +0,0 @@
1
- # Reward v11 Design Doc — Simulated Response Information Gain
2
-
3
- **Status:** Design phase
4
- **Motivation:** v10 log-ratio is a stochastic parrot — rewards questions that sound like Lex
5
- rather than questions that *work* like Lex. A good interview question makes the guest
6
- say something they wouldn't have said otherwise. That's measurable.
7
-
8
- **Decision to pivot:** v10 val_score=6.70 at step 10 beat base (6.46), but the reward
9
- signal measures style match, not function. A model optimizing v10 converges on Lex's
10
- modal questions, not his best ones.
11
-
12
- ---
13
-
14
- ## Core Idea
15
-
16
- ```
17
- reward(q | guest_context) = does q make the guest elaborate beyond what they already said?
18
- ```
19
-
20
- Concretely:
21
- 1. Policy generates question Q given guest statement G
22
- 2. Frozen guest simulator generates response R to (G, Q)
23
- 3. Measure: how much NEW, RELEVANT content does R contain vs G?
24
-
25
- ---
26
-
27
- ## Information Gain Formula
28
-
29
- ```python
30
- novelty = 1 - cosine_sim(embed(R), embed(G)) # R goes beyond G
31
- relevance = cosine_sim(embed(R), embed(Q)) # R actually addresses Q
32
-
33
- info_gain = novelty × relevance
34
- ```
35
-
36
- **Why the product?**
37
- - Random/unrelated question: novelty HIGH (R goes somewhere new) but relevance LOW → near 0
38
- - Restatement question ("So you mean X?"): novelty LOW (R just confirms G) → near 0
39
- - Deep specific question: novelty HIGH AND relevance HIGH → high reward
40
-
41
- The product forces BOTH conditions simultaneously. Neither alone is sufficient.
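The three cases above can be checked with toy vectors. This is an illustrative sketch only: the 3-D vectors stand in for real sentence embeddings, and `cosine_sim` is a plain-Python stand-in for whatever the embedding library provides.

```python
import math


def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_gain(e_G, e_Q, e_R):
    novelty = 1 - cosine_sim(e_R, e_G)    # R goes beyond G
    relevance = cosine_sim(e_R, e_Q)      # R addresses Q
    return novelty * relevance


G = (1.0, 0.0, 0.0)  # toy "guest statement" direction

# Restatement: question and response both sit on top of G.
restate = info_gain(G, (1.0, 0.0, 0.0), (1.0, 0.0, 0.0))
# Unrelated: the response is new relative to G but does not follow the question.
unrelated = info_gain(G, (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
# Deep: the response moves away from G and stays close to the question.
deep = info_gain(G, (0.6, 0.8, 0.0), (0.3, 0.9, 0.1))
```

In this toy setup `restate` and `unrelated` both come out at zero, and only `deep` gets a substantial score.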
42
-
43
- ---
44
-
45
- ## Guest Simulator
46
-
47
- **Model:** `Qwen/Qwen2.5-0.5B-Instruct` (already loaded, frozen)
48
-
49
- **Why a different model than the policy:**
50
- - Policy is Nemotron 4B (Mamba hybrid)
51
- - Guest sim is Qwen 0.5B — different architecture, different weights
52
- - No gradient connection between them
53
- - Frozen throughout training
54
-
55
- **Why it doesn't need to be good:**
56
- The guest sim only needs to respond DIFFERENTLY to different questions.
57
- GRPO compares questions within a group (leave-one-out advantage).
58
- If sim responds more elaborately to Q1 than Q2, Q1 scores higher — that's the signal.
59
- Absolute quality of simulated responses doesn't matter, relative ordering does.
60
-
61
- **Prompt format:**
62
- ```
63
- System: You are an expert being interviewed. Give a substantive, specific answer
64
- to the follow-up question based on what you just said.
65
- User: [guest_statement]
66
-
67
- Follow-up question: [question]
68
- Assistant: [response]
69
- ```
70
-
71
- **Generation params:** max_new_tokens=150, temperature=0.7, greedy fallback
72
-
73
- ---
74
-
75
- ## Embedding Model
76
-
77
- `sentence-transformers/all-MiniLM-L6-v2` (384-dim, already used in project)
78
- - Fast: ~0.1ms per embedding on GPU
79
- - Reasonable semantic similarity
80
-
81
- ---
82
-
83
- ## Full Reward
84
-
85
- ```python
86
- def reward_v11(q, guest, sim_response):
87
-     # Hard gates (unchanged from v10)
88
-     if not passes_hard_gates(q):
89
-         return 0.0
90
-
91
-     # Embeddings
92
-     e_G = embed(guest)  # original guest statement
93
-     e_Q = embed(q)  # question
94
-     e_R = embed(sim_response)  # simulated response
95
-
96
-     # Core: information gain
97
-     novelty = 1 - cosine_sim(e_R, e_G)  # 0–1
98
-     relevance = cosine_sim(e_R, e_Q)  # 0–1
99
-     info_gain = novelty * relevance  # 0–1
100
-
101
-     # Secondary: brevity, specificity (small weight)
102
-     brevity = brevity_score(q)  # 0–0.5
103
-     specificity = specificity_score(q, guest)  # 0–0.3
104
-
105
-     # Final
106
-     reward = 1.0 + info_gain * 2.5 + brevity + specificity  # 1.0–4.3
107
-     return clip(reward, 0.0, 5.0)
108
- ```
109
-
110
- ---
111
-
112
- ## Speed Analysis
113
-
114
- | Component | Time (32 completions) | Notes |
115
- |-----------|----------------------|-------|
116
- | Hard gates | ~0.01s | Regex |
117
- | Guest sim generation | ~8–15s | Qwen 0.5B, 150 tokens, batched |
118
- | Embeddings | ~0.5s | MiniLM, batched |
119
- | **Total overhead** | **~15s** | vs current 0s |
120
- | Step time impact | ~+10% | step currently ~150s |
121
-
122
- Acceptable.
123
-
124
- ---
125
-
126
- ## Assumptions
127
-
128
- 1. **Guest sim responds differently to different questions** — verified by design (causal LM)
129
- 2. **Embedding cosine distance captures "new information"** — approximate but consistent
130
- 3. **Product(novelty, relevance) is ungameable**: a model must produce questions whose
131
- simulated response R both goes beyond G AND addresses Q; neither condition alone suffices
132
- 4. **Qwen 0.5B is stable as frozen reference** — no drift risk (frozen)
133
-
134
- ---
135
-
136
- ## Failure Modes
137
-
138
- | Risk | Mitigation |
139
- |------|-----------|
140
- | Guest sim always gives same length response | Check variance in response length across questions |
141
- | Embedding space doesn't capture deep vs shallow | Validate on known good/bad pairs (Exp 1) |
142
- | Sim response too short to embed meaningfully | Min length gate: sim_response >= 20 words |
143
- | Policy learns to ask confusing questions (high novelty, low relevance) | Relevance term in product prevents this |
144
-
145
- ---
146
-
147
- ## Experiment Plan
148
-
149
- ### Exp A: Guest Sim Variance Check
150
- - 5 guest statements × 5 questions each (mix good/bad)
151
- - Check: does sim_response vary meaningfully across questions?
152
- - Expected: std(len(responses)) > 20 tokens within same guest
153
-
154
- ### Exp B: Info Gain Discrimination
155
- - 30 real Lex questions vs 30 generic questions (same guests)
156
- - Run through full reward_v11
157
- - Expected: real Lex mean > generic mean (p < 0.05)
158
- - If fails: embedding model is wrong choice
159
-
160
- ### Exp C: Mini GRPO 50 steps
161
- - Same setup as v10 mini run
162
- - Watch for collapse and val_score trajectory
163
- - Success: val_score > 6.46 (base) at step 30 without collapse
164
-
165
- ---
166
-
167
- ## Go/No-Go
168
-
169
- - If Exp A fails: guest sim model needs to be larger or prompted differently
170
- - If Exp B fails: pivot embedding model or add NLI component
171
- - If both pass: launch full 500-step run
172
-
173
- ---
174
-
175
- ## What This Optimizes vs v10
176
-
177
- | | v10 (log-ratio) | v11 (info gain) |
178
- |--|--|--|
179
- | Optimizes for | Looking like Lex | Working like Lex |
180
- | Stochastic parrot? | Yes | No |
181
- | Grounded in? | Lex's distribution | Guest's elaboration |
182
- | Gameable by? | Mode-seeking | Nothing cheap |
183
- | Cost | 0.1s/step | +15s/step |
184
-
185
- ---
186
-
187
- *Author: vexorium | Date: 2026-03-29*
 
docs/REWARD_V13_DESIGN.md DELETED
@@ -1,106 +0,0 @@
1
- # Reward v13 Design — Guest-Grounded, Anti-Meta, Anti-Generic
2
-
3
- ## Why v13 exists
4
-
5
- GRPO v21 and v22 taught two different lessons:
6
-
7
- - **v21** had the best thinking-enabled eval score (**0.867**) but still suffered from clipped / meta-spill failures.
8
- - **v22** greatly reduced clipping, but final eval dropped to **0.813**.
9
-
10
- This suggests the bottleneck is no longer primarily generation length. The next bottleneck is the **reward geometry**.
11
-
12
- ## Failure modes observed
13
-
14
- ### v21 eval outputs
15
- - explicit meta-spill: **12%**
16
- - generic/opening-style questions: **8%**
17
-
18
- ### v22 eval outputs
19
- - explicit meta-spill: **4%**
20
- - generic/opening-style questions: **16%**
21
- - occasional off-domain drift remained
22
-
23
- Interpretation:
24
- - v22 solved more of the truncation problem
25
- - but shifted toward a cleaner, shorter, more generic local optimum
26
-
27
- ## Design goals
28
-
29
- 1. Keep the strong parts of v12:
30
- - `uses_guest`
31
- - `probing`
32
- 2. Penalize obvious meta/task-restatement failures
33
- 3. Penalize weak generic-opener templates when not lexically anchored to the guest
34
- 4. Add a **soft overthinking penalty**
35
- 5. Do **not** reward long hidden thinking directly
36
-
37
- ## Core formula
38
-
39
- ```python
40
- base = uses_guest**0.67 * probing**0.33
41
- reward = min(base + lexical_bonus, 1.0) * soft_penalty
42
- ```
43
-
44
- ### Base semantic reward
45
- - `uses_guest`: primary bottleneck
46
- - `probing`: secondary guardrail
47
-
48
- ### Lexical bonus
49
- Small additive bonus for vocabulary echo from the guest statement.
50
-
51
- ### Soft penalty
52
- Multiplicative penalty for:
53
- - explicit meta spill (`the user is asking...`, `I need to...`)
54
- - generic openers with weak lexical anchoring
55
- - obvious drift patterns
56
- - excessively long hidden thinking
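The core formula and its three parts can be put together in one function. A minimal sketch, assuming `uses_guest` and `probing` arrive as scores in [0, 1] and that `lexical_bonus` and `soft_penalty` are computed elsewhere; the default values are placeholders.

```python
def reward_v13(uses_guest: float, probing: float,
               lexical_bonus: float = 0.0, soft_penalty: float = 1.0) -> float:
    """Weighted geometric blend, capped at 1.0, then scaled by the soft penalty."""
    base = uses_guest ** 0.67 * probing ** 0.33
    return min(base + lexical_bonus, 1.0) * soft_penalty
```

The exponents make `uses_guest` the dominant term: halving it costs more reward than halving `probing`, and a zero on either term zeroes the base.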
57
-
58
- ## Why not reward long thinking?
59
-
60
- Because v22 already showed that reducing clipping and allowing longer scratchpads did **not** automatically improve the final interviewer policy. Long thinking helps only if it improves the visible question. Therefore:
61
-
62
- - allow sufficient budget (`MAX_NEW_TOKENS=2560`, `MAX_SEQ=4096`)
63
- - but do not give extra reward for consuming it
64
- - instead, mildly discourage very long hidden reasoning once it goes past a healthy band
65
-
66
- ## Soft overthinking penalty schedule
67
-
68
- Current intended schedule (based on hidden think token count):
69
- - <= 900 tokens: no penalty
70
- - > 900: ×0.95
71
- - > 1400: ×0.85
72
- - > 2000: ×0.70
73
-
74
- This is intentionally soft. The goal is not to suppress reasoning, only to discourage pathological rambling.
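The schedule above is a step function of the hidden think token count. A minimal sketch; the function name is hypothetical and the thresholds are copied from the list.

```python
def overthinking_penalty(think_tokens: int) -> float:
    """Multiplicative penalty from the hidden-thinking token count."""
    if think_tokens > 2000:
        return 0.70
    if think_tokens > 1400:
        return 0.85
    if think_tokens > 900:
        return 0.95
    return 1.0  # at or below 900 tokens: no penalty
```

The result would multiply into the soft penalty alongside the meta-spill and generic-opener terms.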
75
-
76
- ## Proposed run policy
77
-
78
- For the next experiment, prefer a **clean reward ablation**:
79
- - same generation budget as **GRPO v21**
80
- - `MAX_NEW_TOKENS=1600`
81
- - `MAX_SEQ=3072`
82
- - start from **GRPO v21**, because v21 is still the best policy checkpoint
83
-
84
- This isolates the reward change better than changing both reward and token budget at once.
85
-
86
- ## Challenge to the current plan
87
-
88
- The main temptation is to assume that the reward is being "hacked by long thinking" and therefore to either:
89
- - reward long thinking more, or
90
- - punish it aggressively
91
-
92
- Both are likely wrong.
93
-
94
- The better interpretation is:
95
- - long thinking changed the sampling dynamics
96
- - the reward still accepted a too-generic local optimum
97
- - therefore the next fix should be **better answer-side reward shaping**, not more hidden-thought optimization
98
-
99
- ## Recommendation
100
-
101
- Best next experiment:
102
- - `reward_v13`
103
- - same generation length as **v21**
104
- - start from `grpo-v21`
105
- - keep clip penalty
106
- - compare directly against v21 under thinking-enabled eval
 
docs/RL_VS_FILTERING_ANALYSIS_2026-03-30.md DELETED
@@ -1,107 +0,0 @@
1
- # RL vs Data Filtering — First Principles Analysis
2
- *2026-03-30*
3
-
4
- ## The Question
5
- Can RL (GRPO) address the template contamination problem better than filtering the training dataset?
6
-
7
- ## Root Cause (Recap)
8
- Template contamination: 30% of generated training data uses generic openers
9
- ("How do you think/reconcile/balance...") that require no vocabulary echo.
10
- Real Lex uses these only 2% of the time. The model learned a prior:
11
- `P(first_token='How' | interview_context)` is very high.
12
-
13
- ## Why RL Is Genuinely Better at This
14
-
15
- ### Where the template is stored
16
- The prior `P("How do you"|context)` lives in the embedding + early Mamba-2 SSM layers.
17
- These carry the autoregressive state that says "I'm in an interview, generate a question."
18
-
19
- ### SFT (filtering) mechanism
20
- - Cross-entropy on (guest, question) pairs
21
- - Updates 4 attention LoRA weights only (1.01% of params)
22
- - The Mamba SSM state that GENERATES "How" is **unchanged**
23
- - Filtering reduces how many times templates appear in training targets
24
- - The prior shrinks slightly but persists — you can't subtract probability with cross-entropy
25
-
26
- ### GRPO mechanism
27
- ```
28
- Loss = -A_i × log P(question_i | guest)
29
- A_i = (r_i - mean(r)) / std(r)
30
- ```
31
- - Echo questions → A_i > 0 → P(echo tokens | guest) **increases**
32
- - Template questions → A_i < 0 → P("How do you think" | guest) **decreases**
33
- - This gradient flows through **all 42 layers** via backprop, including Mamba SSM
34
- - RL can SUBTRACT probability mass from templates — SFT cannot
35
-
36
- **Critical constraint**: GRPO weight updates still only apply to the 40.5M LoRA params.
37
- The gradient *signal* reaches all layers, but only attention weights *change*.
38
- So RL suppresses templates via attention modulation of Mamba output — indirect but effective.
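The advantage formula in the GRPO mechanism block can be sketched for a single rollout group. Population standard deviation and the small `eps` guard against zero variance are assumptions; the formula above does not specify either.

```python
import statistics


def group_advantages(rewards, eps: float = 1e-8):
    """Normalize rewards within one rollout group: (r_i - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Two echo questions (reward 1.0) and two template questions (reward 0.0).
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Echo questions get a positive advantage and template questions a negative one. A group where every rollout earns the same reward yields near-zero advantages, which is why rollout variance matters for the signal.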
39
-
40
- ### The generation variance evidence
41
- The model already generates non-template openers ~8% of the time:
42
- - "because you mentioned..."
43
- - "you mentioned X..."
44
- - "you say the..."
45
- - "you mentioned 'regulatory'..."
46
-
47
- These are the seeds RL needs to amplify. GRPO will:
48
- 1. Reward these when they appear (positive advantage)
49
- 2. Penalize "How do you think" when competing (negative advantage)
50
- 3. Shift the distribution from ~8% echo openers → ~30%+
51
-
52
- ## Why Both Together Is Optimal
53
-
54
- | Approach | What it fixes | What it can't fix |
55
- |----------|--------------|-------------------|
56
- | Filtering | Removes positive template training signal | Can't subtract existing prior |
57
- | RL (GRPO) | Actively penalizes templates, amplifies echoes | Can't directly rewrite Mamba SSM weights |
58
- | **Both** | Data-level + optimization-level suppression | — |
59
-
60
- **Ordering**: Filter first → LoRA v2 SFT → then GRPO from v2 checkpoint.
61
- - Filtered SFT reduces template prior (weakened starting point for RL)
62
- - RL has more variance in rollout groups when ~15-20% echo vs ~8% from LoRA v1
63
- - GRPO has stronger signal to amplify echoes and suppress templates
64
-
65
- ## Optimal Reward Function: reward_v12
66
-
67
- ```python
68
- reward = geometric_mean(logit_gap_uses_guest, logit_gap_probing) × hard_gate
69
-
70
- logit_gap = log P(YES) - log P(NO) # continuous, from Qwen3.5-4B judge
71
- ```
72
-
73
- **Why geometric mean of logits (not binary product)**:
74
- - Continuous → dense gradient at every step (binary gives 0 when all True)
75
- - Product structure → can't game one dimension alone
76
- - Geometric mean → equal weight, penalizes weakest link
77
- - `uses_guest × probing` specifically: fixes bottleneck (56%) while protecting strength (92%)
78
-
79
- **Why not all 3 judges in reward**:
80
- - on_topic at 72% is not the bottleneck; adding it makes reward sparser
81
- - Start with 2-judge reward, add on_topic if it degrades during training
82
-
83
- **Why not reward_v11 (info-gain)**:
84
- - Experimentally anti-correlated with uses_guest (-0.098) and probing (-0.244)
85
- - Off-topic question scored higher than specific on-topic ones (Exp 2)
86
- - Historical: GRPO v11 dropped uses_guest -8pp (Exp 3)
87
-
88
- ## Variance Analysis
89
- - Current LoRA v1: uses_guest=56%, probing=92%
90
- - Expected group variance (n=8): 0.20 (sufficient for GRPO)
91
- - uses_guest at 56% creates good rollout variance — some echo, some template
92
- - At 72%+ uses_guest, variance would shrink and RL signal weakens (natural convergence)
93
-
94
- ## Recommended Execution Plan
95
-
96
- ```
97
- Step 1: LoRA v2 SFT (filtered + upsampled, ~30 min)
98
- - Remove 1,247 generic-opener generated pairs
99
- - Upsample real Lex 6×
100
- - Expected: uses_guest 56% → ~65%
101
-
102
- Step 2: GRPO from LoRA v2 checkpoint (reward_v12, ~3h)
103
- - reward = geom_mean(logit_uses_guest, logit_probing) × gate
104
- - Starting point: 0.73+ with ~65% uses_guest
105
- - Expected: uses_guest 65% → 75%+, maintain probing ≥ 90%
106
- - Full score target: 0.80+
107
- ```
 
docs/SSM_SCAN_FIX_PLAN.md DELETED
@@ -1,202 +0,0 @@
1
- # Plan: Implement Correct Mamba-2 SSD Scan for GB10
2
-
3
- > Date: 2026-03-25
4
- > Challenge: Replace NVIDIA's broken `torch_forward` with a correct pure-PyTorch SSD scan
5
-
6
- ## Resources Found
7
-
8
- ### 1. `vasqu/mamba2-torch` — Pure PyTorch Mamba-2 with correct SSD scan ⭐
9
- - **URL**: https://github.com/vasqu/mamba2-torch
10
- - **Key file**: `tests/ssd_minimal.py` — clean, correct `ssd_minimal_discrete()` function
11
- - Supports: Triton kernels, Triton-only, or **pure PyTorch** (toggle via `config.use_triton_kernels = False`)
12
- - Claims output logits match reference implementation
13
- - Uses `einsum` operations matching the paper's math
14
-
15
- ### 2. `tommyip/mamba2-minimal` — Minimal Mamba-2 in PyTorch
16
- - **URL**: https://github.com/tommyip/mamba2-minimal
17
- - "The model's output logits follow the same distribution as the reference implementation but are not numerically equivalent"
18
- - Device agnostic (CPU, MPS, CUDA)
19
-
20
- ### 3. `alxndrTL/mamba.py` — Simple Mamba implementation
21
- - **URL**: https://github.com/alxndrTL/mamba.py
22
- - Has mamba2.py (beta) — "numerically equivalent" to reference
23
- - Implements scan as sequential loop
24
-
25
- ### 4. PyTorch issue #146129 — `torch.compile` on Mamba2 produces NaNs
26
- - Confirms numerical sensitivity of the SSD algorithm
27
- - Even `torch.compile` with Inductor breaks it
28
-
29
- ## Root Cause Analysis
30
-
31
- NVIDIA's `torch_forward` in `modeling_nemotron_h.py` has a "naive ssd implementation" that differs from the reference `ssd_minimal_discrete` in several ways:
32
-
33
- 1. **Manual reshape/permute/loop** instead of `einsum` — error-prone dimension handling
34
- 2. **`segment_sum` implementation** may differ from the stable `segsum` in the reference
35
- 3. **Inter-chunk state propagation** uses different decay formula
36
- 4. **BF16 intermediate precision** — some operations stay in BF16 when they should be FP32
37
-
38
- The reference `ssd_minimal_discrete` uses:
39
- - `torch.einsum` for all contractions (clear, correct)
40
- - `segsum` with masked fill to `-inf` (numerically stable)
41
- - All operations in FP32 via `assert X.dtype == A.dtype` (enforced)
42
-
43
- ## Step-by-Step Plan
44
-
45
- ### Step 1: Port `ssd_minimal_discrete` to GB10
46
- - Copy the reference implementation from `vasqu/mamba2-torch`
47
- - Ensure it works standalone with correct shapes
48
- - **Deliverable**: `ssd_scan_correct.py` with working function
49
-
50
- ### Step 2: Validate against A100 reference tensors
51
- - Extract Mamba layer inputs (x, dt, A, B, C) from the first Mamba layer
52
- - Run both NVIDIA's `torch_forward` scan and our correct scan
53
- - Compare outputs with A100 reference
54
- - **Deliverable**: Test showing correct scan matches A100 within tolerance
55
-
56
- ### Step 3: Monkey-patch NemotronH to use correct scan
57
- - Replace the SSM scan portion of `torch_forward` with the correct implementation
58
- - Keep the rest (projection, conv1d, gated norm, out_proj) unchanged
59
- - Validate full model output matches A100 reference
60
- - **Deliverable**: Patch function that can be applied at model load time
61
-
62
- ### Step 4: Validate perplexity and generation quality
63
- - Run perplexity test on standard text (should be 10-50, not 250-700)
64
- - Run Q8 vs BF16 mismatch test (should be >80% top-1 agreement)
65
- - Generate text and compare with llama.cpp output
66
- - **Deliverable**: Perplexity < 50, top-1 agreement > 70%
67
-
68
- ### Step 5: Integrate into training pipeline
69
- - Update `grpo_v4_train.py` or `sft_train.py` to use the correct scan
70
- - Run SFT training and verify loss is reasonable
71
- - Run GRPO training and verify stable loss
72
- - **Deliverable**: Working training pipeline on GB10
73
-
74
- ### Step 6: Full training run + evaluation
75
- - SFT on interview data (2000 samples, 3 epochs)
76
- - Evaluate against base model (target: improve on 4.35/5)
77
- - **Deliverable**: Trained LoRA adapter
78
-
79
- ## Critical Finding: llama.cpp Uses Sequential Recurrence, NOT Chunked SSD
80
-
81
- **PR #18058** shows llama.cpp implements Mamba-2 as a **simple sequential recurrence** in `ggml_compute_forward_ssm_scan_f32`:
82
-
83
- ```c
84
- // Per token, per head:
85
- dA = exp(softplus(dt) * A); // scalar decay
86
- // Per state dimension:
87
- s_new = s_old * dA + B * x * dt; // state update
88
- y = sum(s_new * C); // output
89
- ```
90
-
91
- This is:
92
- - **Numerically exact** — no cumulative sums, no segment sums, no exp of large values
93
- - **Simple** — ~30 lines of C, trivially portable to PyTorch
94
- - **The reference that works** — llama.cpp produces correct output
95
-
96
- NVIDIA's `torch_forward` uses the **SSD chunked algorithm** which is mathematically equivalent but numerically different. The chunked form computes `exp(cumsum(A))` which produces extreme values (A reaches -5569) causing underflow.
97
-
98
- **We should implement the sequential recurrence form, not the SSD chunked form.**
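The recurrence above can be sketched in plain Python for one head and one channel. This is not the contents of `ssm_scan_correct.py`: the function name, the list-based shapes, and the use of the softplus-transformed `dt` in both the decay and the input term are assumptions made for illustration.

```python
import math


def softplus(v: float) -> float:
    # For large v, softplus(v) is approximately v; this avoids overflow in exp.
    return v if v > 20.0 else math.log1p(math.exp(v))


def ssm_scan_sequential_sketch(x, dt, A, B, C):
    """Sequential SSM scan for a single head and channel.

    x, dt: length-T lists of floats
    A:     scalar decay parameter (negative)
    B, C:  length-T lists of length-N lists (per-token state projections)
    """
    n_state = len(B[0])
    state = [0.0] * n_state
    outputs = []
    for t in range(len(x)):
        dt_sp = softplus(dt[t])
        dA = math.exp(dt_sp * A)  # scalar decay for this token
        # State update: decay the old state, add the new input.
        state = [state[n] * dA + B[t][n] * x[t] * dt_sp for n in range(n_state)]
        # Output: project the state with C.
        outputs.append(sum(state[n] * C[t][n] for n in range(n_state)))
    return outputs
```

Each step only multiplies by a single per-token decay in (0, 1], so no cumulative sum over `A` is ever exponentiated. The price is a Python-level loop over the sequence length.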
99
-
100
- ## Key Insight
101
-
102
- The correct scan is **even simpler than `ssd_minimal_discrete`** — just a sequential loop. The hard part isn't the algorithm — it's correctly integrating it into NVIDIA's model code, which has:
103
- - Different tensor shapes and naming conventions
104
- - Group-to-head expansion logic
105
- - Conv1d and gated norm wrapping
106
- - Cache handling for generation
107
-
108
- ## Phase A Results: BREAKTHROUGH ✅ (2026-03-25 04:20 UTC)
109
-
110
- ### Correct Scan Implementation
111
- - Ported llama.cpp's `ggml_compute_forward_ssm_scan_f32` sequential recurrence to PyTorch
112
- - File: `ssm_scan_correct.py` — `ssm_scan_sequential()` function (~40 lines)
113
- - Monkey-patches `torch_forward` on all 21 Mamba layers
114
-
115
- ### Validation Results
116
-
117
- | Metric | Old torch_forward | Correct Scan | Δ |
118
- |--------|------------------|-------------|---|
119
- | Perplexity | 250-700 | **9.02** | ~28-78x lower |
120
- | Q8 vs BF16 top-1 | 27% | **79%** | +52 pts |
121
- | Q8 vs BF16 top-5 | 43% | **90%** | +47 pts |
122
- | Training CE loss | 3.88 | **2.73** | -30% |
123
- | Loss decrease (1 step) | ~3.88→~3.88 | **2.73→2.71** | ✅ Converging |
124
- | Grad norm | 64 | **14** | ~5x lower |
125
- | Top predictions | `' '`, `'2'`, `','` | `' curiosity'`, `' empathy'`, `' the'` | Coherent! |
126
-
127
- ### Key Findings
128
- 1. The sequential recurrence (`s = s * dA + B * x * dt`) from llama.cpp produces **numerically correct** SSM scan outputs on GB10
129
- 2. The chunked SSD in `torch_forward` uses `exp(cumsum(A))` which produces extreme values (A reaches -5569) causing catastrophic underflow in BF16 — this is what caused all previous training failures
130
- 3. The corrected scan produces coherent word predictions matching llama.cpp's Q8 output
131
- 4. Training with the corrected scan shows proper loss decrease and stable gradients
132
-
133
- ## Updated Plan (2026-03-25 04:20 UTC)
134
-
135
- ### Phase A: Sequential Recurrence (correctness first)
136
- 1. Port llama.cpp's `ggml_compute_forward_ssm_scan_f32` to PyTorch (~20 lines)
137
- 2. Monkey-patch into NemotronH's Mamba layers
138
- 3. Validate against A100 reference tensors (target: >80% top-1 agreement)
139
- 4. Validate perplexity (target: <50 on common English)
140
- 5. If correct → proceed to Phase B
141
-
142
- ### Phase B Results: TRAINING VALIDATED ✅ (2026-03-25 04:30 UTC)
143
-
144
- | Metric | Value | Status |
145
- |--------|-------|--------|
146
- | CE Loss | 2.73 | ✅ Normal (was 3.88-9.99 with old scan) |
147
- | Grad norm | 14.2 | ✅ Stable (was 64-1200) |
148
- | LoRA layers with gradients | 92 | ✅ All layers |
149
- | Loss after 1 optimizer step | 2.73 → 2.71 | ✅ Decreasing |
150
-
151
- ### Phase C: SFT Training Plan
152
-
153
- **Architecture**: Pure PyTorch SFT (no llama.cpp needed for training)
154
- - Model: Nemotron-3-Nano-4B with corrected SSM scan
155
- - Training: HF Trainer + LoRA
156
- - Venv: `.venv-train` (transformers 4.48.3)
157
- - Monkey-patch all 21 Mamba layers at load time
158
-
159
- **Data**: `interview_segments_v2.jsonl` (7,580 segments)
160
- - Each segment: system prompt + guest answer + interviewer question
161
- - Format: chat template with `<|im_start|>` tags
162
- - Train/val split: 95% / 5%
163
-
164
- **Hyperparameters** (based on Kaggle notebook + our validated config):
165
- ```yaml
166
- lora_rank: 64
167
- lora_alpha: 256
168
- target_modules: all-linear
169
- learning_rate: 2e-4
170
- lr_scheduler: cosine
171
- warmup_ratio: 0.1
172
- epochs: 3
173
- batch_size: 1
174
- gradient_accumulation: 4
175
- max_seq_length: 512
176
- bf16: true
177
- max_grad_norm: 1.0
178
- ```
179
-
180
- **Estimated time**: ~3 min/step × 1500 steps = ~75 hours (sequential scan is slow)
181
-
182
- **Speed optimization**: If 75 hours is too long:
183
- - Reduce to 1000 samples × 1 epoch = ~250 steps = ~12 hours
184
- - Or implement `ssd_minimal_discrete` chunked version for parallelism
185
- - Or reduce max_seq_length to 256 (~halves time)
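The time estimates above follow from simple arithmetic on the hyperparameters. A minimal sketch, assuming the 1500-step figure corresponds to 2000 samples × 3 epochs with `batch_size=1` and `gradient_accumulation=4`, at the stated ~3 min per optimizer step; the helper name is hypothetical.

```python
def estimate_training(samples: int, epochs: int, batch_size: int = 1,
                      grad_accum: int = 4, minutes_per_step: float = 3.0):
    """Return (optimizer_steps, hours) for a given run size."""
    steps = samples * epochs // (batch_size * grad_accum)
    hours = steps * minutes_per_step / 60.0
    return steps, hours


full_run = estimate_training(2000, 3)      # the 3-epoch plan
reduced_run = estimate_training(1000, 1)   # the reduced fallback
```

These reproduce the ~75 hour and ~12 hour figures quoted above. Changing `max_seq_length` affects `minutes_per_step` rather than the step count.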
186
-
187
- **Evaluation**: After training, compare with base model (4.35/5) using the existing eval suite
188
-
189
- ### Resources
190
- - llama.cpp sequential scan: `ggml/src/ggml-cpu/ops.cpp:9284` (ggml_compute_forward_ssm_scan_f32)
191
- - vasqu/mamba2-torch chunked scan: `tests/ssd_minimal.py` (ssd_minimal_discrete)
192
- - vLLM Mamba-2 on SM 12.x: confirmed working (vllm issue #34452)
193
- - A100 reference tensors: `reference/reference_tensors.pt`
194
-
195
- ## Risk Assessment
196
-
197
- | Risk | Mitigation |
198
- |------|------------|
199
- | Correct scan still doesn't match A100 | We have reference tensors — iterate until match |
200
- | Performance too slow | Already accepted 3 min/step; correct > fast |
201
- | Integration breaks other layers | Only patch Mamba layers, leave attention/MLP untouched |
202
- | Memory issues | Same model, same memory — just different math |
 
docs/SYNTHETIC_DATA_ANALYSIS_2026-03-30.md DELETED
@@ -1,153 +0,0 @@
1
- # Synthetic Data Strategy — Full Analysis
2
- *2026-03-30 | 5 experiments run*
3
-
4
- ## The Question
5
- Can we generate more synthetic data to push `uses_guest` from 56% to 70%+ for LoRA v2?
6
-
7
- ## Answer: No — but here's what will actually work.
8
-
9
- ---
10
-
11
- ## Experiments Run
12
-
13
- ### Exp 1: reward_v11 correlation with uses_guest (n=25)
14
- - `uses_guest=True` questions score **-0.098 lower** on reward_v11
15
- - `probing=True` questions score **-0.244 lower** on reward_v11
16
- - **GRPO with reward_v11 is anti-correlated with both target dimensions**
17
-
18
- ### Exp 2: Controlled reward_v11 test (same guest, 6 question types)
19
- - Off-topic question scored 2.613 vs specific on-topic at 2.446
20
- - reward_v11 doesn't penalize off-topic behavior
21
- - **Confirmed: info-gain (novelty × relevance) rewards genericity**
22
-
23
- ### Exp 3: Historical — GRPO v11 result
24
- - 148 steps with reward_v11: uses_guest −8pp, probing −4pp
25
- - Exactly what the anti-correlation predicts
26
-
27
- ### Exp 4: reward_v11 on LoRA v1 vs Base
28
- - LoRA v1 (better model, 0.733) scores **lower** on reward_v11 than base (0.653)
29
- - The reward can't even rank the models correctly
30
-
31
- ### Exp 5: Echo-targeted prompt vs standard generation (n=60, judged) ← NEW
32
- - Hypothesis: "MUST reference specific words" prompt → more uses_guest in synthetic data
33
- - **Result: FAILED**
34
-
35
- | | Standard | Echo | Delta |
36
- |--|---------|------|-------|
37
- | score | 0.611 | 0.500 | −0.111 |
38
- | on_topic | 63% | 50% | −13pp |
39
- | uses_guest | 43% | **35%** | **−8pp** |
40
- | probing | 76% | 64% | −12pp |
41
- | vocab overlap | 1.93 | 2.57 | +0.64 |
42
-
43
- Overlap goes up but quality goes down — the echo constraint causes trivial short questions
44
- ("What is verification?", "What weight?") that have vocab overlap but fail all 3 judges.
45
-
46
- ---
47
-
48
- ## Root Cause: uses_guest Gap Is NOT a Data Volume Problem
49
-
50
- Training set already has 96% uses_guest=True labels (4,772 pairs).
51
- Model achieves only 56% on held-out eval. Gap = generalization issue, not coverage.
52
-
53
- Two explanations:
54
- 1. **Label noise**: judge sees full guest paragraph, model trained on truncated (512 tok) input.
55
- Overlap words present in full text but not in what model sees at training time.
56
- 2. **Distribution shift**: model learned vocab-echo for in-distribution topics,
57
- generalizes to question *style* but not vocabulary echo on new domains.
58
-
59
- **Either way: more data won't fix it.** Scaling model: +5,000 pairs → only +12pp.
60
-
61
- ## Data Available (if needed)
62
-
63
- | Pool | Size | Notes |
64
- |------|------|-------|
65
- | score=1.0 (current training) | 4,772 | Already used |
66
- | score=0.667, overlap≥2 | 3,979 | Unused reservoir |
67
- | Transcript segments total | ~28,728 | 114 episodes |
68
- | Unused unique guest statements | 936 | Could generate new completions |
69
-
70
- ## Paths That Will Actually Work
71
-
72
- ### PATH 1: Loss-weighted SFT (RECOMMENDED FIRST) ← no new data
73
- Weight training loss by vocab overlap: `loss × (1 + 0.5 × overlap_count)`
74
- Forces model to pay more attention to high-overlap examples.
75
- No new data, no new generation, ~30 min.
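The weighting in PATH 1 can be sketched without any training framework. This is an illustration only: it applies `1 + 0.5 × overlap_count` to per-example losses and, as an assumption not stated in the source, normalizes by the sum of weights so the loss scale stays comparable to the unweighted mean. Function names are hypothetical.

```python
def overlap_weight(overlap_count: int, scale: float = 0.5) -> float:
    """Per-example weight from the formula in the text: 1 + scale * overlap_count."""
    return 1.0 + scale * overlap_count


def weighted_batch_loss(losses, overlap_counts, scale: float = 0.5) -> float:
    """Weighted mean of per-example losses.

    Dividing by the sum of weights is an assumption made here so the result
    stays on the same scale as a plain mean.
    """
    weights = [overlap_weight(c, scale) for c in overlap_counts]
    weighted_sum = sum(w * l for w, l in zip(weights, losses))
    return weighted_sum / sum(weights)


# Three examples with 0, 2 and 4 overlapping words get weights 1, 2 and 3.
losses = [2.0, 2.0, 4.0]
overlaps = [0, 2, 4]
print(sum(losses) / len(losses))              # plain mean
print(weighted_batch_loss(losses, overlaps))  # 3.0, pulled toward the high-overlap example
```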
76
-
77
- ### PATH 2: Second-stage SFT on high-overlap subset ← no new data
78
- Filter to overlap≥3 pairs (1,579 total: 1,390 gen + 189 real).
79
- Fine-tune LoRA v1 for 1 more epoch on only these pairs.
80
- ~10 min training.
81
-
82
- ### PATH 3: New transcript sources ← genuinely new data
83
- Tim Ferriss, Dwarkesh Patel, Sean Carroll podcasts — same format, diverse domains.
84
- ~2h to crawl + judge, could yield 5-10k high-quality pairs.
85
- Only needed if Paths 1+2 plateau below 65%.
86
-
87
- ### PATH 4: reward_v12 (judge-as-reward GRPO) ← most powerful
88
- Use the 3 binary judges directly as GRPO reward.
89
- Directly optimizes the metric — no proxy mismatch.
90
- Starting from LoRA v1 (0.733 baseline).
91
- ~3h to implement + test.
92
-
93
- ## Decision Tree
94
-
95
- ```
96
- Try PATH 2 (10 min) → eval
97
- if uses_guest ≥ 65% → try PATH 1 too, then PATH 4
98
- if uses_guest < 60% → PATH 2 didn't work, go to PATH 1 with higher weights
99
-
100
- If plateau at ~65% after 1+2:
101
- Build PATH 4 (reward_v12) — this is the ceiling-breaker
102
- ```
103
-
104
- ## Do NOT Do
105
- - ✗ Echo-targeted generation (Exp 5: makes everything worse)
106
- - ✗ More of the same synthetic data (Exp 1-4: wrong signal, won't generalize)
107
- - ✗ GRPO with reward_v11 (anti-correlated with target dimensions)
108
-
109
- ---
110
-
111
- ## Deep Dive: Why New Podcast Sources Don't Help (Experiments 6-8, 2026-03-30)
112
-
113
- ### Experiment 6: OOV Rate vs uses_guest Correlation
114
- **Question**: Do guests with vocabulary not seen in training cause uses_guest failures?
115
-
116
- - High-OOV guests (≥30% novel words, n=9): uses_guest = **56%**
117
- - Low-OOV guests (<30% novel words, n=16): uses_guest = **56%**
118
- - **Delta: about -1pp, within rounding of zero. No measurable correlation at this sample size.**
119
-
120
- Domain diversity does not predict success or failure. The model succeeds and fails equally on familiar and novel topics.
121
-
122
- ### Experiment 7: Failure Mode Classification
123
- 11 uses_guest=False cases broken down:
124
- - **Off-topic (6/11, 55%)**: on_topic=False — model asked about different subject entirely
125
- - **Generic-probing (5/11, 45%)**: on_topic=True, probing=True — right direction, wrong vocab
126
-
127
- Both failure types share the same pattern: generic openers ("How do you think/see/envision")
128
- that don't require referencing the guest's specific words.
129
-
130
- ### Experiment 8: Template Contamination Discovery
131
- The smoking gun:
132
-
133
- | Source | Generic opener rate | Examples |
134
- |--------|-------------------|---------|
135
- | Real Lex (697 pairs) | **2%** | "Can you speak to...", "Do you think that's...", "What is..." |
136
- | Generated (4,075 pairs) | **30%** | 253× "How do you think", 198× "Why do you think", 159× "How do you reconcile" |
137
-
138
- The generated data (85% of training) teaches 5 near-universal templates.
139
- These openers work for ANY guest without referencing specific vocabulary.
140
- The model learned to use them universally → uses_guest fails on novel prompts.
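A generic-opener rate like the one in the table can be measured with a simple prefix match. The sketch below is hypothetical: the opener list is assembled from the phrases quoted in Experiments 7 and 8, and the source does not give the exact five templates or the matching rule it used.

```python
import re

# Illustrative list built from phrases quoted in the text; not the project's actual list.
GENERIC_OPENERS = (
    "how do you think",
    "how do you see",
    "how do you envision",
    "why do you think",
    "how do you reconcile",
)

_OPENER_RE = re.compile(
    r"^\s*(?:" + "|".join(re.escape(o) for o in GENERIC_OPENERS) + r")\b",
    re.IGNORECASE,
)


def is_generic_opener(question: str) -> bool:
    """True if the question starts with one of the generic opener phrases."""
    return _OPENER_RE.match(question) is not None


def generic_opener_rate(questions) -> float:
    """Fraction of questions that start with a generic opener."""
    questions = list(questions)
    if not questions:
        return 0.0
    return sum(is_generic_opener(q) for q in questions) / len(questions)


sample = [
    "How do you think about consciousness?",
    "Can you speak to the verification step?",
]
print(generic_opener_rate(sample))  # 0.5
```

Matching only at the start of the question keeps the check cheap, at the cost of missing openers that follow a short preamble.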
141
-
142
- ### Why Dwarkesh/Carroll Podcasts Won't Fix This
143
- 1. The failure is template contamination, not domain coverage — OOV test proves it
144
- 2. New podcasts would be generated the same way → same base-model template bias
145
- 3. Mixed interviewer styles could dilute the real Lex signal
146
- 4. **Exception**: REAL Dwarkesh questions (not generated) show vocab-echo naturally — but only if you use his actual words from transcripts, not model-generated imitations
147
-
148
- ### Revised Solution: Filter + Upsample (no new data needed)
149
- - Remove 1,247 generic-opener generated pairs (26% of training set)
150
- - Upsample 697 real Lex pairs 6× → 4,182 effective examples
151
- - Real Lex weight: 15% → 54% of training signal
152
- - Expected: uses_guest 56% → 65-70%
153
- - Cost: zero new data, ~30 min training
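The filter-and-upsample arithmetic can be checked with a few lines. This is a rough sketch that counts examples only; the function name is hypothetical. Counted this way, the real-Lex share after the change comes out nearer 60% than the 54% quoted above, so the source figure presumably rests on a weighting that is not spelled out here.

```python
def training_mix(real_pairs: int, generated_pairs: int, removed_generated: int, upsample: int) -> dict:
    """Composition of the training set after filtering generated pairs and upsampling real ones."""
    real_effective = real_pairs * upsample
    generated_kept = generated_pairs - removed_generated
    total = real_effective + generated_kept
    return {
        "real_effective": real_effective,
        "generated_kept": generated_kept,
        "total": total,
        "real_share": real_effective / total,
    }


# Figures from the text: 697 real pairs, 4,075 generated pairs, 1,247 removed, 6x upsampling.
before_share = 697 / (697 + 4075)
mix = training_mix(real_pairs=697, generated_pairs=4075, removed_generated=1247, upsample=6)

print(round(before_share, 2))       # 0.15
print(mix["real_effective"])        # 4182
print(mix["generated_kept"])        # 2828
print(round(mix["real_share"], 2))  # 0.6
```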
 
docs/TECHNICAL_CHALLENGES.md DELETED
@@ -1,181 +0,0 @@
1
- > **Status:** 📋 PRE-PROJECT — Technical risks identified before starting. To be updated as challenges are resolved or new ones emerge.
2
-
3
- # Technical Challenges — Lex Fridman AI Interviewer
4
-
5
- ## P0: Must Resolve Before Full Pipeline
6
-
7
- ### 1. Speaker Separation in Transcripts
8
-
9
- Existing HF datasets (`Drewd/lex_fridman_podcast_transcripts`, `Whispering-GPT/...`) are **flat text** — no speaker labels. We need to know which text is Lex and which is the guest.
10
-
11
- **Options:**
12
- | Approach | Accuracy | Speed | Cost |
13
- |----------|----------|-------|------|
14
- | Whisper + pyannote diarization | ~95% | Slow (real-time per episode) | Free (local) |
15
- | Heuristic splitting (short=Lex, long=guest) | ~80% | Fast | Free |
16
- | LLM-based splitting (Claude/GPT-4) | ~98% | Medium | ~$0.50/episode |
17
-
18
- **Mitigation:** Test diarization on 3 episodes first. If too slow, try LLM splitting on a batch.
19
-
20
- ### 2. Mamba-2 + LoRA Compatibility
21
-
22
- Nemotron 3 Nano 4B has:
23
- - **Mamba-2 layers** (majority) — state-space model, NOT attention
24
- - **4 attention layers** only
25
-
26
- LoRA traditionally targets `q_proj, k_proj, v_proj, o_proj` — but Mamba-2 layers don't have these.
27
-
28
- **Key questions:**
29
- - Which Mamba-2 parameters does LoRA target? (`in_proj`, `out_proj`, `dt_proj`?)
30
- - Is the LoRA effective on Mamba layers, or does it only fine-tune the 4 attention layers?
31
- - If LoRA only hits 4/N layers, we might need full fine-tune instead
32
-
33
- **Mitigation:** Run a 5-minute SFT test with Unsloth. Check which layers get LoRA adapters. Inspect `model.print_trainable_parameters()`.
34
-
35
- ---
36
-
37
- ## P1: Must Validate Early
38
-
39
- ### 3. Inverted Role Training
40
-
41
- All instruction-tuned models are trained: user asks → assistant answers. We're doing the opposite: assistant asks → user answers.
42
-
43
- **Risks:**
44
- - Model slips into "answering mode" and lectures instead of questioning
45
- - Model generates both the question AND the answer
46
- - Model refuses to ask questions and tries to be helpful instead
47
-
48
- **Mitigation:** Train on 100 examples, evaluate whether the model asks or answers. If SFT can't overcome the instruction-tuning prior, GRPO with a question-format reward may be needed.
49
-
50
- ### 4. Model Freshness (Released 2 Days Ago)
51
-
52
- - Unsloth claims day-zero support but edge cases/bugs are likely
53
- - vLLM requires `>=0.15.1` for Nemotron 3
54
- - HF `transformers` support may need latest main branch
55
- - Custom reasoning parser needed for inference (`nano_v3_reasoning_parser.py`)
56
- - Community examples are near-zero — we're early adopters
57
-
58
- **Mitigation:** Verify Unsloth SFT runs end-to-end on 10 fake examples before committing to the full data pipeline.
59
-
60
- ---
61
-
62
- ## P2: Address During Development
63
-
64
- ### 5. Evaluation is Harder Than Routangseng
65
-
66
- For routangseng we checked: does it start with a judgment? Does it have analogies? Simple heuristics worked.
67
-
68
- For an interviewer, "good question" is much more subjective:
69
- - Is "What is consciousness?" good for a neuroscientist? Yes. For a plumber? No.
70
- - Context-dependence makes heuristic eval harder
71
- - May need to rely more on LLM judge or human eval
72
-
73
- **Possible heuristics:**
74
- - Ends with `?` (asks a question, not a statement)
75
- - References the guest's last answer (listening)
76
- - Under 50 words (short questions = Lex style)
77
- - Doesn't repeat a previous question
78
- - Doesn't contain generic filler ("That's interesting, tell me more")
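The heuristics listed above can be sketched as independent boolean checks. Everything here is an assumption for illustration: the content-word filter (words of five or more letters), the filler phrase list, and the function names are not from the project.

```python
import re

# Filler phrases taken from the example in the list above; extend as needed.
FILLER_PHRASES = ("that's interesting", "tell me more")


def _content_words(text: str) -> set:
    """Crude content-word filter (assumption): lowercase words of 5+ letters."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if len(w) >= 5}


def _normalize(text: str) -> str:
    """Lowercase and drop punctuation so near-identical questions compare equal."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def question_checks(question: str, guest_answer: str, previous_questions=(), max_words: int = 50) -> dict:
    """Run the five heuristic checks and return one boolean per check."""
    lowered = question.lower()
    seen = {_normalize(p) for p in previous_questions}
    return {
        "ends_with_question_mark": question.strip().endswith("?"),
        "references_guest": bool(_content_words(question) & _content_words(guest_answer)),
        "under_word_limit": len(question.split()) < max_words,
        "not_repeated": _normalize(question) not in seen,
        "no_generic_filler": not any(p in lowered for p in FILLER_PHRASES),
    }


guest = "We spent two years on formal verification of the compiler."
print(question_checks("What made verification of the compiler so hard?", guest))
print(question_checks("That's interesting, tell me more.", guest))
```

Returning one flag per check, rather than a single score, makes it easier to see which heuristic a question fails.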
79
-
80
- ### 6. Question Diversity Collapse
81
-
82
- Risk: the model learns 5-10 "Lex templates" and rotates through them:
83
- - "What does X mean to you?"
84
- - "Take me back to when you first..."
85
- - "What gives you hope?"
86
-
87
- This is the equivalent of routangseng's "用户问..." problem — a format the model falls into. Harder to detect heuristically.
88
-
89
- **Mitigation:** Monitor during eval. If detected, add GRPO with a diversity reward (penalize n-gram overlap with previous questions in the conversation).
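One way to read the diversity reward described above is as one minus the worst n-gram overlap with any earlier question in the conversation. This is a sketch under that assumption; the trigram size, whitespace tokenization, and function names are illustrative choices, not the project's reward.

```python
def ngrams(text: str, n: int = 3) -> set:
    """Set of word n-grams from a whitespace-tokenized, lowercased string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def diversity_reward(question: str, previous_questions, n: int = 3) -> float:
    """1.0 for a question sharing no n-grams with earlier ones, 0.0 for a full repeat."""
    current = ngrams(question, n)
    if not current:
        # Too short to form an n-gram; treated as maximally diverse (assumption).
        return 1.0
    worst_overlap = 0.0
    for previous in previous_questions:
        shared = len(current & ngrams(previous, n))
        worst_overlap = max(worst_overlap, shared / len(current))
    return 1.0 - worst_overlap


history = ["What does hope mean to you?"]
print(diversity_reward("What does love mean to you?", history))  # 0.75
print(diversity_reward("What does hope mean to you?", history))  # 0.0
```

Using the worst overlap rather than the average keeps one near-duplicate from being diluted by a long history of unrelated questions.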
90
-
91
- ### 7. Multi-Turn Context Management
92
-
93
- Lex interviews are 2-3 hours = ~50K-100K tokens. Training constraints:
94
- - `max_seq_length=4096` only captures ~5 minutes of conversation
95
- - Longer sequences → more VRAM, slower training
96
- - But shorter segments lose the conversational arc (topic bridging, callbacks)
97
-
98
- **Tradeoff:** Train on short segments (3-8 turns, ~2K tokens), rely on 1M inference context for long-arc skills. The model may not learn long-arc techniques like topic bridging from short training segments — but the base model's 1M context + Lex's patterns in the data may be enough.
99
-
100
- ### 8. Thinking Mode Interaction
101
-
102
- Nemotron 3 uses `<think>` token ID 12 and `</think>` token ID 13 (integer IDs, not text tokens like Qwen3.5).
103
-
104
- **Open questions:**
105
- - Should the model think before asking a question?
106
- - If yes, what should the thinking content look like? ("The guest just mentioned X, I should probe deeper on...")
107
- - If no, how do we suppress it without breaking the model?
108
- - Does the training data need `<think>...</think>` blocks?
109
-
110
- **Mitigation:** Test early — generate with thinking ON and OFF, compare question quality.
111
-
112
- ---
113
-
114
- ## Risk Mitigation: Day 1 Plan
115
-
116
- **Don't start the full data pipeline.** Instead, de-risk the two biggest unknowns:
117
-
118
- | # | Test | Time | What it tells us |
119
- |---|------|------|-----------------|
120
- | 1 | Download Nemotron 3 Nano 4B | 10 min | Verify it loads |
121
- | 2 | Unsloth SFT with 10 fake interview examples | 30 min | Does fine-tuning work end-to-end? |
122
- | 3 | Inspect LoRA targets on Mamba-2 layers | 5 min | How many layers actually get trained? |
123
- | 4 | Test speaker separation on 3 real Lex transcripts | 1 hour | Which diarization approach works? |
124
- | 5 | Generate 5 questions with fine-tuned model | 10 min | Does the model ask or answer? |
125
-
126
- If all 5 pass → proceed to full pipeline.
127
- If test 2 or 3 fail → evaluate fallback to full fine-tune or different base model.
128
- If test 4 fails → invest in LLM-based splitting.
129
-
130
- ---
131
-
132
- ## Challenge Resolution Log
133
-
134
- | Date | Challenge | Status | Resolution |
135
- |------|-----------|--------|------------|
136
- | 2026-03-19 | Speaker separation | ✅ Solved | lexfridman.com has official transcripts with `<span class="ts-name">` speaker labels. No diarization needed. |
137
- | 2026-03-19 | Mamba-2 + LoRA compatibility | ✅ Solved | Unsloth handles Mamba-2 LoRA automatically. 10.1M trainable params (0.38%). |
138
- | 2026-03-19 | Triton ptxas on Blackwell | ✅ Solved | Symlink system ptxas (CUDA 13.2) over Triton's bundled ptxas (CUDA 12.8): `ln -sf /usr/local/cuda/bin/ptxas /path/to/triton/backends/nvidia/bin/ptxas` |
139
- | 2026-03-19 | Python inference garbage | ✅ Workaround | Unsloth/transformers generation produces garbage on GB10. llama.cpp GGUF works perfectly (889 tok/s prompt, 53 tok/s gen). Use llama.cpp for all inference. |
140
- | 2026-03-19 | Model merge breaks Mamba | ✅ Confirmed | `merge_and_unload()` on Nemotron hybrid → garbage output. Keep LoRA separate, export to GGUF via Unsloth. |
141
- | 2026-03-19 | OOM during training | ✅ Solved | Caused by variable-length sequences in batch. Fix: measure actual token distribution, set max_seq_length to P100 + 10% buffer. |
142
- | 2026-03-19 | SFT worse than base | ⚠️ Open | SFT model (2.10/5) scored worse than base (4.35/5). Root cause under investigation — likely training data format or special token handling. |
143
- | 2026-03-19 | max_tokens for thinking | ✅ Solved | Nemotron/Gemini spend tokens on `<think>` reasoning. Must set max_tokens ≥ 800 to leave room for actual answer. |
144
- | 2026-03-19 | Inverted role training | ⚠️ Open | Not yet validated whether the model learned to ask (not answer). Need better eval after fixing inference path. |
145
- | 2026-03-23 | Off-policy GRPO fails | ❌ Failed | llama.cpp (base) generates completions, LoRA gets gradient updates — models are decoupled. LoRA diverges to gibberish after ~50 steps. See `docs/GRPO_V3_POSTMORTEM.md` for 6 identified gaps. |
146
- | 2026-03-23 | GGUF converter: NemotronH MoE | ✅ Workaround | `NemotronHConfig` has MoE defaults (`num_experts_per_tok=2`). Converter detects these and uses wrong architecture. Fix: set `num_experts_per_tok=0` in config.json + patch converter check to `> 0`. |
147
- | 2026-03-23 | Model merge → garbage | ✅ Confirmed | `merge_and_unload()` + GGUF export works mechanically, but GRPO v3 LoRA weights produce gibberish. This is a training failure, not a merge/export bug. |
148
-
149
- ## Key Lessons Added
150
-
151
- ### Eval-First Principle
152
- **Always run eval on the base model through the production inference pipeline BEFORE training.**
153
- We could have caught the Triton/Mamba inference issue in 5 minutes instead of spending hours training a model that couldn't generate.
154
-
155
- ### Data-Driven Hyperparameters
156
- Don't pick round numbers. Measure the data distribution and pick values that fit.
157
- - `max_seq_length`: P100 of actual token lengths + 10% buffer, aligned to 64
158
- - `max_tokens` for inference: must account for thinking budget
159
- - `batch_size`: profile VRAM at target sequence length, leave 30% headroom
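The `max_seq_length` rule above (P100 of token lengths, plus 10%, aligned to 64) is small enough to write down directly. A minimal sketch, assuming "aligned" means rounding up to the next multiple; the function name is hypothetical.

```python
import math


def pick_max_seq_length(token_lengths, buffer_pct: int = 10, multiple: int = 64) -> int:
    """Longest observed sequence plus a percentage buffer, rounded up to a multiple."""
    if not token_lengths:
        raise ValueError("token_lengths must not be empty")
    longest = max(token_lengths)  # P100 of the measured distribution
    padded = math.ceil(longest * (100 + buffer_pct) / 100)
    return math.ceil(padded / multiple) * multiple


# Token counts would come from tokenizing the real training set; these are made up.
print(pick_max_seq_length([120, 340, 407]))  # 448
print(pick_max_seq_length([500]))            # 576
```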
160
-
161
- ### Infrastructure Reliability Hierarchy (for Blackwell/DGX Spark)
162
- 1. **llama.cpp** — most reliable, NVIDIA-supported, works out of the box
163
- 2. **Unsloth training** — works for SFT/GRPO with ptxas fix
164
- 3. **Python inference** — broken for Mamba hybrid on this hardware, avoid
165
-
166
- ### Off-Policy RL is Fundamentally Broken for Hybrid Architectures
167
- Do NOT attempt off-policy GRPO where the generator and learner are different models. Specific gaps identified in GRPO v3:
168
- 1. **Off-policy generation** — generator (llama.cpp base) ≠ learner (HF + LoRA)
169
- 2. **No KL reference** — `-β * mean(log_probs)` is NOT a KL divergence
170
- 3. **Token truncation** — 512 max_length in forward pass vs 800 max_tokens in generation
171
- 4. **Mamba architecture** — LoRA can't modify SSM recurrence dynamics (A, D, conv1d)
172
- 5. **Thinking in training, not in reward** — gradient covers `<think>` tokens, reward ignores them
173
- 6. **No credit assignment** — uniform token weighting, no per-token reward signal
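Gap 2 can be made concrete with a toy example. A KL term compares the policy against a reference, while `-β * mean(log_probs)` depends on the policy alone, so it stays nonzero even when policy and reference are identical. The sketch below uses the simple log-ratio estimator on tokens sampled from the policy; the function names and the value of `beta` are illustrative.

```python
import math


def mean_logprob_penalty(policy_logps, beta: float) -> float:
    """The term used in GRPO v3: -beta * mean(log_probs). No reference model involved."""
    return -beta * sum(policy_logps) / len(policy_logps)


def kl_estimate(policy_logps, ref_logps) -> float:
    """Monte Carlo estimate of KL(policy || ref) from per-token log-probs of sampled tokens."""
    return sum(p - r for p, r in zip(policy_logps, ref_logps)) / len(policy_logps)


# Policy and reference assign identical probabilities to the sampled tokens.
policy = [math.log(0.5), math.log(0.25)]
reference = list(policy)

print(kl_estimate(policy, reference))             # 0.0: no divergence to penalize
print(mean_logprob_penalty(policy, beta=0.1))     # positive, despite zero divergence
```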
174
-
175
- See `docs/GRPO_V3_POSTMORTEM.md` for the full analysis.
176
-
177
- *To be updated as new challenges are resolved.*
178
-
179
- ---
180
-
181
- *Created: 2026-03-18*
 
docs/TECHNICAL_REVIEW_2026-03-23.md DELETED
@@ -1,205 +0,0 @@
1
- # Technical Review — Assumptions That Led to Off-Policy GRPO
2
-
3
- > **Date:** 2026-03-23
4
- > **Purpose:** Retrospective on two false assumptions that forced us into the broken off-policy architecture, and what alternatives actually existed.
5
-
6
- ---
7
-
8
- ## The Decision Chain
9
-
10
- ```
11
- GPU can't do autoregressive generation (PyTorch 2.9 + SM 12.1)
12
-
13
- "Only llama.cpp can generate" ← Assumption 1
14
-
15
- "Can't get gradients from llama.cpp" ← Assumption 2
16
-
17
- Off-policy GRPO: llama.cpp generates, HF model trains on those completions
18
-
19
- 6 critical gaps → LoRA diverges to gibberish
20
- ```
21
-
22
- Both assumptions were partially true but had alternatives we didn't explore.
23
-
24
- ---
25
-
26
- ## Assumption 1: "Only Unsloth/TRL exist for training"
27
-
28
- **What we believed:** HF transformers + Unsloth is the only viable training framework for Nemotron-3-Nano-4B. Since autoregressive generation produces garbage on GB10 (PyTorch 2.9 doesn't fully support SM 12.1), we can't do on-policy RL.
29
-
30
- **What we missed:**
31
-
32
- ### NVIDIA NeMo RL
33
- NVIDIA's own open-source RL training library, purpose-built for Nemotron models.
34
-
35
- - **Supports:** GRPO, DAPO, SFT, DPO, RM, on-policy distillation
36
- - **On-policy GRPO:** Uses vLLM for generation within the training loop — same model generates and trains (fixes Gap 1)
37
- - **Proper KL divergence:** Built-in reference policy (fixes Gap 2)
38
- - **Architecture-aware:** Designed for Nemotron's hybrid Mamba-Transformer-MoE architecture
39
- - **Single node support:** Can run on 1 GPU (our setup)
40
- - **URL:** https://docs.nvidia.com/nemo/rl/latest/
41
- - **GitHub:** https://github.com/NVIDIA-NeMo/RL
42
-
43
- **Key question (untested):** Does NeMo RL's vLLM path handle Mamba-2 autoregressive generation correctly on SM 12.1? If the CUDA kernel issue is PyTorch-specific and vLLM has its own kernels, NeMo RL could work end-to-end on our GB10.
44
-
45
- **Risk:** NeMo RL is designed for multi-GPU clusters. Single-GPU support exists but may have rough edges. Docker-based workflow could hit the same causal-conv1d/mamba-ssm build issues we saw before.
46
-
47
- ### NVIDIA's Own Training Pipeline
48
- The Nemotron-3 paper and blog describe their full training pipeline including RL. NVIDIA trained these models with NeMo RL internally. The training recipe, datasets, and configurations are published.
49
-
50
- **What this means:** The "right" way to RL-train Nemotron is NeMo RL, not a custom script on top of HF transformers.
51
-
52
- ---
53
-
54
- ## Assumption 2: "Can't use llama.cpp for training"
55
-
56
- **What we believed:** llama.cpp is inference-only. No gradient computation, no backprop. So we need a separate HF model for training.
57
-
58
- **What we missed:**
59
-
60
- ### llama.cpp `llama-finetune` (SFT LoRA training)
61
- llama.cpp has a built-in LoRA fine-tuning binary since 2023 (PR #2632).
62
-
63
- - **Binary exists:** `/home/bobber/llama.cpp/build/bin/llama-finetune` ✅
64
- - **Supports:** LoRA SFT with AdamW/SGD, learning rate scheduling, validation split, checkpointing
65
- - **Works on GGUF:** Trains directly on quantized GGUF models — no HF conversion needed
66
- - **On-policy by design:** The same model that generates during training is the one being updated (if we used it for on-policy generation)
67
-
68
- **Current status (tested):**
69
- ```
70
- llama-finetune -m nemotron-Q4_K_M.gguf -f train.txt -c 64
71
- → ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 18446744073709547520
72
- ```
73
-
74
- **Crashes** with a near-max uint64 buffer allocation. The finetune code doesn't properly handle NemotronH's Mamba-2 architecture — it computes buffer sizes based on standard transformer assumptions that break for SSM layers. This is a llama.cpp bug, not a fundamental limitation.
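The requested size is worth decoding: it is exactly 2^64 minus 4096, which is what -4096 looks like when stored in an unsigned 64-bit integer. That is consistent with a size computation that went negative and wrapped, though the source does not identify the exact code path. A quick check:

```python
requested = 18446744073709547520  # size from the llama-finetune error message
UINT64_MODULUS = 2 ** 64


def as_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed 64-bit integer."""
    return value - UINT64_MODULUS if value >= 2 ** 63 else value


print(UINT64_MODULUS - requested)  # 4096
print(as_signed_64(requested))     # -4096
```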
75
-
76
- **Viability assessment:**
77
- - ✅ Binary exists, CLI is full-featured
78
- - ✅ Would solve off-policy gap (SFT uses same model for forward+backward)
79
- - ❌ Crashes on NemotronH (buffer size computation bug for Mamba architecture)
80
- - ❌ SFT only — no GRPO/RL support
81
- - ⚠️ Even if fixed, SFT was insufficient in previous attempts (scored 2.0-3.2/5)
82
-
83
- ### llama.cpp LoRA at Inference Time
84
- llama.cpp supports loading LoRA adapters at runtime via `--lora` flag. This means:
85
- - Train a LoRA with any tool (HF, NeMo, etc.)
86
- - Convert to GGUF LoRA format
87
- - Load at inference time on top of the base GGUF
88
-
89
- **This could enable on-policy GRPO:** Generate with `base + current LoRA` via llama.cpp server, compute log-probs with HF, update LoRA. But converting LoRA to GGUF every step is expensive.
90
-
91
- ---
92
-
93
- ## What We Should Have Done
94
-
95
- ### Before writing custom GRPO code:
96
-
97
- 1. **Check NVIDIA's own training tools first.** NVIDIA published NeMo RL specifically for RL training of Nemotron models. We built a custom off-policy GRPO script on HF transformers instead of using the purpose-built tool.
98
-
99
- 2. **Test llama-finetune.** Even if it crashes on NemotronH now, knowing that would have informed the architecture decision. We could have filed a bug or patched the buffer computation.
100
-
101
- 3. **Test vLLM generation on GB10.** NeMo RL uses vLLM for generation. If vLLM's Mamba kernels work on SM 12.1, the entire on-policy GRPO pipeline works. We never tested this.
102
-
103
- 4. **Consider SFT distillation first.** The base model was already excellent (4.35/5). Instead of RL, we could have generated a large corpus of high-quality completions with the base model and done SFT distillation. This is what "Option B: SFT on curated completions" proposes, and it was always available.
104
-
105
- ### The root cause of the root cause:
106
- We jumped to "build custom GRPO" because we knew the components (llama.cpp for generation, HF for training) and it felt like a clever workaround for the GPU limitation. But we didn't step back and ask: **"How did NVIDIA train this model?"** — the answer (NeMo RL) was in their documentation all along.
107
-
108
- ---
109
-
110
- ## Updated Alternative Paths
111
-
112
- | Path | Fixes | Risk | Effort |
113
- |------|-------|------|--------|
114
- | **NeMo RL on-policy GRPO** | All 6 gaps | vLLM may not work on SM 12.1; Docker build issues | Medium-High |
115
- | **SFT on curated completions** | Gaps 1-3, 5-6 | SFT previously scored ≤3.2/5; may not beat base | Low |
116
- | **Fix llama-finetune for NemotronH** | Gap 1 (SFT only) | SFT limitation; patch effort unknown | Medium |
117
- | **On-policy GRPO with periodic merge** | Gap 1 | Merge→GGUF→reload cycle per N steps; slow | High |
118
- | **Ship base model** | N/A | Already 4.35/5; may be good enough | None |
119
- | **vLLM test on GB10** | Determines if NeMo RL viable | May crash like PyTorch did | Low (just a test) |
120
-
121
- ---
122
-
123
- ## vLLM Test Results (2026-03-23)
124
-
125
- **✅ vLLM 0.18.0 works on GB10 with Nemotron-3-Nano-4B!**
126
-
127
- ### Setup
128
- - vLLM 0.18.0, torch 2.10.0+cu130, transformers 5.3.0
129
- - `gpu_memory_utilization=0.3` (safe with ComfyUI using ~20 GB)
130
- - Model loaded in 7.43 GiB, CUDA graphs captured successfully
131
- - FlashAttention v2 backend selected automatically
132
-
133
- ### Compatibility Notes
134
- - vLLM 0.18.0 requires `transformers<5`, but Nemotron model requires `transformers>=5` (`TokenizersBackend` class). Installing transformers 5.3.0 over vLLM's constraint **works despite the pip warning**.
135
- - Same SM 12.1 PyTorch warning appears but doesn't block execution.
136
-
137
- ### Generation Quality
138
- The output is **coherent** — proper English, structured thinking, model understands the prompt. This confirms vLLM's Mamba-2 kernels work correctly on SM 12.1, unlike raw PyTorch autoregressive generation.
139
-
140
- ### Performance
141
- - **3.59 tok/s output** — significantly slower than llama.cpp (~60 tok/s Q8, ~100+ tok/s Q4)
142
- - Startup: ~205 seconds (model loading + torch.compile + CUDA graph capture)
143
- - This is BF16 (8 GB) vs llama.cpp's Q4 (2.9 GB), so memory bandwidth disadvantage explains much of the speed difference
144
-
145
- ### Implications for NeMo RL
146
- vLLM generation works on GB10 → **NeMo RL on-policy GRPO is likely viable on this hardware.** The slow generation speed (3.59 tok/s) means training would be much slower than llama.cpp-based generation, but it would be **correct** (on-policy, same model generates and trains).
147
-
148
- ### Remaining Question — ANSWERED ✅
149
-
150
- **Can vLLM generation + PyTorch training coexist on the same GPU?**
151
-
152
- **Yes — comfortably.** Tested 2026-03-23 with dual model loading:
153
-
154
- #### Test Setup
155
- - vLLM 0.18.0 in subprocess (`gpu_memory_utilization=0.3`, `max_model_len=1024`)
156
- - HF model (transformers 5.3.0 native NemotronH) in main process, full BF16
157
- - AdamW optimizer with full parameter training
158
- - Forward + backward + optimizer step, then vLLM generation while training model loaded
159
-
160
- #### Memory Results
161
-
162
- | Component | Allocated | Reserved |
163
- |-----------|-----------|----------|
164
- | vLLM subprocess (model + KV cache + CUDA graphs) | ~39 GB | (separate process) |
165
- | HF training model (BF16) | 5.27 GB | 5.32 GB |
166
- | Peak during backward | 10.62 GB | 20.15 GB |
167
- | After optimizer.step (states allocated) | 21.16 GB | 27.06 GB |
168
- | After zero_grad (steady state) | 15.89 GB | 27.06 GB |
169
-
170
- **Total estimated GPU usage: ~55 GB | Free: ~76 GB**
171
-
172
- With LoRA (realistic for NeMo RL): **~39 GB total | ~92 GB free**
173
-
174
- #### Key Observations
175
- 1. **vLLM runs in a separate subprocess** — its memory doesn't appear in the main process's `torch.cuda.memory_allocated()`. Both share the GPU via CUDA's unified memory management.
176
- 2. **vLLM generation works fine while training model holds 27 GB** — no interference, no OOM.
177
- 3. **Generation speed during concurrent load: 4.56 tok/s** (slightly faster than solo, likely CUDA graph warmup effect).
178
-
179
- #### Caveats
180
- - **Native transformers NemotronH has a config mismatch**: NVIDIA's `hybrid_override_pattern` uses dashes (`M-M-M-MM-...`) which the native `_pattern_to_list` doesn't handle. After patching dash handling, the pattern produced 25 layers instead of 42, causing MISSING/UNEXPECTED weight warnings and wrong loss (13.16 vs expected ~3-4). For real training, either:
181
- - Fix the config conversion properly for native transformers, or
182
- - Use `trust_remote_code=True` with a pure PyTorch fallback for `causal_conv1d` (no CUDA kernel needed for training, only the forward/backward math)
183
- - **GB10's nvidia-smi returns [N/A]** for memory stats, so cross-process GPU memory can't be directly measured — estimates based on vLLM's 0.3 utilization setting.
184
-
185
- #### Verdict
186
- Memory is **not a bottleneck** for on-policy GRPO on GB10. The open questions are now:
187
- 1. Can NeMo RL be installed on GB10? (Docker vs native)
188
- 2. Does NeMo RL's training loop work with the native transformers NemotronH, or does it require the custom code?
189
- 3. What's the end-to-end training throughput given vLLM's ~3.5 tok/s generation speed?
190
-
191
- ---
192
-
193
- ## Lessons
194
-
195
- 1. **Check the vendor's tools before building custom.** NVIDIA published NeMo RL for exactly this use case. We reinvented a broken wheel.
196
-
197
- 2. **"Can't do X" should trigger "how does the vendor do X?"** — not "let me build a workaround for X."
198
-
199
- 3. **Test alternatives before committing to workarounds.** We spent 10+ hours on off-policy GRPO instead of 15 minutes testing vLLM or llama-finetune.
200
-
201
- 4. **The cleverest workaround is often the wrongest approach.** Off-policy GRPO felt smart — two models collaborating! But it violated fundamental RL principles.
202
-
203
- ---
204
-
205
- *Created: 2026-03-23*
 
docs/TRAINING_PLAN_V5.md DELETED
@@ -1,143 +0,0 @@
1
- # Training Plan V5 — Beat Base Model (7.12/10)
2
-
3
- **Goal:** Nemotron 4B fine-tuned > 7.12/10
4
- **Updated:** 2026-03-26 UTC
5
- **Status:** GRPO v5 ready to run — all infrastructure proven
6
-
7
- ---
8
-
9
- ## Current Baseline
10
-
11
- Every fine-tuning attempt so far has scored below base:
12
-
13
- | Approach | Score | Why it failed |
14
- |---|---|---|
15
- | LoRA SFT (v1, v2) | 2.00–2.10/5 | LoRA targets only the 4 attention layers; the other 38 layers (including all Mamba layers) are untouched |
16
- | Full SFT (v1) | 2.00/5 | Format mismatch — think-tags buried output |
17
- | Full SFT (v2, v3) | 3.20/5 | Contaminated data (`user\n` prefixes, low question rate) |
18
- | SFT v4 Triton (LoRA) | 5.36/10 | Surface pattern-matching, no reasoning preserved |
19
- | GRPO v3 (off-policy) | gibberish | Generator ≠ learner — 6 critical gaps |
20
- | GRPO v4 (PyTorch mock) | oscillating loss | rmsnorm mock silently broken |
21
-
22
- **Base model: 7.12/10** — uses `<think>` to construct precise, contextual questions.
23
-
24
- ---
25
-
26
- ## GRPO v5 — The Current Approach
27
-
28
- All 6 gaps from the v3 postmortem are fixed:
29
-
30
- ### Architecture
31
- ```
32
- Per step:
33
- 1. generate_cached_batch(model, n=4) ← 1× prefill, 4 parallel decodes
34
- 2. reward_fn() for each completion ← heuristic: question?, references guest?, brevity?, etc.
35
- 3. GRPO normalize advantages within group
36
- 4. policy log-probs: model(comp) ← with LoRA enabled
37
- 5. ref log-probs: model(comp) ← with LoRA disabled (free, no second model)
38
- 6. loss = -adv * log_prob + kl_coef * KL
39
- 7. backward() through Triton SSM kernel ← all 42 layers trained
40
- ```
41
-
42
- ### Key design choices
43
- - **On-policy**: `generate_cached()` uses the same patched Mamba model (Triton kernel) that computes gradients. Zero off-policy gap.
44
- - **KL without second model**: Disabling PEFT adapter layers = base model forward. No extra memory.
45
- - **Batched generation**: `generate_cached_batch(n=4)` — one prefill shared across 4 completions. ~4× faster generation than serial calls.
46
- - **Vectorized SSM state save**: Final state computed as a single tensor op (reverse cumsum), no Python loop across sequence length.
47
- - **Triton backward**: Full gradient through all 21 Mamba layers + 4 Attention layers.
48
-
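The per-step recipe above can be sketched in plain Python. This is a minimal illustration and not the contents of `grpo_v5_train.py`: the function names, the `eps` term, and the simple log-ratio KL estimate are assumptions, and scalar sequence log-probs stand in for per-token tensors.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group (zero mean, unit scale)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def grpo_loss(advantages, policy_logps, ref_logps, kl_coef=0.02):
    """Group mean of -adv * log_prob + kl_coef * KL.

    KL is approximated per completion by the log-ratio policy - reference,
    which is an assumption made for this sketch.
    """
    total = 0.0
    for adv, logp, ref_logp in zip(advantages, policy_logps, ref_logps):
        kl = logp - ref_logp
        total += -adv * logp + kl_coef * kl
    return total / len(advantages)

# Four completions for one prompt (num_generations = 4), made-up numbers.
rewards = [0.2, 0.9, 0.5, 0.4]
adv = group_advantages(rewards)
loss = grpo_loss(
    adv,
    policy_logps=[-42.0, -35.5, -40.1, -39.0],  # LoRA enabled
    ref_logps=[-41.0, -36.0, -40.0, -39.5],     # LoRA disabled
)
```

If every completion in a group receives the same reward, all advantages are zero and only the KL term contributes, so such a group carries no policy signal.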
49
- ### Config
50
- ```python
51
- lr = 5e-5
52
- kl_coef = 0.02
53
- total_steps = 500
54
- num_generations = 4 # per prompt
55
- gen_max_tokens = 300
56
- lora_rank = 64
57
- lora_alpha = 256
58
- ```
59
-
60
- ### Expected performance
61
- - **Generation**: ~15-20s (batched prefill + parallel decode, ~16 tok/s cached)
62
- - **Triton backward**: ~9.5s/step
63
- - **Total step**: ~25-40s (vs 464s in earlier uncached version, vs 95s in serial cached version)
64
- - **500 steps**: ~4-6 hours
65
-
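The estimates above can be cross-checked with a few lines of arithmetic. This is a sketch with hypothetical helper names. Generation plus backward accounts for roughly 24.5 to 29.5 s, so the quoted 25-40 s range presumably leaves headroom for reward scoring and the log-prob passes.

```python
def step_seconds(gen_s, backward_s=9.5):
    """Generation time plus Triton backward time for one step."""
    return gen_s + backward_s

def run_hours(steps, seconds_per_step):
    """Wall-clock hours for a run at a fixed step time."""
    return steps * seconds_per_step / 3600.0

fast = step_seconds(15)           # lower generation estimate
slow = step_seconds(20)           # upper generation estimate
best_case_h = run_hours(500, 25)  # about 3.5 hours
worst_case_h = run_hours(500, 40) # about 5.6 hours
```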
66
- ---
67
-
68
- ## Decision Tree Post-Run
69
-
70
- ```
71
- GRPO v5 result
72
- ├── Score > 7.12/10 → ✅ Beat base model! Run full 500 steps, publish.
73
- ├── Score 6.5–7.12 → Good progress. Add thinking-chain data (see Phase 2).
74
- ├── Score 5.5–6.5 → Similar to SFT. GRPO signal is weak. Try Phase 2 data.
75
- └── Score < 5.5 → Something broken. Debug reward/KL balance.
76
- ```
77
-
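The decision tree can also be written as a small function for use in an eval script. This is a sketch: the function name is hypothetical, and how the exact boundary values (6.5 and 5.5) are assigned is an assumption, since the tree above does not say which branch owns them.

```python
BASE_SCORE = 7.12  # base model score on the 10-point eval

def next_action(score: float) -> str:
    """Map a GRPO v5 eval score to the follow-up listed in the decision tree."""
    if score > BASE_SCORE:
        return "beat base: run full 500 steps, publish"
    if score >= 6.5:
        return "good progress: add thinking-chain data (Phase 2)"
    if score >= 5.5:
        return "weak GRPO signal: try Phase 2 data"
    return "something broken: debug reward/KL balance"
```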
78
- ---
79
-
80
- ## Phase 2: Thinking-Chain Data (if GRPO v5 insufficient)
81
-
82
- The base model's edge is `<think>` reasoning. Teach the fine-tuned model to reason:
83
-
84
- 1. Run base model on all 7,580 training scenarios via `generate_cached`
85
- 2. Collect `<think>...</think>[question]` responses
86
- 3. Filter: keep responses scoring ≥ 7/10 on 10-score eval
87
- 4. SFT on these thinking-chain examples — distillation from base model's own best outputs
88
-
89
- This is fundamentally different from previous SFTs (which used human transcripts). The training data would be base model completions with explicit reasoning chains — teaching the fine-tuned model *how* to think, not just what format to produce.
90
-
91
- **Estimated corpus**: ~3,000-5,000 examples from 7,580 scenarios
92
- **Training time**: 3,000 steps × 9.5s = ~8 hours
93
-
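Steps 2 and 3 of the list above (collect, then filter) might look like the following. It is a sketch only: the `Completion` type and `keep_for_sft` are hypothetical names, and it assumes each response carries a score from the 10-point eval.

```python
from dataclasses import dataclass

@dataclass
class Completion:
    text: str     # raw model output, expected shape: <think>...</think>question
    score: float  # score from the 10-point eval

def keep_for_sft(c: Completion, min_score: float = 7.0) -> bool:
    """Keep responses that close their thinking block, ask something, and score well."""
    _, closed, question = c.text.partition("</think>")
    return bool(closed) and bool(question.strip()) and c.score >= min_score

samples = [
    Completion("<think>guest mentioned scaling</think>What breaks first at scale?", 8.1),
    Completion("<think>never closes", 9.0),             # dropped: no </think>
    Completion("<think>ok</think>Tell me more.", 5.5),  # dropped: score below 7
]
kept = [c for c in samples if keep_for_sft(c)]
```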
94
- ---
95
-
96
- ## Phase 3: Full Dataset SFT Baseline (optional)
97
-
98
- Previous SFTs used only 1,000 samples. 7,580 are available.
99
-
100
- ```python
101
- SAMPLES = 7580
102
- EPOCHS = 2
103
- LR = 1e-4
104
- BATCH = 1
105
- GRAD_ACCUM = 4
106
- ```
107
-
108
- - **Steps**: 3,790 (2 epochs)
109
- - **Time**: ~10 hours at 9.5s/step
110
- - **Expected**: 6.0–6.5/10 (extrapolating from full-sft-v4/v5 pattern)
111
-
112
- Run this as a parallel experiment to understand data volume impact independently of GRPO.
113
-
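The step count and runtime follow directly from the config. A minimal sketch, with a hypothetical helper name, assuming one optimizer step per `BATCH * GRAD_ACCUM` samples:

```python
import math

def optimizer_steps(samples, epochs, batch, grad_accum):
    """Number of optimizer updates for the whole run."""
    return math.ceil(samples * epochs / (batch * grad_accum))

steps = optimizer_steps(samples=7580, epochs=2, batch=1, grad_accum=4)
hours = steps * 9.5 / 3600  # at 9.5 s per step
```

With these values `steps` is 3,790 and `hours` comes out at almost exactly 10.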
114
- ---
115
-
116
- ## Infrastructure Status
117
-
118
- | Component | Status | Notes |
119
- |---|---|---|
120
- | Triton fwd kernel | ✅ Working | 10.6× forward speedup |
121
- | Triton bwd kernel | ✅ Working | 5.7× end-to-end speedup, 9.5s/step |
122
- | SSM state save (vectorized) | ✅ Working | No Python loop, pure tensor op |
123
- | Cached generation | ✅ Working | ~16 tok/s, O(1) per token |
124
- | Batched generation (n=4) | ✅ Working | 1× prefill for n completions |
125
- | GRPO v5 training loop | ✅ Implemented | `grpo_v5_train.py` |
126
- | 10-score eval | ✅ Working | `scripts/eval_v2.py` |
127
- | GGUF conversion + eval | ✅ Working | `merge_and_eval.py` |
128
-
129
- ---
130
-
131
- ## Timeline
132
-
133
- | Step | Time | Outcome |
134
- |---|---|---|
135
- | GRPO v5 smoke test (5 steps) | ~5 min | Verify step time, reward signal, loss stability |
136
- | GRPO v5 eval checkpoint (50 steps) | ~30 min | Merge → GGUF → 10-score eval |
137
- | GRPO v5 full run (500 steps) | ~4-6 hours | Main result |
138
- | If needed: thinking-chain data gen | ~3 hours | 7,580 base model completions |
139
- | If needed: thinking-chain SFT | ~8 hours | Phase 2 fine-tune |
140
-
141
- ---
142
-
143
- *Updated: 2026-03-26 | Previous: `docs/CURRENT_STATE_2026-03-23.md`*
 
docs/TRITON_PIPELINE_FIX.md DELETED
@@ -1,137 +0,0 @@
1
- # Triton Pipeline Fix — NemotronH HF Generation
2
-
3
- Date: 2026-03-27
4
- Status: ✅ Fixed and verified
5
-
6
- ---
7
-
8
- ## Problem
9
-
10
- HF model generation with `generate_cached` produced degenerate output:
11
- - Max logit divergence of 10+ between cached and uncached decode
12
- - Tokens like `│││││` and `«»«««` repeated endlessly
13
- - `</think>` never closed naturally
14
- - Outputs completely incoherent after a few tokens
15
-
16
- Previously attributed to numerical issues (BF16 SSM precision), but those were fixed.
17
- The real bug was elsewhere.
18
-
19
- ---
20
-
21
- ## Diagnosis
22
-
23
- Systematic elimination:
24
-
25
- | Component | Test | Result |
26
- |---|---|---|
27
- | Prefill formula | vs `mamba_chunk_scan_combined` | ✅ 0.04% error (floating point) |
28
- | Decode formula | vs `selective_state_update`, 100 steps | ✅ 0.000% error |
29
- | Layer-by-layer hidden states | cached vs uncached | ❌ Diverges at layer 12 (max logit diff 10.89) |
30
-
31
- Layer 12 is the **first attention layer**. Layers 0-11 (all Mamba) were fine.
32
-
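The layer-by-layer comparison boils down to finding the first index where the two runs disagree beyond noise. The sketch below uses plain lists of floats in place of real activations. The function name and the tolerance of 1.0 are assumptions, not values taken from the debugging session.

```python
def first_diverging_layer(cached, uncached, tol=1.0):
    """Return (layer_index, max_abs_diff) for the first layer exceeding tol, else None."""
    for idx, (a, b) in enumerate(zip(cached, uncached)):
        diff = max(abs(x - y) for x, y in zip(a, b))
        if diff > tol:
            return idx, diff
    return None

# Toy data: 14 layers that agree until layer 12, then stay wrong.
uncached = [[0.1 * i, 0.2 * i] for i in range(14)]
cached = [row[:] for row in uncached]
cached[12] = [uncached[12][0] + 10.89, uncached[12][1]]
cached[13] = [uncached[13][0] + 11.0, uncached[13][1]]

result = first_diverging_layer(cached, uncached)
```

Because errors cascade, only the first failing index is informative. Every later layer inherits the corrupted input.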
33
- ---
34
-
35
- ## Root Cause
36
-
37
- `NemotronHBlock.forward` for attention layers called:
38
-
39
- ```python
40
- # BROKEN (model source code bug):
41
- hidden_states = self.mixer(
42
- hidden_states,
43
- cache_position=cache_position
44
- # past_key_value NOT PASSED
45
- )
46
- ```
47
-
48
- `NemotronHAttention.forward` requires `past_key_value` for KV caching. Without it,
49
- every decode token ran attention over **only itself** — single-token context with no
50
- history from prefill. This produced garbage at layers 12, 17, 24, 32 (the 4 attention
51
- layers) and cascaded through all 42 layers.
52
-
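A toy single-head attention shows why a missing cache is so destructive. This sketch is illustrative only (no scaling, no masking, invented vectors) and is unrelated to the actual `NemotronHAttention` code.

```python
import math

def attend(query, keys, values):
    """Softmax-weighted sum of values for one query vector."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / norm for d in range(dim)]

# Keys and values produced during prefill (the history the cache should hold).
prefill_keys = [[1.0, 0.0], [0.0, 1.0]]
prefill_values = [[10.0, 0.0], [0.0, 10.0]]

# One decode token.
new_q, new_k, new_v = [1.0, 0.0], [0.5, 0.5], [1.0, 1.0]

with_cache = attend(new_q, prefill_keys + [new_k], prefill_values + [new_v])
without_cache = attend(new_q, [new_k], [new_v])  # token attends only to itself
```

Without the cache the softmax has a single entry, so the output is just the token's own value vector, whatever the prompt contained.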
53
- ---
54
-
55
- ## Fix
56
-
57
- Added `_patch_attention_block()` to `tests/validate_correct_scan.py`.
58
- Called automatically from `patch_mamba_layers()`.
59
-
60
- ```python
61
- def _patch_attention_block(layer):
62
- def patched_forward(self, hidden_states, cache_params=None,
63
- cache_position=None, attention_mask=None):
64
- ...
65
- hidden_states = self.mixer(
66
- hidden_states,
67
- past_key_value=cache_params, # ← THE FIX
68
- cache_position=cache_position,
69
- use_cache=(cache_params is not None),
70
- )
71
- hidden_states = hidden_states[0]
72
- ...
73
- layer.forward = patched_forward.__get__(layer, layer.__class__)
74
- ```
75
-
76
- `HybridMambaAttentionDynamicCache` already inherits `DynamicCache.update()` which
77
- correctly maintains `key_cache[layer_idx]` / `value_cache[layer_idx]`.
78
-
79
- ---
80
-
81
- ## Result
82
-
83
- | Metric | Before Fix | After Fix |
84
- |---|---|---|
85
- | Max logit diff (cached vs uncached) | 10.89 | 0.23 |
86
- | First diverging layer | 12 | None (all within BF16 noise) |
87
- | Generation quality | Degenerate (repeating tokens) | Clean, coherent Lex-style questions |
88
- | `</think>` closing | Never (corrupted states) | Naturally at 200-3000 tokens |
89
-
90
- ---
91
-
92
- ## Complete Working Pipeline
93
-
94
- ```python
95
- import torch
- from transformers import AutoModelForCausalLM, AutoTokenizer
96
- from tests.validate_correct_scan import patch_mamba_layers
97
- from ssm_generate import generate_cached
98
-
99
- model = AutoModelForCausalLM.from_pretrained(
100
- 'models/NVIDIA-Nemotron-3-Nano-4B',
101
- torch_dtype=torch.bfloat16,
102
- trust_remote_code=True,
103
- device_map='cuda'
104
- )
105
-
106
- # Apply all fixes in one call:
107
- # 1. Triton SSM scan for prefill (training + fast inference)
108
- # 2. fp32 decode step (matches llama.cpp/vLLM precision)
109
- # 3. Attention KV cache fix (passes past_key_value to NemotronHAttention)
110
- patch_mamba_layers(model, use_triton=True)
111
- model.eval()
112
-
113
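- # Generate (now works correctly). Assumes `tokenizer` was loaded via
- # AutoTokenizer.from_pretrained(...) and `input_ids` holds the tokenized prompt.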
114
- out = generate_cached(model, tokenizer, input_ids, max_new_tokens=1500)
115
- ```
116
-
117
- ---
118
-
119
- ## Why vLLM Was the Workaround
120
-
121
- vLLM's `nemotron_h.py` backend reimplements the attention forward correctly
122
- with its own KV cache management. It never hit this bug because it replaced
123
- the entire `NemotronHBlock.forward` with its own implementation.
124
-
125
- With this fix, the HF model + Triton pipeline produces equivalent output to vLLM
126
- for standard generation tasks.
127
-
128
- ---
129
-
130
- ## Verified Against vLLM
131
-
132
- Both HF (patched) and vLLM now produce:
133
- - Clean Lex-style interviewer questions
134
- - `</think>` closes in 200-3000 tokens depending on prompt
135
- - Consistent quality across diverse guests and topics
136
-
137
- The HF model is now suitable for on-policy GRPO generation without vLLM dependency.
 
docs/TRITON_SSM_SCAN_PLAN.md DELETED
@@ -1,114 +0,0 @@
1
- # Triton SSM Scan Kernel for SM 12.x (GB10 Blackwell)
2
-
3
- **Status:** ✅ SUPERSEDED — Custom Triton kernel no longer needed.
4
- **Date superseded:** 2026-03-30
5
- **Original completion:** 2026-03-25/26
6
-
7
- ---
8
-
9
- ## ⚠️ SUPERSEDED — Read This First
10
-
11
- The custom Triton SSM scan kernel was built to work around the inability to use compiled `mamba_ssm` extensions on GB10 (SM 12.1). **That blocker no longer exists.**
12
-
13
- As of 2026-03-30, we successfully compiled `mamba_ssm 2.3.1` from source against `torch 2.10.0+cu130` in `.venv-train`. The compiled Triton kernels (`mamba_chunk_scan_combined`, `selective_state_update`) now run natively on GB10.
14
-
15
- ### Validation (2026-03-30)
16
-
17
- ```
18
- Seq=512, A=-0.1 (moderate): nan=False, inf=False, max=155 ✅
19
- Seq=512, A=-10 (extreme): nan=False, inf=False, max=200 ✅
20
- Forward + backward: 263 params with gradients, grad_norm=0.0212 ✅
21
- ```
22
-
23
- No BF16 underflow issues observed. The compiled kernel uses chunked scan with internal float32 accumulation for the state update, preventing the catastrophic underflow that affected the naive PyTorch implementation.
24
-
25
- ### What Was Wrong Originally
26
-
27
- The original GB10 problem (2026-03-24):
28
- - Pre-compiled `causal_conv1d` and `mamba_ssm` `.so` files targeted SM ≤ 12.0
29
- - Our aarch64 binaries had wrong architecture or ABI mismatches
30
- - Workaround: custom sequential scan + custom Triton kernel
31
-
32
- ### What Fixed It
33
-
34
- ```bash
35
- # In .venv-train (Python 3.12, torch 2.10+cu130):
36
- pip install git+https://github.com/state-spaces/mamba.git \
37
- --no-build-isolation --force-reinstall --no-deps
38
- ```
39
-
40
- Compiling from source lets the build system detect SM 12.1 and generate correct PTX/SASS via Triton JIT. The key was:
41
- 1. Having the right torch (CUDA build, matching ABI)
42
- 2. Compiling mamba_ssm from source (not using pre-built wheels)
43
- 3. Setting `LD_LIBRARY_PATH` so `libc10.so` is found by the dynamic linker
44
-
45
- ---
46
-
47
- ## Original Plan (Archived for Reference)
48
-
49
- The original plan was to build a custom Triton kernel because:
50
- - `mamba_ssm` compiled extensions failed on GB10
51
- - The pure-PyTorch sequential scan was correct but slow (~55s/step)
52
- - Triton JIT compiles at runtime → supports any SM including 12.x
53
-
54
- **This plan was executed and produced a working kernel (5.7x speedup)** stored in `ssm_scan_triton.py` and `tests/validate_correct_scan.py`. The `patch_mamba_layers()` function in `validate_correct_scan.py` uses this custom kernel.
55
-
56
- ### Current Recommendation
57
-
58
- **Do NOT use `patch_mamba_layers()` for new training runs.** Use the compiled `mamba_ssm` directly via `.venv-train`. The custom patch was a workaround that is now unnecessary and adds complexity.
59
-
60
- The `patch_mamba_layers()` code remains available for:
61
- - Debugging/comparison purposes
62
- - Environments where compiling mamba_ssm from source is not possible
63
-
64
- ---
65
-
66
- ## What We Learned
67
-
68
- ### 1. Pre-compiled wheels are architecture/ABI specific
69
- PyPI wheels for `mamba_ssm` and `causal_conv1d` are compiled against specific SM targets and torch ABI versions. When torch version changes or GPU architecture is new, they break silently.
70
-
71
- **Rule:** Always compile `mamba_ssm` from source in new environments (`--no-build-isolation`).
72
-
73
- ### 2. LD_LIBRARY_PATH is the key unlock
74
- `selective_scan_cuda.cpython-312-aarch64-linux-gnu.so` links against `libc10.so` which lives inside `torch/lib/`. Without LD_LIBRARY_PATH pointing there, the import fails with "undefined symbol" — even though the .so exists.
75
-
76
- ```bash
77
- TORCH_LIB=$(python3 -c "import torch; from pathlib import Path; print(Path(torch.__file__).parent/'lib')")
78
- export LD_LIBRARY_PATH="$TORCH_LIB:${LD_LIBRARY_PATH:-}"
79
- ```
80
-
81
- ### 3. The BF16 underflow problem was real but already solved
82
- Our original `ssm_scan_correct.py` fix was correct: `exp(cumsum(A))` in BF16 over long sequences → catastrophic underflow. The compiled `mamba_chunk_scan_combined` kernel handles this via chunked computation with fp32 accumulation at chunk boundaries. Verified on GB10 with extreme A=-10, seq_len=512: no NaN/Inf.
83
-
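The failure mode is easy to reproduce in miniature. The sketch below uses Python's float64 rather than BF16 (which underflows far sooner) and does not reproduce the compiled kernel. It only shows why exponentiating a long cumulative sum fails while exponentiating a difference of cumulative sums does not.

```python
import math
from itertools import accumulate

A = [-10.0] * 512          # extreme decay, seq_len = 512
cum = list(accumulate(A))  # cum[t] = A[0] + ... + A[t]

# Naive: exponentiate the full cumulative sum. exp(-5120) underflows to 0.0.
naive_last = math.exp(cum[-1])

def decay(s, t):
    """Decay factor from step s to step t, computed from a difference of sums."""
    return math.exp(cum[t] - cum[s])

adjacent = decay(510, 511)  # exp(-10), perfectly representable
```

Chunked kernels rely on the same idea: the differences stay small within a chunk, so the exponentials remain in range.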
84
- ### 4. Custom Triton kernel still valuable for pure-PyTorch environments
85
- If you can't compile mamba_ssm from source (e.g., no nvcc, CI environment), the custom `patch_mamba_layers()` approach works. But for production training, the compiled kernel is faster and more battle-tested.
86
-
87
- ---
88
-
89
- ## Performance Comparison (Final)
90
-
91
- | Approach | Speed | Correctness | Complexity |
92
- |----------|-------|-------------|------------|
93
- | PyTorch sequential scan (original workaround) | ~55s/step | ✅ | Low |
94
- | Custom Triton kernel (`patch_mamba_layers`) | ~9.5s/step | ✅ | High |
95
- | **Compiled mamba_ssm (current)** | **~8.5s/step** | **✅** | **Low** |
96
-
97
- Compiled mamba_ssm is the clear winner: slightly faster than custom Triton, same correctness, zero custom code to maintain.
98
-
99
- ---
100
-
101
- ## Files (Status)
102
-
103
- | File | Status | Notes |
104
- |------|--------|-------|
105
- | `ssm_scan_correct.py` | Archive | Original sequential scan — reference only |
106
- | `ssm_scan_triton.py` | Archive | Custom Triton kernel — workaround, no longer needed |
107
- | `tests/validate_correct_scan.py` | Keep | Contains `patch_mamba_layers()` for legacy use |
108
- | `ssm_scan_backward.py` | Archive | Custom backward — superseded |
109
- | `.venv-train` | Active | Compiled mamba_ssm — USE THIS |
110
-
111
- ---
112
-
113
- *Original plan created: 2026-03-25*
114
- *Superseded: 2026-03-30 — compiled mamba_ssm works natively on GB10*
 
docs/VENV_SETUP.md DELETED
@@ -1,118 +0,0 @@
1
- # Venv Setup — Lex Fridman Interviewer Project
2
-
3
- Updated: 2026-03-30
4
-
5
- ## The Three Venvs
6
-
7
- ### `.venv-train` — SFT Training (PRIMARY)
8
-
9
- ```bash
10
- # Activate with required LD_LIBRARY_PATH:
11
- TORCH_LIB=/home/bobber/lex-ft/.venv-train/lib/python3.12/site-packages/torch/lib
12
- export LD_LIBRARY_PATH="$TORCH_LIB:${LD_LIBRARY_PATH:-}"
13
- source /home/bobber/lex-ft/.venv-train/bin/activate
14
- ```
15
-
16
- | Package | Version | Notes |
17
- |---------|---------|-------|
18
- | Python | 3.12.3 | |
19
- | torch | 2.10.0+cu130 | CUDA enabled, installed with `--index-url https://download.pytorch.org/whl/cu130` |
20
- | unsloth | 2026.3.17 | 2x faster training, memory optimizations |
21
- | mamba_ssm | 2.3.1 | **Compiled from source** against torch 2.10 — real Triton kernels |
22
- | transformers | 4.48.3 | |
23
- | datasets | 4.8.4 | |
24
- | accelerate | 1.13.0 | |
25
- | wandb | 0.25.1 | |
26
- | trl | 0.15.2 | |
27
-
28
- **Why LD_LIBRARY_PATH:** mamba_ssm's compiled `.so` links against `libc10.so` which is inside `torch/lib/`. The system linker doesn't find it without this path set.
29
-
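When training is launched from a Python wrapper instead of a shell, the same path can be prepared for a child process. A minimal sketch with a hypothetical helper name. Note that the dynamic linker typically reads `LD_LIBRARY_PATH` at process start, so the value should be passed to a new process (for example through the `env=` argument of `subprocess.run`), not set inside one that is already running.

```python
import os

def with_torch_lib(torch_lib: str, current: str = "") -> str:
    """Return an LD_LIBRARY_PATH value with torch_lib first and no duplicates."""
    parts = [p for p in current.split(os.pathsep) if p]
    if torch_lib not in parts:
        parts.insert(0, torch_lib)
    return os.pathsep.join(parts)

torch_lib = "/home/bobber/lex-ft/.venv-train/lib/python3.12/site-packages/torch/lib"
child_env = dict(os.environ)
child_env["LD_LIBRARY_PATH"] = with_torch_lib(
    torch_lib, child_env.get("LD_LIBRARY_PATH", "")
)
```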
30
- **Why NOT to use the routangseng venv for training:** routangseng/.venv runs Python 3.13 with a broken mamba_ssm (importing `selective_scan_cuda` fails with an undefined-symbol error caused by an ABI mismatch).
31
-
32
- ### `.venv-vllm` — Inference & Evaluation
33
-
34
- ```bash
35
- source /home/bobber/lex-ft/.venv-vllm/bin/activate
36
- ```
37
-
38
- | Package | Version | Notes |
39
- |---------|---------|-------|
40
- | Python | 3.12 | |
41
- | torch | 2.9+cu130 | |
42
- | vllm | 0.18.0 | Fast batch inference |
43
- | transformers | 5.3.0 | Has NemotronH built-in |
44
- | peft | 0.18.1 | For LoRA inference |
45
- | sentence_transformers | 5.3.0 | For embedding similarity |
46
- | datasets | | |
47
- | wandb | | |
48
-
49
- **Use for:** vLLM inference, `eval_functional_judge.py`, `judge_vllm.py`, `augment_with_base_model.py`
50
-
51
- **Note:** transformers was manually upgraded from 4.57.6 to 5.3.0 (`pip install transformers==5.3.0`). vllm warns about the incompatibility but works.
52
-
53
- ### `routangseng/.venv` — Legacy / Qwen
54
-
55
- ```bash
56
- source /home/bobber/routangseng-ft/.venv/bin/activate
57
- ```
58
-
59
- Used for: Qwen3.5-4B inference (judge), old SFT scripts. Not recommended for Nemotron SFT.
60
-
61
- ---
62
-
63
- ## Common Issues & Fixes
64
-
65
- ### ImportError: selective_scan_cuda undefined symbol
66
- **Cause:** mamba_ssm `.so` compiled against different torch ABI
67
- **Fix:** Recompile from source: `pip install git+https://github.com/state-spaces/mamba.git --no-build-isolation --force-reinstall --no-deps`
68
-
69
- ### PermissionError: /home/bobber/.cache/huggingface/modules/...
70
- **Cause:** Root-owned HF modules cache
71
- **Fix:** Set `HF_MODULES_CACHE=/home/bobber/lex-ft/.cache/hf_modules` before any HF imports
72
-
73
- ### torch.cuda.is_available() = False despite CUDA torch installed
74
- **Cause:** Missing `LD_LIBRARY_PATH` for `libcuda.so`/`libc10.so`
75
- **Fix:** Set `LD_LIBRARY_PATH` to torch's lib directory before running
76
-
77
- ### AttributeError: 'list' object has no attribute 'keys' (NemotronH + transformers)
78
- **Cause:** `_get_tied_weight_keys` bug in transformers for NemotronH
79
- **Fix:** Monkey-patch before model load (see `train_sft_v5.py` top of file)
80
-
81
- ### 'qwen3_5' KeyError in transformers 4.57.6
82
- **Cause:** transformers 4.57.6 doesn't know Qwen3.5 architecture
83
- **Fix:** Use routangseng/.venv (transformers 5.3.0) or .venv-vllm (upgraded to 5.3.0)
84
-
85
- ---
86
-
87
- ## Rebuild .venv-train From Scratch
88
-
89
- ```bash
90
- # 1. Create venv
91
- python3.12 -m venv /home/bobber/lex-ft/.venv-train
92
-
93
- # 2. Install CUDA torch
94
- .venv-train/bin/pip install torch==2.10.0+cu130 \
95
- --index-url https://download.pytorch.org/whl/cu130
96
-
97
- # 3. Set LD_LIBRARY_PATH for subsequent installs
98
- export LD_LIBRARY_PATH="/home/bobber/lex-ft/.venv-train/lib/python3.12/site-packages/torch/lib:${LD_LIBRARY_PATH:-}"
99
-
100
- # 4. Install training stack
101
- .venv-train/bin/pip install unsloth transformers datasets accelerate trl wandb
102
-
103
- # 5. Compile mamba_ssm from source (needed for NemotronH Mamba-2 layers)
104
- .venv-train/bin/pip install git+https://github.com/state-spaces/mamba.git \
105
- --no-build-isolation --force-reinstall --no-deps
106
-
107
- # 6. Verify
108
- .venv-train/bin/python3 -c "
109
- import torch; print('cuda:', torch.cuda.is_available())
110
- import mamba_ssm; from mamba_ssm.ops.triton.ssd_combined import mamba_chunk_scan_combined
111
- print('ssd_combined:', 'REAL' if mamba_chunk_scan_combined else 'None')
112
- from unsloth import FastLanguageModel; print('unsloth: OK')
113
- "
114
- ```
115
-
116
- ---
117
-
118
- *Created: 2026-03-30 04:35 UTC*
 
docs/VLLM_SETUP_NOTES.md DELETED
@@ -1,146 +0,0 @@
1
- # vLLM Setup Notes — DGX Spark (GB10, aarch64)
2
-
3
- Date: 2026-03-27
4
- Status: ✅ Working — vLLM 0.18.0 with NemotronH
5
-
6
- ---
7
-
8
- ## Previous Failure (v7 era)
9
-
10
- We tried vLLM earlier and it failed. The failure was caused by the same CUDA path issue as mamba-ssm:
11
- - `pip install vllm` pulled `torch 2.10.0+cpu` (CPU-only wheel) from default PyPI
12
- - No `libtorch_cuda.so` → `ImportError` on first import
13
- - We gave up and used llama.cpp server instead (off-policy GRPO)
14
-
15
- ## What Works Now
16
-
17
- vLLM 0.18.0 loads `NemotronH` natively via its own `nemotron_h.py` backend.
18
- **No `mamba-ssm` required** — vLLM has its own Mamba-2 implementation (`mamba2.py`).
19
-
20
- Confirmed working:
21
- - `</think>` closes naturally (P(</think>) is non-trivial with vLLM's kernel)
22
- - Batch generation works (multiple prompts in one call)
23
- - Output quality is good: clean Lex-style questions, correct structure
24
-
25
- Example output:
26
- ```
27
- ✅ [Andrej Karpathy] think_end=2384 ntok=668
28
- "That's a great observation — if I gave you a single cat image right now,
29
- how would a neural network actually recognize it?"
30
-
31
- ✅ [Elon Musk] think_end=3527 ntok=791
32
- "When comparing AI risks to climate change, what specific mechanisms do
33
- you see as making AI a greater existential threat?"
34
-
35
- ✅ [A quantum physicist] think_end=1630 ntok=392
36
- "If time emerges from entanglement, does that imply that the flow of
37
- time is an illusion, or is there a deeper emergent structure?"
38
- ```
39
-
40
- ---
41
-
42
- ## Installation
43
-
44
- ### Separate venv (required — don't pollute .venv-train)
45
-
46
- ```bash
47
- cd /home/bobber/lex-ft
48
- python3 -m venv .venv-vllm
49
- source .venv-vllm/bin/activate
50
-
51
- # Step 1: Install CUDA torch FIRST (must use cu130 index, not default PyPI)
52
- pip install torch==2.10.0+cu130 \
53
- --index-url https://download.pytorch.org/whl/cu130
54
-
55
- # Step 2: Install vLLM (uses the torch already installed)
56
- CUDA_HOME=/usr/local/cuda-13.0 \
57
- PATH=/usr/local/cuda-13.0/bin:$PATH \
58
- pip install vllm
59
- ```
60
-
61
- **Do NOT** run `pip install vllm` without installing CUDA torch first — PyPI will pull the CPU-only torch wheel.
62
-
63
- ### Verify
64
-
65
- ```python
66
- import torch
67
- print(torch.__version__) # 2.10.0+cu130
68
- print(torch.cuda.is_available()) # True
69
-
70
- import vllm
71
- print(vllm.__version__) # 0.18.0
72
- ```
73
-
74
- ---
75
-
76
- ## Loading NemotronH
77
-
78
- ```python
79
- from vllm import LLM, SamplingParams
80
-
81
- llm = LLM(
82
- model='models/NVIDIA-Nemotron-3-Nano-4B',
83
- trust_remote_code=True,
84
- max_model_len=4096,
85
- gpu_memory_utilization=0.55, # leaves ~57GB for training venv
86
- dtype='bfloat16',
87
- )
88
- ```
89
-
90
- Cold start: ~80s (loads 4B safetensors). Warm: instant.
91
-
92
- GPU memory: with `gpu_memory_utilization=0.55`, vLLM uses ~70GB, leaving ~57GB for the HF training model.
93
-
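The split quoted above follows from the utilization fraction. A quick sketch, assuming 128 GB of unified memory on the GB10 (a figure these notes imply but do not state) and a hypothetical helper name:

```python
def memory_split(total_gb: float, utilization: float):
    """Return (GB reserved by vLLM, GB left for other processes)."""
    reserved = total_gb * utilization
    return reserved, total_gb - reserved

vllm_gb, free_gb = memory_split(total_gb=128.0, utilization=0.55)
```

This gives about 70 GB for vLLM and just under 58 GB for the training process, in line with the figures above.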
94
- ---
95
-
96
- ## Generation
97
-
98
- ```python
99
- from transformers import AutoTokenizer
100
- import re
101
-
102
- tok = AutoTokenizer.from_pretrained(
103
- 'models/NVIDIA-Nemotron-3-Nano-4B', trust_remote_code=True)
104
-
105
- # enable_thinking=True → prompt ends with <think>\n
106
- # Model generates thinking then closes </think> naturally
107
- msgs = [
108
- {'role': 'system', 'content': 'You are a Lex Fridman interviewer.\n\nGuest: Andrej Karpathy'},
109
- {'role': 'user', 'content': 'Neural networks are simple.'}
110
- ]
111
- prompt = tok.apply_chat_template(
112
- msgs, tokenize=False, add_generation_prompt=True, enable_thinking=True)
113
-
114
- params = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=1500)
115
- outputs = llm.generate([prompt], params)
116
-
117
- raw = outputs[0].outputs[0].text
118
- te = raw.find('</think>')
119
- answer = re.sub(r'<\|[^>]+\|>', '', raw[te+8:]).strip() if te >= 0 else ''
120
- ```
121
-
122
- For `enable_thinking=False` (no-think mode):
123
- ```python
124
- prompt = tok.apply_chat_template(
125
- msgs, tokenize=False, add_generation_prompt=True, enable_thinking=False)
126
- # Prompt ends with <think></think> — model answers directly, no thinking phase
127
- ```
128
-
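The answer-extraction step from the generation snippet can be wrapped in a helper so it is testable on its own. A sketch with a hypothetical function name. It reuses the regex from the snippet and returns an empty string when `</think>` never closes.

```python
import re

SPECIAL_TOKEN = re.compile(r"<\|[^>]+\|>")  # e.g. <|im_end|>

def extract_answer(raw: str) -> str:
    """Return the text after </think> with special tokens removed, or '' if unclosed."""
    _, closed, tail = raw.partition("</think>")
    if not closed:
        return ""
    return SPECIAL_TOKEN.sub("", tail).strip()

example = extract_answer("The guest mentioned time.</think>\nIs time an illusion?<|im_end|>")
```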
129
- ---
130
-
131
- ## Key Notes
132
-
133
- - vLLM uses its own Mamba-2 kernel (not mamba-ssm) — no mamba-ssm needed in .venv-vllm
134
- - The `nemotron_h.py` model backend handles the hybrid Mamba-2 + attention architecture
135
- - Warning `Add 3 padding layers, may waste at most 14.29% KV cache memory` is expected — harmless
136
- - `gpu_memory_utilization` must be set; default is 0.9 which will OOM when training venv also loads
137
- - For GRPO: use vLLM in one process, HF model in another (or sequential with GPU transfer)
138
-
139
- ---
140
-
141
- ## venv Summary
142
-
143
- | venv | Purpose | torch | key packages |
144
- |---|---|---|---|
145
- | `.venv-train` | HF training (LoRA + optimizer) | 2.11.0+cu130 | mamba-ssm 2.3.1, causal-conv1d 1.6.1, bitsandbytes |
146
- | `.venv-vllm` | vLLM generation | 2.10.0+cu130 | vllm 0.18.0, flashinfer 0.6.6 |