bobber committed on Apr 5

Commit

c3af311

verified ·

1 Parent(s): 257b921

Sync docs: group judge leaderboard (Qwen 3.5 27B + Gemma 4 31B), cloud model comparison (2026-04-05)


This view is limited to 50 files because it contains too many changes. See raw diff

Files changed (50)

docs/AUTONOMOUS_SESSION_2026-03-30.md +219 -0
docs/CURRENT_STATE_2026-03-20.md +95 -0
docs/CURRENT_STATE_2026-03-23.md +108 -0
docs/CURRENT_STATE_2026-03-26.md +304 -0
docs/CURRENT_STATE_2026-03-29.md +150 -0
docs/CURRENT_STATE_2026-03-30-evening.md +81 -0
docs/CURRENT_STATE_2026-03-30.md +159 -0
docs/DATA_CURATION_PLAN.md +131 -0
docs/EVAL_FRAMEWORK_2026-03-29.md +148 -0
docs/EVAL_RESULTS.md +319 -0
docs/FULL_FINETUNE_PLAN_2026-03-20.md +76 -0
docs/FUNCTIONAL_EVAL_DESIGN.md +134 -0
docs/GRPO_V11_DESIGN.md +141 -0
docs/GRPO_V11_POSTMORTEM.md +120 -0
docs/GRPO_V21_PLAN.md +78 -0
docs/GRPO_V21_SUCCESS_ANALYSIS.md +152 -0
docs/GRPO_V22_PLAN.md +61 -0
docs/GRPO_V23_PLAN.md +68 -0
docs/GRPO_V24_PLAN.md +33 -0
docs/GRPO_V3_POSTMORTEM.md +221 -0
docs/GRPO_V4_DESIGN.md +423 -0
docs/GRPO_V4_POSTMORTEM.md +129 -0
docs/GRPO_V7_DESIGN.md +164 -0
docs/GRPO_V8_CHANGES.md +55 -0
docs/GRPO_V8_ONPOLICY_PLAN.md +320 -0
docs/GRPO_V8_TRAINING_FLOW.md +184 -0
docs/KAGGLE_VS_OURS_COMPARISON.md +106 -0
docs/LEXFRIDMAN_INTERVIEWER_PLAN.md +183 -0
docs/LLAMA_FINETUNE_INVESTIGATION.md +44 -0
docs/LORA_V1_ANALYSIS.md +109 -0
docs/LORA_V2_NATIVE_RESULTS.md +66 -0
docs/MAMBA_SSM_BUILD_NOTES.md +114 -0
docs/NEMOTRON_GB10_DEEP_DIVE.md +321 -0
docs/NEMO_RL_SETUP_NOTES.md +168 -0
docs/ONNX_RETROSPECTIVE.md +404 -0
docs/OPTION2_SFT_DISTILLATION_PLAN.md +193 -0
docs/RETROSPECTIVE_2026-03-31.md +168 -0
docs/REWARD_V10_DESIGN.md +175 -0
docs/REWARD_V11_DESIGN.md +187 -0
docs/REWARD_V13_DESIGN.md +106 -0
docs/RL_VS_FILTERING_ANALYSIS_2026-03-30.md +107 -0
docs/SSM_SCAN_FIX_PLAN.md +202 -0
docs/SYNTHETIC_DATA_ANALYSIS_2026-03-30.md +153 -0
docs/TECHNICAL_CHALLENGES.md +181 -0
docs/TECHNICAL_REVIEW_2026-03-23.md +205 -0
docs/TRAINING_PLAN_V5.md +143 -0
docs/TRITON_PIPELINE_FIX.md +137 -0
docs/TRITON_SSM_SCAN_PLAN.md +114 -0
docs/VENV_SETUP.md +118 -0
docs/VLLM_SETUP_NOTES.md +146 -0

docs/AUTONOMOUS_SESSION_2026-03-30.md ADDED

	@@ -0,0 +1,219 @@

+# Autonomous Session — 2026-03-30 05:44 UTC to 13:00 UTC (9 AM EDT)
+## Authorization
+Bobber authorized autonomous decisions at 05:44 UTC.
+All decisions, assumptions, and results logged here in real time.
+## Starting State
+- **SFT v5 training:** running, step ~40/897, loss 8.2, ETA ~07:45 UTC
+- **W&B run:** `lex-sft-v5-4k-bnb8bit` → https://wandb.ai/bobber-cheng/lex-interviewer/runs/udqlwz88
+- **Base model benchmark:** 0.653 ± 0.333 (3-judge functional eval)
+- **Previous best SFT:** 0.467 (v4, 201 pairs — catastrophic forgetting)
+## Decision Framework
+### After eval completes:
+| Result | Interpretation | Action |
+|--------|---------------|--------|
+| score > 0.70 | Clear improvement — training working | Launch GRPO from sft-v5 checkpoint with reward_v11 |
+| 0.653–0.70 | Marginal improvement | Run LoRA variant (r=64, same data) to compare |
+| 0.60–0.653 | Slight degradation | Investigate: check if generated pairs (4,075) are hurting. Retrain on real Lex only (697 pairs) |
+| < 0.60 | Significant degradation | Likely overfitting. Check loss curve — if loss < 5, training ran too long. Try 1 epoch only |
+| Gibberish / < 0.30 | Complete failure | Model merge issue. Check generated questions manually first |
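The decision table above amounts to a threshold lookup on the eval score. A minimal sketch, assuming the bands are read top to bottom and that a score exactly at the base benchmark counts as "marginal"; `decide_next_action` and its return strings are hypothetical names for illustration, not the project's `auto_eval_and_decide.py`:

```python
BASE_SCORE = 0.653  # base model benchmark from "Starting State"


def decide_next_action(score: float, gibberish: bool = False) -> str:
    """Map a functional-eval score to the action listed in the decision table."""
    if gibberish or score < 0.30:
        return "complete failure: inspect generated questions manually"
    if score < 0.60:
        return "significant degradation: check loss curve, try 1 epoch"
    if score < BASE_SCORE:
        return "slight degradation: retrain on real Lex pairs only"
    if score <= 0.70:
        # Assumption: the boundary values 0.653 and 0.70 fall in this band.
        return "marginal improvement: run LoRA r=64 variant"
    return "clear improvement: launch GRPO from sft-v5 with reward_v11"


for s in (0.72, 0.667, 0.62, 0.45, 0.10):
    print(f"{s:.3f} -> {decide_next_action(s)}")
```

Keeping the thresholds in one function makes the boundary cases explicit, which the table leaves ambiguous.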
+### Key assumptions:
+1. Current loss trajectory (~8 at step 40, should reach 5-7 by step 897) = healthy full convergence
+2. The 4,075 generated pairs are useful if they passed 3/3 judges: imperfect, but they still carry signal
+3. 3-judge eval on contaminated held-out set is the only eval available; score differences > 0.05 are meaningful
+4. GRPO is the right follow-up IF SFT beats base — it starts from a stronger position
+---
+## Actions Log
+### 05:44 UTC — Session start
+- Verified training running: step ~40, loss 8.2, GPU 96%, ETA ~07:45 UTC
+- Will not restart training (5 restarts already, converging well, let it finish)
+- Set automation for: eval on completion, decision logging, doc sync
+### [In progress] Training monitoring
+- Checking every 30 min
+- Will auto-run eval when models/sft-v5 directory is populated
+### 06:22 UTC — Monitor check (cron)
+- Training still running: step 285/897, epoch 0.95, loss 4.79
+- Loss trajectory: 8.2→4.8 over ~285 steps — healthy convergence
+- GPU utilization: 96%
+- Rate: ~5.3 steps/min → ETA completion: ~08:15 UTC (4:15 AM EDT)
+- Last 5 logged losses: 3.69, 4.25, 4.40, 4.46, 4.79 (these five rise step to step, but all sit well below the starting loss of 8.2)
+- No action needed — training proceeding normally
+---
+*This document is updated in real time as decisions are made.*
+*Created: 2026-03-30 05:44 UTC*
+### 06:22 UTC — Monitor: training still running
+- Latest W&B run: run-20260330_052927-udqlwz88
+- Last metrics: {'loss': '4.464', 'grad_norm': '29.75', 'learning_rate': '8.099e-06', 'epoch': '0.938'}
+- Tmux progress:  32%|███▏      | 284/897 [52:27<1:39:24,  9.73s/it]
+### 06:42 UTC — Monitor: training still running
+- Latest W&B run: run-20260330_052927-udqlwz88
+- Last metrics: {'loss': '2.802', 'grad_norm': '30.5', 'learning_rate': '6.246e-06', 'epoch': '1.322'}
+- Tmux progress:  44%|████▍     | 397/897 [1:12:32<1:20:51,  9.70s/it]
+### 07:01 UTC — Monitor: training still running
+- Step 502/897 (56%), epoch 1.67, loss ~2.7-2.9
+- Loss has stabilized in 2.5-3.1 range (down from 8.2 at start, 4.8 at step 285)
+- "Writing model shards 100%" was a mid-training checkpoint save, not final — training resumed after
+- Rate: ~5.3 steps/min → ETA completion: ~08:15-08:30 UTC (4:15-4:30 AM EDT)
+- Recent losses: 2.703, 2.953, 3.115, 3.090, 2.902, 2.559, 2.978, 2.807, 2.667
+- No action needed — training proceeding normally, checkpoint saving working
+### 07:21 UTC — Monitor: training still running
+- Latest W&B run: run-20260330_052927-udqlwz88
+- Last metrics: {'loss': '2.109', 'grad_norm': '21.5', 'learning_rate': '2.484e-06', 'epoch': '2.04'}
+- Tmux progress:  68%|██████▊   | 613/897 [1:52:30<43:00,  9.09s/it]
+### 07:42 UTC — Monitor: training still running
+- Latest W&B run: run-20260330_052927-udqlwz88
+- Last metrics: {'loss': '1.855', 'grad_norm': '24.75', 'learning_rate': '1.004e-06', 'epoch': '2.409'}
+- Tmux progress:  80%|████████  | 722/897 [2:12:26<28:07,  9.64s/it]
+### 07:42 UTC — Monitor check (cron)
+- Training still running: step 724/897 (81%), epoch ~2.41, loss 1.86
+- Loss trajectory: 8.2 → 4.8 → 2.8 → 1.86 — steady convergence across 3 epochs
+- Learning rate near zero (1e-6), final cooldown phase
+- Rate: ~9.6s/step → ETA completion: ~08:10 UTC (4:10 AM EDT)
+- ~173 steps remaining (~28 min)
+- No action needed — training in final stretch, model save + auto-eval expected shortly after
+### 08:02 UTC — Monitor: training still running
+- Latest W&B run: run-20260330_052927-udqlwz88
+- Last metrics: {'loss': '2.008', 'grad_norm': '21.62', 'learning_rate': '1.297e-07', 'epoch': '2.794'}
+- Tmux progress:  93%|█████████▎| 835/897 [2:32:30<09:17,  8.99s/it]
+### 08:01 UTC — Monitor check (cron)
+- Training still running: step 836/897 (93%), epoch ~2.79, loss 2.01
+- Learning rate near zero (1.3e-7), final steps
+- Rate: ~9.1s/step → ETA completion: ~08:10 UTC (4:10 AM EDT)
+- ~61 steps remaining (~9 min)
+- models/sft-v5 not yet saved — training hasn't finished writing
+- functional_judge_sft_v5_vs_base.json does NOT exist yet
+- No action needed — training in final stretch, auto-eval will trigger on next monitor cycle
+### 08:22 UTC — Training finished; launching auto eval + decision script
+### 08:22 UTC — Auto-eval session starting
+### 08:22 UTC — Model found: 1 safetensors files
+### 08:22 UTC — Running functional judge eval: sft-v5 vs base
+### 08:22 UTC — Auto-eval failed (LoRA detection bug)
+- `auto_eval_and_decide.py` called `eval_functional_judge.py` with absolute path `/home/bobber/lex-ft/models/sft-v5`
+- `eval_functional_judge.py` used `model_path.startswith('models/')` to detect full fine-tune vs LoRA
+- Absolute path didn't match `models/` prefix → classified as LoRA → vLLM `enable_lora=True` → crash
+- **Fix applied:** Added `adapter_config.json` presence check + `model.safetensors` presence check for robust detection
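The detection fix described above can be sketched as a check on directory contents instead of the path string. This is a minimal illustration, not the code in `eval_functional_judge.py`: the function name `is_lora_adapter` and the exact way the two file checks are combined are assumptions.

```python
import tempfile
from pathlib import Path


def is_lora_adapter(model_path: str) -> bool:
    """Classify a checkpoint directory by its files, not by its path prefix."""
    path = Path(model_path)
    has_adapter_config = (path / "adapter_config.json").is_file()
    has_full_weights = (path / "model.safetensors").is_file()
    # Assumed rule: an adapter directory carries adapter_config.json,
    # while a full fine-tune carries model.safetensors and no adapter config.
    return has_adapter_config and not has_full_weights


# Demo with throwaway directories; absolute and relative paths behave the same.
with tempfile.TemporaryDirectory() as root:
    full = Path(root) / "sft-v5"
    full.mkdir()
    (full / "model.safetensors").touch()

    lora = Path(root) / "lora-r64"
    lora.mkdir()
    (lora / "adapter_config.json").touch()

    full_is_lora = is_lora_adapter(str(full.resolve()))
    lora_is_lora = is_lora_adapter(str(lora))

print(full_is_lora, lora_is_lora)
```

A sharded full checkpoint would not contain a single `model.safetensors`, so a production check would likely need to handle that layout as well.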
+### 08:28 UTC — Second failure: weight key mismatch
+- vLLM loaded `models/sft-v5` as full model but crashed: `KeyError: 'embedding.weight'`
+- Investigation: SFT training saved `backbone.embedding.weight` (singular) but base model uses `backbone.embeddings.weight` (plural)
+- **Fix applied:** Renamed key in safetensors file (backup at `model.safetensors.bak`)
+- This is a known quirk of the NemotronH model's HuggingFace vs internal naming
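The key rename above can be illustrated on a plain dictionary. This is a sketch under stated assumptions: `rename_state_dict_key` is a hypothetical helper, the values are string placeholders standing in for tensors, and `other.weight` is an invented key. On the real checkpoint the mapping would be loaded and written back with a safetensors reader and writer (for example `load_file` and `save_file` from `safetensors.torch`), after taking a backup as the log describes.

```python
def rename_state_dict_key(state_dict: dict, old_key: str, new_key: str) -> dict:
    """Return a copy of state_dict with old_key renamed to new_key, order preserved."""
    if old_key not in state_dict:
        raise KeyError(f"{old_key!r} not found in checkpoint")
    if new_key in state_dict:
        raise ValueError(f"{new_key!r} already exists; refusing to overwrite")
    return {(new_key if k == old_key else k): v for k, v in state_dict.items()}


# Placeholder values stand in for tensors; "other.weight" is an invented key.
saved = {"backbone.embedding.weight": "tensor-A", "other.weight": "tensor-B"}
fixed = rename_state_dict_key(
    saved, "backbone.embedding.weight", "backbone.embeddings.weight"
)
print(list(fixed))
```

Refusing to overwrite an existing target key guards against running the fix twice on a partially repaired file.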
+### 08:34 UTC — Manual re-run of eval (fixed)
+- Ran eval_functional_judge.py directly with both fixes applied
+- Base eval: 0.653 ± 0.333 (consistent with prior runs)
+- SFT-v5 eval: 0.667 ± 0.377
+## Eval Results
+| Model | Score | on_topic | uses_guest | probing | Avg Words |
+|-------|-------|----------|------------|---------|-----------|
+| Base  | 0.653 | 68% | 48% | 80% | 15 |
+| SFT v5 | 0.667 | 76% | 60% | 64% | 15 |
+| Delta | +0.014 | +8% | +12% | -16% | 0 |
+## Decision: MARGINAL / NEUTRAL
+**Delta +0.014 is within noise (n=25, std ~0.35).**
+**Detailed analysis:**
+- **on_topic:** 68% → 76% (+8%) — slight improvement ✅
+- **uses_guest:** 48% → 60% (+12%) — meaningful improvement in referencing guest content ✅
+- **probing:** 80% → 64% (-16%) — significant degradation ❌ — SFT model asks more surface-level follow-ups
+**Interpretation:** SFT v5 learned to reference the guest's content better (uses_guest +12%) but lost depth (probing -16%). The net effect is roughly neutral. 3 epochs of full fine-tune on 4,075 pairs shifted the model's question style but didn't clearly improve overall quality.
+**No auto-launch of GRPO.** Per the decision framework, this falls in the "marginal/neutral" bucket.
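The "within noise" judgment above can be checked with a rough standard-error calculation. This sketch assumes the two evals can be treated as independent samples of n=25 each; because both models were presumably scored on the same prompts, a paired comparison on per-prompt scores would be tighter, but those scores are not reproduced here.

```python
import math


def standard_error_of_difference(std_a: float, std_b: float, n: int) -> float:
    """Standard error of the difference of two means, n samples each (independent-samples approximation)."""
    return math.sqrt(std_a**2 / n + std_b**2 / n)


base_mean, base_std = 0.653, 0.333
sft_mean, sft_std = 0.667, 0.377
n = 25

delta = sft_mean - base_mean
se = standard_error_of_difference(base_std, sft_std, n)
z = delta / se
print(f"delta={delta:.3f}  se={se:.3f}  z={z:.2f}")
```

With a standard error of roughly 0.10, a delta of 0.014 is a small fraction of one standard error, which supports reading the result as neutral.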
+**Recommendations for Bobber:**
+1. Try LoRA (r=64) instead of full fine-tune — less catastrophic style shift
+2. Try 1 epoch only — the loss went from 8.2→1.86 which may be overfitting
+3. Filter training data more aggressively — the 4,075 generated pairs may be diluting signal
+4. Consider that probing degradation suggests the model is memorizing surface patterns
+### 08:41 UTC — Session wrapping up
+- Eval results saved to `results/functional_judge_sft_v5_vs_base.json`
+- Disabling overnight monitor cron job
+- Cleaning up stale cron jobs
+### 08:28 UTC — Eval failed with code 1
+### 08:28 UTC — Eval failed — manual intervention needed
+### 08:28 UTC — Auto eval failed
+(EngineCore pid=89624)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
+(EngineCore pid=89624)     return self.__get_result()
+(EngineCore pid=89624)            ^^^^^^^^^^^^^^^^^^^
+(EngineCore pid=89624)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
+(EngineCore pid=89624)     raise self._exception
+(EngineCore pid=89624)   File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 82, in collective_rpc
+(EngineCore pid=89624)     result = run_method(self.driver_worker, method, args, kwargs)
+(EngineCore pid=89624)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+(EngineCore pid=89624)   File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 459, in run_method
+(EngineCore pid=89624)     return func(*args, **kwargs)
+(EngineCore pid=89624)            ^^^^^^^^^^^^^^^^^^^^^
+(EngineCore pid=89624)   File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 332, in execute_model
+(EngineCore pid=89624)     return self.worker.execute_model(scheduler_output)
+(EngineCore pid=89624)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 817, in get_output
+(EngineCore pid=89624)   File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
+(EngineCore pid=89624)     return func(*args, **kwargs)
+(EngineCore pid=89624)            ^^^^^^^^^^^^^^^^^^^^^
+(EngineCore pid=89624)   File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 822, in execute_model
+(EngineCore pid=89624)     output = self.model_runner.execute_model(
+(EngineCore pid=89624)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+(EngineCore pid=89624)   File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
+(EngineCore pid=89624)     return func(*args, **kwargs)
+(EngineCore pid=89624)            ^^^^^^^^^^^^^^^^^^^^^
+(EngineCore pid=89624)   File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3625, in execute_model
+(EngineCore pid=89624)     logits_indices, spec_decode_metadata = self._prepare_inputs(
+(EngineCore pid=89624)                                            ^^^^^^^^^^^^^^^^^^^^^
+(EngineCore pid=89624)   File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1877, in _prepare_inputs
+(EngineCore pid=89624)     self.set_active_loras(
+(EngineCore pid=89624)   File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/v1/worker/lora_model_runner_mixin.py", line 89, in set_active_loras
+(EngineCore pid=89624)     return self._set_active_loras(
+(EngineCore pid=89624)            ^^^^^^^^^^^^^^^^^^^^^^^
+(EngineCore pid=89624)   File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/v1/worker/lora_model_runner_mixin.py", line 67, in _set_active_loras
+(EngineCore pid=89624)     self.lora_manager.set_active_adapters(lora_requests, lora_mapping)
+(EngineCore pid=89624)   File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/lora/worker_manager.py", line 176, in set_active_adapters
+(EngineCore pid=89624)     self._apply_adapters(requests)
+(EngineCore pid=89624)   File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/lora/worker_manager.py", line 263, in _apply_adapters
+(EngineCore pid=89624)     self.add_adapter(lora)
+(EngineCore pid=89624)   File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/lora/worker_manager.py", line 279, in add_adapter
+(EngineCore pid=89624)     lora = self._load_adapter(lora_request)
+(EngineCore pid=89624)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+(EngineCore pid=89624)   File "/home/bobber/lex-ft/.venv-vllm/lib/python3.12/site-packages/vllm/lora/worker_manager.py", line 151, in _load_adapter
+(EngineCore pid=89624)     raise LoRAAdapterNotFoundError(
+(EngineCore pid=89624) vllm.exceptions.LoRAAdapterNotFoundError: Loading lora adapter failed: No adapter found for /home/bobber/lex-ft/models/sft-v5
+    raise self._format_exception(outputs) from None
+vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
+Processed prompts:   0%|          | 0/25 [00:02<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
+[08:28 UTC] Eval failed with code 1
+[08:28 UTC] Eval failed — manual intervention needed

docs/CURRENT_STATE_2026-03-20.md ADDED

	@@ -0,0 +1,95 @@

+# Current State — Lex Fridman Interviewer Project
+Updated: 2026-03-20 UTC
+Project root: `/home/bobber/lex-ft`
+HF docs target: `bobber/lex-fridman-interviewer-project`
+## 1) What exists right now
+### Data
+- Main current training dataset: `data/interview_segments_v2.jsonl`
+- Size: **7,580 segments**
+- Validation status: **PASS** (`logs/validate_v2.log`)
+- Key stats from validation:
+  - user-first segments: **75.5%**
+  - assistant turns that are questions: **76%**
+  - target-quality score: **4.95/6**
+  - assistant length: P50 **26**, P75 **50**, P95 **112**, max **521** words
+- Important correction: older docs still say v2 data "needs validation"; that is now stale. The validation already passed.
+### Training artifacts
+- Base model folder: `models/NVIDIA-Nemotron-3-Nano-4B`
+- LoRA/SFT v1 adapter: `models/lex-interviewer-sft`
+- LoRA/SFT v2 adapter: `models/lex-interviewer-sft-v2`
+- Merged-ish export folder for v2 eval/gguf prep: `models/lex-interviewer-v2-gguf`
+- Q8 GGUF export of v2: `models/lex-interviewer-v2-gguf_gguf/NVIDIA-Nemotron-3-Nano-4B.Q8_0.gguf`
+### Scripts that matter
+- `scripts/train_sft.py` — current Unsloth training script (still LoRA-based, despite v2 data improvements)
+- `scripts/validate_training_data.py` — pre-train data gate
+- `scripts/restructure_data.py` — builds v2 data
+- `scripts/eval_via_server.py` — llama.cpp server eval
+- `scripts/eval_openai.py`, `scripts/eval_anthropic.py`, `scripts/eval_gemini.py`, `scripts/eval_gemini31.py`
+- `scripts/overnight_pipeline.sh` — Q8 eval pipeline + validation + training + export/eval chain
+## 2) Current scorecard
+### Best current interviewer behavior
+- **Nemotron 4B base**: about **4.35/5**
+- Still the strongest local interviewer behavior seen in this project so far
+### Larger model references
+- Nemotron 30B Q8: **4.25/5**
+- Qwen3.5-35B-A3B Q8: **3.55/5**
+- Qwen3.5-27B Q8: **3.25/5**
+### Fine-tuned models
+- SFT v1: poor
+- SFT v2 trained result: **2.00/5**, ~**292 words** average
+- Failure mode: lectures/monologues instead of sharp interviewer questions
+## 3) What the repo tells us technically
+### The current training script is still LoRA, not full fine-tune
+`train_sft.py` currently:
+- loads in 4-bit
+- applies LoRA with
+  - `r=16`
+  - target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
+- trainable params: **10,119,168 / 2,661,488,224 = 0.38%**
+This means the latest training run that produced the bad v2 result was **not** a full fine-tune. It was still adapter training on a Nemotron hybrid architecture.
+### Why that matters
+This strongly supports the current project decision:
+- **production path:** try **full fine-tune** next
+- **research/debug path:** inspect whether LoRA touched meaningful parts of the hybrid architecture later
+## 4) Most likely current diagnosis
+There are two different truths now:
+1. **The v2 dataset looks materially better and passes validation**
+2. **The current v2 trained model is still bad**
+That points away from "data is still obviously broken" and more toward one or more of:
+- LoRA is a poor fit for this Nemotron hybrid/Mamba-ish architecture
+- adapter targets are not reaching the behavior-critical parts of the model
+- training format/template still induces long assistant completions instead of concise interviewer questions
+- eval-time export/inference path may preserve some formatting/pathology from the LoRA route
+## 5) Practical state of the project right now
+If we had to continue from scratch today, the cleanest reading is:
+- the **dataset is good enough to justify another run**
+- the **current LoRA route is the thing under suspicion**
+- the next serious experiment should be **Unsloth full fine-tune**, not another LoRA-only iteration
+## 6) Immediate next action recommended
+Run a controlled **full fine-tune SFT** on `data/interview_segments_v2.jsonl` with:
+- step-50 checkpoint/eval gate
+- same eval harness as current leaderboard
+- no claims of success until the step-50 model beats base on actual interviewer eval
+## 7) Known housekeeping issue
+- `/home/bobber/lex-ft` is **not a git repo** right now, so changes here are not versioned unless manually synced elsewhere.

docs/CURRENT_STATE_2026-03-23.md ADDED

	@@ -0,0 +1,108 @@

+# Current State — Lex Fridman Interviewer Project
+Updated: 2026-03-23 UTC
+Project root: `/home/bobber/lex-ft`
+HF docs target: `bobber/lex-fridman-interviewer-project`
+## 1) What exists right now
+### Data
+- Training dataset: `data/interview_segments_v2.jsonl` — **7,580 segments**, validated ✅
+- 113 episodes crawled from lexfridman.com (official human transcripts with speaker labels)
+- Validation: 75.5% user-first, 76% question targets, avg score 4.95/6
+### Training artifacts
+| Artifact | Type | Status | Score |
+|----------|------|--------|-------|
+| `models/NVIDIA-Nemotron-3-Nano-4B` | Base model | ✅ Best performer | **4.35/5** |
+| `models/lex-interviewer-sft` | LoRA SFT v1 | ❌ Failed | 2.10/5 |
+| `models/lex-interviewer-sft-v2` | LoRA SFT v2 | ❌ Failed | 2.00/5 |
+| `models/lex-interviewer-grpo-lora-v3` | LoRA GRPO v3 (125 steps) | ❌ Failed | N/A (gibberish) |
+| `models/lex-interviewer-grpo-lora-v3-step{25,50,75,100,125}` | GRPO checkpoints | ❌ Failed | N/A (gibberish) |
+### Scripts
+| Script | Purpose |
+|--------|---------|
+| `scripts/train_grpo_v3.py` | GRPO v3 training (off-policy, proven broken) |
+| `scripts/reward_v3.py` | Heuristic reward for GRPO |
+| `scripts/train_sft.py` | SFT training (Unsloth LoRA) |
+| `scripts/validate_training_data.py` | Pre-train data gate |
+| `scripts/eval_via_server.py` | Eval via llama.cpp server |
+## 2) What happened with GRPO v3
+**Full analysis:** `docs/GRPO_V3_POSTMORTEM.md`
+GRPO v3 used a hybrid architecture: llama.cpp (base model, Q4_K_M) for generation + HF transformers (base + LoRA) for log-prob computation and gradient updates. Training ran for 125 steps (9.72h) with positive reward metrics throughout.
+**Result:** The merged LoRA model generates complete gibberish. Training rewards were misleading — they measured the base model's generation quality, not the LoRA's.
+**6 critical gaps identified:**
+1. 🔴 Off-policy generation — generator ≠ learner
+2. 🔴 No KL reference — no anchor preventing divergence
+3. 🔴 Token truncation — 512 max_length vs 800 max_tokens
+4. 🟡 Architecture mismatch — LoRA can't modify Mamba SSM dynamics
+5. 🟡 No credit assignment — uniform token weighting
+6. 🟡 Thinking in training, not in reward — gradient/reward mismatch on `<think>` tokens
+## 3) Current scorecard
+| Rank | Model | Score | Notes |
+|------|-------|-------|-------|
+| 🥇 | **Nemotron 4B base** | **4.35/5** | System prompt only, no fine-tuning |
+| 🥈 | GPT-5.4 | 4.30/5 | Cloud API |
+| 🥉 | Nemotron 30B-A3B Q8 | 4.25/5 | 7.5x larger, no improvement over the 4B base |
+| 4 | Gemini 3.1 Pro | 3.70/5 | Verbose |
+| 4 | Claude Opus 4.6 | 3.70/5 | Too wordy (121 words avg) |
+| ❌ | All fine-tuned variants | ≤3.20/5 | SFT, full SFT, and GRPO all worse than base |
+**Key insight:** A free, local 4B model beats all cloud APIs on this task. Every fine-tuning attempt has made it worse.
+## 4) Why fine-tuning keeps failing
+Three different approaches have now failed:
+| Approach | Why it failed |
+|----------|--------------|
+| LoRA SFT (v1, v2) | Only 0.38% params trained; 38/42 Mamba layers untouched |
+| Full SFT (v1) | Think-tag format mismatch — good output trapped in `<think>` |
+| Full SFT (v2) | Data quality issues — `user\n` contamination, low question rate |
+| GRPO v3 (LoRA) | Off-policy RL — 6 critical gaps between reward and learning |
+Common thread: the Nemotron hybrid Mamba-2 architecture is hostile to standard fine-tuning approaches. The base model's interviewer behavior comes from its pre-training, and current methods either can't reach it (LoRA) or corrupt it (full SFT format issues, off-policy RL divergence).
+## 5) Viable next steps
+### Option A: Ship the base model (recommended)
+The base model already outperforms every cloud API tested. Deploy it with a good system prompt. No fine-tuning needed.
+### Option B: SFT on curated GRPO completions
+GRPO training logged ~1000 completions. Filter for reward ≥ 0.5 to get high-quality Lex-style questions generated by the base model. Use these as supervised training data — no off-policy gap, no reward mismatch. This is distillation from the base model's own best outputs.
+### Option C: On-policy GRPO
+Periodically merge LoRA → GGUF and use the merged model for generation. This fixes the fatal off-policy gap but adds significant engineering complexity (a merge + convert cycle every N steps), and the merged v3 LoRA model has already been shown to produce gibberish.
+### Option D: Full SFT v3 with clean data
+Use the validated v2 data with proper chat template format. Previous full SFT v2 showed promise (3.20/5 vs 2.00/5 for LoRA) but still below base. Would need v3+ data with stricter quality filtering.
+## 6) Disk situation
+```
+/dev/nvme0n1p2  3.7T  3.4T  128G  97%  /
+```
+128 GB free after cleaning 128 GB of HuggingFace cache. The remaining locked cache (gpt-oss-120b, GLM-4.6V) requires `sudo` to remove.
+## 7) Key files for reference
+| File | What it tells you |
+|------|-------------------|
+| `docs/GRPO_V3_POSTMORTEM.md` | Why off-policy GRPO failed (6 gaps) |
+| `docs/EVAL_RESULTS.md` | Full eval leaderboard |
+| `docs/TECHNICAL_CHALLENGES.md` | All technical challenges + resolutions |
+| `logs/grpo_v3_completions.jsonl` | All GRPO training completions + rewards |
+| `logs/run_grpo_v3.log` | GRPO training log |
+---
+*Previous state: `docs/CURRENT_STATE_2026-03-20.md`*

docs/CURRENT_STATE_2026-03-26.md ADDED

	@@ -0,0 +1,304 @@

+# Current State — Lex Fridman Interviewer Project
+Updated: 2026-03-26 23:54 UTC
+Project root: `/home/bobber/lex-ft`
+HF docs target: `bobber/lex-fridman-interviewer-project`
+---
+## 1) Where We Are
+### The core result
+**Base Nemotron 4B (GGUF, llama.cpp) scores 7.12/10 on the interviewer eval.**
+Every fine-tuning attempt so far has degraded the model. We now understand *why* — and have fixed the critical generation bug.
+| Approach | Score | Status |
+|---|---|---|
+| **Base model (llama.cpp GGUF)** | **7.12/10** | ✅ Best |
+| SFT v4 Triton (250 steps, LoRA) | 5.36/10 | ❌ Below base |
+| SFT v2 (100 steps, LoRA) | 5.08/10 | ❌ Below base |
+| GRPO v3 (off-policy, 125 steps) | gibberish | ❌ Off-policy broken |
+| GRPO v4 (mock rmsnorm) | oscillating loss | ❌ Mock kernel broken |
+| GRPO v6 runs 1-10 (full FT) | collapse/garbage | ❌ gen_max_tokens bug + unstable |
+| **GRPO v7 run10 (current)** | step 0: 4.04/5 best | 🔄 Running |
+---
+## 2) The Critical Bug — gen_max_tokens Was Way Too Small
+**This is the most important finding from today's session.**
+Nemotron 4B is a thinking model. Every response goes through a `<think>` block before answering. Token budget:
+- Thinking phase: ~600–1100 tokens (varies with prompt)
+- Actual answer (the question): ~20–100 tokens
+- **Total needed: ~700–1200 tokens minimum**
+All previous GRPO runs (v3–v6) used `gen_max_tokens=300` or `800`. This meant:
+- The model was **mid-thinking** when generation was cut off
+- `strip_thinking()` found no `</think>` → returned empty string
+- `reward_fn("", ...)` → **reward = 0.0 on every completion**
+- The only completions scoring >0 were those that skipped thinking entirely
+- GRPO was **actively training the model to not think** → template collapse
+Evidence: llama.cpp server with `max_tokens=4000` produces clean 14–27 word Lex-style questions scoring 4.0+/5. Same prompt with `max_new_tokens=800` → truncated mid-think → empty → reward=0.
+### strip_thinking() was also wrong
+Old (broken):
+```python
+re.sub(r'<think>.*?</think>', '', text)  # only removes closed think blocks
+```
+New (correct):
+```python
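+def strip_thinking(text: str) -> str:
+    """Return only the visible answer that follows the thinking block."""
+    if '</think>' in text:
+        return text[text.index('</think>') + len('</think>'):].strip()
+    elif '<think>' in text:
+        return ''  # truncated mid-think: discard
+    else:
+        return text.strip()  # no thinking block: use as-is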
+```
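The failure chain described in this section (cap too small, no `</think>`, empty answer) can be reproduced with a toy example. This is a sketch under loud assumptions: whitespace-separated words stand in for model tokens, `truncate` is a hypothetical stand-in for the generation cap, and the stripping rule is restated so the block runs on its own.

```python
def strip_thinking(text: str) -> str:
    """Same rule as the corrected function above, restated so this block runs alone."""
    if "</think>" in text:
        return text[text.index("</think>") + len("</think>"):].strip()
    if "<think>" in text:
        return ""  # truncated mid-think
    return text.strip()


def truncate(text: str, max_tokens: int) -> str:
    """Crude stand-in for a generation cap: keep the first max_tokens whitespace tokens."""
    return " ".join(text.split()[:max_tokens])


thinking = " ".join(["step"] * 900)  # a thinking phase of about 900 "tokens"
full_output = f"<think> {thinking} </think> What surprised you most about that result?"

short_budget = strip_thinking(truncate(full_output, 800))   # cut off mid-think
long_budget = strip_thinking(truncate(full_output, 4000))   # full answer survives

print(repr(short_budget))
print(repr(long_budget))
```

With the 800-token cap the visible answer is empty, so any reward function applied to it scores nothing; with the 4000-token cap the question comes through intact.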
+---
+## 3) GRPO v6 — Full Fine-Tune Collapse Analysis
+Ran 10+ full fine-tune runs (42 layers, ~70GB model) before diagnosing the root cause.
+### What happened in each run
+**Runs 1–10:** Various hyperparameters, all failed the same way:
+- Steps 0–20: Semi-coherent output, rewards mostly 0.0 (from gen_max_tokens truncation)
+- Steps 20–50: Template collapse into "We need to respond to user. Probably the user is asking..."
+- Steps 50+: Mode collapse — all 4 completions identical → zero advantage → zero gradient
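The last step above (identical completions, zero advantage, zero gradient) follows directly from group normalization. A minimal sketch, assuming the common mean/std normalization within a group of 4 completions; the function name, the `eps` value, and the reward numbers are illustrative, not taken from the training script:

```python
import statistics


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


varied = group_advantages([0.0, 0.0, 0.8, 0.4])     # distinct completions
collapsed = group_advantages([0.3, 0.3, 0.3, 0.3])  # four identical completions

print([round(a, 2) for a in varied])
print(collapsed)
```

When every completion in a group earns the same reward, each advantage is zero, so the policy-gradient term contributes nothing and only the KL term (if any) still moves the weights.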
+**Run 5 specific failure:** Triton workspace OOM — the backward kernel allocated a new ~3GB workspace each step (size varies with sequence length). After 3 steps, CUDA allocator fragmented. Step 3 took **51 minutes** to find a contiguous 3GB block.
+Fix applied: persistent pre-allocated workspace stored as function attribute `ssm_scan_triton_bwd._triton_bwd_workspace`. Same buffer reused every step.
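The function-attribute pattern mentioned above can be sketched in plain Python. This is an illustration only: a `bytearray` stands in for the CUDA workspace tensor, `get_workspace` is a hypothetical name, and the grow-when-too-small rule is an assumption (the note above describes a pre-allocated buffer without saying how its size is chosen).

```python
def get_workspace(num_bytes: int) -> bytearray:
    """Return a persistent buffer, allocating only when none exists or it is too small."""
    buf = getattr(get_workspace, "_workspace", None)
    if buf is None or len(buf) < num_bytes:
        buf = bytearray(num_bytes)      # stand-in for a device allocation
        get_workspace._workspace = buf  # cached on the function object
    return buf


first = get_workspace(1024)
second = get_workspace(512)   # smaller request reuses the same buffer
third = get_workspace(2048)   # larger request triggers one reallocation

print(first is second, third is first, len(third))
```

Reusing one buffer avoids a fresh large allocation on every backward pass, which is the fragmentation source the Run 5 note identifies.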
+### Root cause: full fine-tune is too unstable for this task
+- LoRA targets only 4 attention layers (1.03% of params) — stable
+- Full fine-tune trains all 42 layers — one bad gradient step destroys base model behavior
+- lr=1e-5 with no warmup → immediate catastrophic update
+---
+## 4) GRPO v7 Design
+LoRA + correct token budget + warmup + llama.cpp generation.
+### Architecture
+```
+Per step:
+  1. llama.cpp server generates 4 completions (enable_thinking=True via GGUF)
+     → server returns content (visible answer) + reasoning_content (thinking)
+     → no stripping needed — llama.cpp handles </think> boundary natively
+  2. reward_fn(content, prompt) for each completion
+  3. GRPO advantages (normalize within group)
+  4. HF model forward: compute log-probs on full completion (think + answer tokens)
+  5. reference_lp: LoRA disabled (same weights, base model behavior)
+  6. loss = -adv * mean_lp_visible + kl_coef * KL(policy||ref)
+  7. loss.backward(); optimizer.step()
+```
+### Config (run10, currently running)
+```
+--steps 500
+--lr 2e-5
+--kl-coef 0.05
+--generations 4
+--gen-tokens 4000      # safe headroom for full thinking chain
+--force-think-close 800
+--warmup 30            # linear warmup to avoid early catastrophic update
+LoRA r=32, alpha=64, targets: q/k/v/o/gate/up/down_proj
+```
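The `--warmup 30` setting above corresponds to a linear ramp of the learning rate. A minimal sketch, assuming the rate stays constant at `--lr` after warmup (the config does not state a decay schedule) and that step 0 uses the first nonzero value; `warmup_lr` is a hypothetical helper:

```python
def warmup_lr(step: int, peak_lr: float = 2e-5, warmup_steps: int = 30) -> float:
    """Linear warmup over warmup_steps, then hold at peak_lr (assumed constant afterwards)."""
    if step >= warmup_steps:
        return peak_lr
    return peak_lr * (step + 1) / warmup_steps


for step in (0, 14, 29, 30, 499):
    print(step, f"{warmup_lr(step):.2e}")
```

Starting near `peak_lr / 30` keeps the first updates small, which is the stated purpose of the warmup: avoiding an early catastrophic update.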
+### Generation: llama.cpp server (off-policy, for now)
+The HF model in Python has near-zero P(`</think>`) — it starts thinking but never closes the block. This is model behavior: NVIDIA trained it with their own reasoning infrastructure (vLLM + `selective_state_update` CUDA kernel).
+llama.cpp handles this via `reasoning_format: "deepseek"` + `thinking_forced_open: true` — it detects reasoning content and manages `</think>` injection.
+**Off-policy gap:** generations come from GGUF model, training updates HF model. This is the same gap that broke GRPO v3. Importance sampling correction is not yet implemented.
+### run10 progress (as of 23:54 UTC)
+```
+Step 0: reward mean=1.928  best: "When you say the self disappears during meditation,
+         how does that experience feel different from ordinary states of mind?" → 4.04/5
+Step 1: reward mean=0.980  best: "What do you think the most profound consequence of
+         unregulated genetic selection for intelligence might be, beyond the obvious?" → 3.92/5
+```
+- Log: `results/grpo_v7_run10_stdout.log`
+- W&B: `wejcyyj5` — https://wandb.ai/bobber-cheng/lex-interviewer/runs/wejcyyj5
+- PID: 35834 (PGID=SID=35834, properly detached)
+- llama-server: PID 38286, port 30000
+---
+## 5) Major Breakthrough: mamba-ssm + causal-conv1d Built on GB10
+### The problem
+All previous HF-based generation fell back to Python (slow, wrong decode):
+```
+WARNING: The fast path is not available because one of
+(selective_state_update, causal_conv1d_fn, causal_conv1d_update) is None.
+Falling back to the naive implementation.
+```
+Without `selective_state_update`, the decode step runs in Python with BF16 — and produces near-zero P(`</think>`) because the SSM state diverges from training conditions.
+### Root cause of build failure
+DGX Spark uses **CUDA 13.0**, but `/usr/bin/nvcc` symlinks to a **CUDA 12.0** toolchain. `pip install mamba-ssm` picked up the wrong compiler and failed with CUDA version mismatch.
+The actual CUDA 13.0 nvcc is at `/usr/local/cuda-13.0/bin/nvcc`.
+### Fix
+```bash
+CUDA_HOME=/usr/local/cuda-13.0 \
+PATH=/usr/local/cuda-13.0/bin:$PATH \
+TORCH_CUDA_ARCH_LIST="12.0" \
+pip install mamba-ssm causal-conv1d --no-build-isolation
+```
+Compiled aarch64 CUDA kernels for SM 12.1 (Blackwell). Build took ~45 minutes.
+### Result
+```python
+from mamba_ssm.ops.triton.selective_state_update import selective_state_update
+print(selective_state_update)
+# → <function selective_state_update at 0xf03c0de6bce0>  ✅ (not None)
+from causal_conv1d import causal_conv1d_fn
+print(causal_conv1d_fn)
+# → <function causal_conv1d_fn at ...>  ✅ (not None)
+```
+Installed versions:
+- `mamba_ssm-2.3.1-cp312-cp312-linux_aarch64.whl` (351 MB)
+- `causal_conv1d-1.6.1`
+- Cached at: `~/.cache/pip/wheels/28/83/54/d45107838...`
+### What this unlocks
+With `selective_state_update` available, the HF model decode runs via the real CUDA kernel (same code path as NVIDIA's training). This should fix the P(`</think>`) ≈ 0 issue and enable **fully on-policy GRPO** — removing the off-policy gap entirely.
+Testing in progress at session end. If confirmed working, GRPO v8 will switch generation from llama.cpp to the HF model directly.
+---
+## 6) vLLM — Working ✅ (2026-03-27)
+vLLM 0.18.0 successfully loads NemotronH via its own `nemotron_h.py` backend. **No mamba-ssm needed** — vLLM has its own Mamba-2 kernel.
+**Previous attempts failed** because `pip install vllm` pulled `torch 2.10.0+cpu` (CPU-only). Fix: install CUDA torch first.
+```bash
+cd /home/bobber/lex-ft && source .venv-vllm/bin/activate
+pip install torch==2.10.0+cu130 --index-url https://download.pytorch.org/whl/cu130
+pip install vllm
+```
+Confirmed working: `</think>` closes naturally, batch generation works, output quality is good.
+See `docs/VLLM_SETUP_NOTES.md` for full installation and usage guide.
+---
+## 7) NVIDIA's Actual Training Method (from NeMo repo)
+From reviewing the [NeMo Nemotron training branch](https://github.com/NVIDIA-NeMo/Nemotron/tree/nano-3-training):
+- **Generation:** vLLM (which has native Mamba-2 CUDA support)
+- **Reasoning control:** `enable_thinking=True` for RL training, `enable_thinking=False` for no-think mode
+  - `enable_thinking=True` → prompt ends with `<think>\n` (model thinks then answers)
+  - `enable_thinking=False` → prompt ends with `<think></think>` (model answers directly)
+- **Off-policy correction:** `use_importance_sampling_correction=True`
+- **10% no-think samples** mixed in during training
+- **Verifiable tasks** (math, code, JSON schema) for binary rewards
+Our task differs: we're fine-tuning on a style task (interviewer) rather than verifiable capabilities, so the reward is heuristic, not binary. That makes RL harder, but the training method is the same.
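A minimal sketch of the reasoning control and the 10% no-think mixing described above. The helper names `prompt_suffix` and `sample_thinking_flags` are hypothetical, and a real chat template does more than append a suffix; only the two suffix strings and the 10% rate come from the notes.

```python
import random

def prompt_suffix(enable_thinking: bool) -> str:
    """Return the text the prompt ends with for each reasoning mode."""
    return "<think>\n" if enable_thinking else "<think></think>"

def sample_thinking_flags(n: int, no_think_rate: float = 0.10, seed: int = 0) -> list:
    """Draw one enable_thinking flag per rollout; about no_think_rate of them are False."""
    rng = random.Random(seed)
    return [rng.random() >= no_think_rate for _ in range(n)]

# Example: flags for 1000 rollouts and the realized share of no-think samples.
flags = sample_thinking_flags(1000)
no_think_share = flags.count(False) / len(flags)
```

Sampling the flag per rollout keeps the mix close to 10% on average without needing a fixed schedule.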
+---
+## 7) Current Codebase
+### Core files
+| File | Purpose |
+|---|---|
+| `grpo_v7_train.py` | GRPO v7 — LoRA, warmup, 4000-token budget |
+| `run_grpo_v7.sh` | Launch script (detached) |
+| `ssm_generate.py` | `generate_cached`, `generate_cached_batch` |
+| `ssm_scan_triton.py` | Triton fwd+bwd SSM kernel |
+| `ssm_scan_backward.py` | Sequential backward reference |
+| `ssm_decode_fused.py` | Fused Triton decode step |
+| `tests/validate_correct_scan.py` | Mamba layer patcher |
+| `grpo_v6_train.py` | GRPO v6 — full fine-tune (deprecated) |
+### Results on disk
+| Path | Contents |
+|---|---|
+| `results/grpo_v7_run10_stdout.log` | Current run (live) |
+| `results/grpo_v7_run*/` | Earlier v7 smoke tests |
+---
+## 8) Eval Leaderboard (as of 2026-03-26)
+### 5-Score (heuristic, used in GRPO reward)
+| Rank | Model | Score | Words avg |
+|---|---|---|---|
+| 🥇 | **Base Nemotron 4B (llama.cpp)** | **4.35/5** | 59 |
+| 2 | GPT-5.4 | 4.30/5 | 52 |
+| 3 | Nemotron 30B-A3B Q8 | 4.25/5 | 62 |
+| 4 | Gemini 3.1 Pro | 3.70/5 | 82 |
+| 4 | Claude Opus 4.6 | 3.70/5 | 121 |
+| 6 | Qwen3.5-35B-A3B Q8 | 3.55/5 | 51 |
+| 7 | SFT v1 (LoRA) | 2.10/5 | — |
+| 8 | SFT v2 (LoRA) | 2.00/5 | 292 |
+### 10-Score (canonical)
+| Rank | Model | Score |
+|---|---|---|
+| 🥇 | **Base Nemotron 4B (llama.cpp)** | **7.12/10** |
+| 2 | SFT v4 Triton | 5.36/10 |
+| 3 | SFT v2 | 5.08/10 |
+---
+## 9) What's Next: GRPO v8 (On-Policy with vLLM)
+v7 run10 was terminated (it was off-policy and had stalled). Next is **GRPO v8**, which is fully on-policy using vLLM.
+### Design (from NVIDIA Nemotron 3 Nano cookbook)
+- vLLM generates from the **same HF weights** being trained (on-policy)
+- Importance sampling correction for vLLM/HF probability mismatch
+- PPO-style ratio clipping: `ratio_clip_min=0.2, ratio_clip_max=0.28` (asymmetric, from NVIDIA)
+- Overlong filtering: exclude truncated completions from loss
+- 10% no-think mixing (`enable_thinking=False`) — from NVIDIA recipe
+- lr: 3e-6 (lower than v7's 2e-5, matches NVIDIA post-SFT RL lr)
+- LoRA r=32, warmup 30 steps
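A minimal sketch of the asymmetric ratio clipping listed above. It assumes the two values are offsets around 1, so the ratio is clamped to [1 - 0.2, 1 + 0.28]; that reading and the function name `clipped_surrogate` are assumptions, not something these notes confirm.

```python
def clipped_surrogate(ratio: float, advantage: float,
                      clip_min: float = 0.2, clip_max: float = 0.28) -> float:
    """PPO-style objective term with asymmetric clipping of the probability ratio.

    ratio is pi_new(token) / pi_old(token); the result is the value to maximize.
    """
    # Assumption: clip bounds are offsets from 1.0, giving the range [0.8, 1.28].
    clipped = min(max(ratio, 1.0 - clip_min), 1.0 + clip_max)
    # Taking the minimum keeps the pessimistic (lower) estimate of the objective.
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage the upper bound (1.28) caps how far a good completion can be up-weighted; with a negative advantage the lower bound (0.8) limits how much credit the policy gets for already having moved away from a bad one.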
+### Two-process design
+Training process (.venv-train) holds HF model + LoRA optimizer.
+Generation process (.venv-vllm) holds vLLM.
+Communicate via checkpoint files: train → save LoRA → vLLM reload → generate → train...
+See `docs/GRPO_V8_ONPOLICY_PLAN.md` for complete design.
+---
+## 10) Key Technical Documents
+| Doc | Summary |
+|---|---|
+| `docs/GRPO_V3_POSTMORTEM.md` | 6 gaps that broke off-policy GRPO |
+| `docs/GRPO_V4_POSTMORTEM.md` | Why rmsnorm mock silently broke training |
+| `docs/TRAINING_PLAN_V5.md` | Full plan to beat base model |
+| `docs/TRITON_SSM_SCAN_PLAN.md` | Triton kernel design |
+| `docs/EVAL_RESULTS.md` | Full leaderboard + dimension breakdown |
+---
+*Previous state: `docs/CURRENT_STATE_2026-03-23.md`*

docs/CURRENT_STATE_2026-03-29.md ADDED

	@@ -0,0 +1,150 @@

+# Current State — Lex Fridman Interviewer Project
+Updated: 2026-03-29 21:30 UTC
+Project root: `/home/bobber/lex-ft`
+HF docs target: `bobber/lex-fridman-interviewer-project`
+---
+## Today's Work (2026-03-29) — Full Summary
+### Major Discoveries
+#### 1. Eval Contamination — 50/50 Held-Out Set is Training Data
+The `data/held_out_eval.jsonl` (50 prompts) was built from the same transcript crawl as `interview_segments_v2.jsonl`. Every guest statement in the eval appears verbatim in training data. Every `ref_question` is identical to the training label.
+**Impact:** All previous eval scores (v2, v3, clean) are invalid as measures of generalization. The base model's apparent strength (7.39/10) is partly explained by pretraining on Lex transcripts (public internet) and eval contamination — not pure capability.
+#### 2. eval_v3 Circularity Problem
+Even with a clean eval set, `eval_v3` (Claude Opus judge, Lex-style dimensions) is circular:
+- Training reward (log-ratio) optimizes toward Lex-style outputs
+- Eval measures Lex-style conformance
+- Both signals measure the same proxy → cannot detect real quality changes
+#### 3. GRPO Makes Things Marginally Worse (Confirmed by 3 Eval Methods)
+| Eval method | Base | GRPO step_100 | Delta |
+|---|---|---|---|
+| eval_v3 (Claude Opus judge, held-out) | 7.39/10 | 7.38/10 | -0.01 |
+| eval_functional (vLLM + cosine sim, held-out) | 0.363 | 0.331 | -0.032 |
+| **eval_functional_judge (vLLM + Qwen3.5-4B, held-out)** | **0.653** | **0.613** | **-0.040** |
+All three evals agree: GRPO step_100 is worse than base, even on contaminated data.
+Root cause (from per-prompt analysis):
+- `uses_guest` dropped 8pp: step_100 references the specific guest statement less
+- `probing` dropped 4-8pp: step_100 asks for elaboration instead of probing deeper
+- The log-ratio reward pushed toward "average Lex question" (shorter, more archetypal), stripping the contextual specificity that makes questions functionally good
+#### 4. New Canonical Eval: eval_functional_judge.py
+Built and validated a domain-agnostic functional eval:
+- **Policy generation:** vLLM (Nemotron 4B + optional LoRA), batch mode, ~10s for 25 questions
+- **Scoring:** Qwen3.5-4B binary judges (3 questions per prompt)
+  - `on_topic`: Is the question about the same subject?
+  - `uses_guest`: Does it reference the guest's specific words/concepts?
+  - `probing`: Does it probe deeper, not just ask for repetition?
+- **Score:** mean of 3 binary votes, normalized 0-1
+- **Validated:** SHARP=3/4, GENERIC=2/4, OFFTOPIC=1/4, RESTATE=2/4 across all domains including niche technical
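A minimal sketch of how the three binary votes can be aggregated into the 0-1 score. The function names are hypothetical, and the use of a population standard deviation is an assumption; the notes do not say which estimator the script uses.

```python
from statistics import mean, pstdev

JUDGES = ("on_topic", "uses_guest", "probing")

def question_score(votes: dict) -> float:
    """Mean of the three binary votes, so one of 0, 1/3, 2/3, 1."""
    return sum(bool(votes[j]) for j in JUDGES) / len(JUDGES)

def summarize(all_votes: list) -> dict:
    """Aggregate per-question scores and per-judge pass rates."""
    scores = [question_score(v) for v in all_votes]
    summary = {"score_mean": mean(scores), "score_std": pstdev(scores)}
    for j in JUDGES:
        summary[j] = mean(1.0 if v[j] else 0.0 for v in all_votes)
    return summary
```

Because every question gets exactly three votes, the mean score equals the average of the three per-judge pass rates. The base model's reported rates (68%, 48%, 80%) average to 0.653, matching its Judge Score.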
+**Why Qwen3.5-4B works (0.8B doesn't):**
+- 0.8B: logit gap Yes/No = ~1.4 (no real discrimination)
+- 0.8B with CoT: generates "Thinking Process..." template, doesn't answer
+- 4B: clean YES/NO with `enable_thinking=False`, correct discrimination across domains
+**Venv:** `.venv-vllm` (Python 3.12) — upgraded transformers to 5.3.0 (vllm warns but works)
+```bash
+# Canonical eval command:
+cd /home/bobber/lex-ft && source .venv-vllm/bin/activate
+python -u eval_functional_judge.py \
+  --model base \
+  --model2 results/grpo_v8/ckpt_step_100 \
+  --n 25 \
+  --output results/functional_judge_base_vs_step100_v2.json
+```
+#### 5. Small Model Judge Research
+Tested Qwen3.5-0.8B as decomposed binary judge — failed:
+- Forced-choice (next-token logit sampling): ~75/25 split regardless of content, no discrimination
+- With CoT (`enable_thinking=True`): generates "Thinking Process" template, no final YES/NO
+- With `enable_thinking=False`: correct format but inverted judgments (RESTATE scores higher than SHARP)
+- Root cause: 0.8B can't model "what's absent from text" — required for novelty judgment
+**Threshold:** 4B is the minimum for reliable interview question quality judgment.
+#### 6. Training Data Quality Analysis
+Of 17,778 (guest → Lex question) pairs in training data:
+- **6,460 (36%)** "lazy" questions: <8 words or specificity < 0.02 (e.g. "How do you approach that?")
+- **7,072 (40%)** real questions: end in `?`, 5-60 words
+- **1,287 (7%)** "sharp" questions: >15 words, specificity > 0.1 (reference guest's specific content)
+The training set has a 5:1 ratio of lazy to sharp questions — we've been training on noise.
+#### 7. GRPO v11 Training (2 runs)
+Both used reward_v11 (info-gain via Qwen 0.5B sim), resumed from GRPO v8 step_100:
+- Run 1: 148 steps, saved ckpts at step_50 + step_100
+- Run 2: 108 steps (killed by gateway), no new ckpt
+- Reward positive throughout (1.1-1.4/5), no collapse
+- But functional eval confirms: didn't improve quality
+---
+## Eval Leaderboard (as of 2026-03-29, functional judge — canonical going forward)
+| Model | Judge Score | on_topic | uses_guest | probing | Notes |
+|-------|-------------|----------|------------|---------|-------|
+| **Base Nemotron 4B** | **0.653 ± 0.333** | 68% | 48% | 80% | ← best |
+| GRPO v11 step_100 | 0.613 ± 0.349 | 68% | 40% | 76% | log-ratio reward |
+**Note:** Scores are on contaminated held-out set. True generalization performance unknown until clean eval set is built.
+---
+## Root Cause Summary: Why Nothing Has Beaten Base
+| Approach | Why it failed |
+|---|---|
+| LoRA SFT (v1, v2) | Pattern matches surface; suppresses base model reasoning |
+| Full SFT (v3, v4, v5) | Same; base model CoT generates better questions than SFT patterns |
+| Off-policy GRPO (v3) | Generator ≠ learner — fundamentally broken |
+| On-policy GRPO v4–v7 | reward_v8 (heuristic) gamed at step ~50 |
+| On-policy GRPO v8/v11 | reward_v10/v11 (log-ratio) is "average Lex" → strips specificity |
+**The real bottleneck:** Training data has 5:1 lazy-to-sharp ratio. Reward signal optimizes toward the modal (average) output, not the exceptional one. The base model generates both; training regresses toward the mean.
+---
+## Next Step: Data Curation
+**Plan:**
+1. Filter existing 7,580 training segments → keep only the 1,287 "sharp" (guest → question) pairs
+2. Score those pairs with `eval_functional_judge` to get ground-truth quality ranking
+3. SFT on top-tier pairs only (target: ~500 highest-scoring)
+4. Eval with `eval_functional_judge` vs base — first clean signal
+**Why this can beat base:**
+- SFT on base model outputs → ceiling is base
+- SFT on curated *ground truth human Lex* at his best → ceiling is Lex's best questions
+- Lex's best questions score 0.368 mean on info-gain (validated) — above base model's 0.363
+**Build a genuinely clean eval set** (separate task):
+- Crawl ~10 recent Lex episodes NOT in the 113-episode training set
+- Use as held-out eval going forward
+---
+## Infrastructure Notes
+| File | Purpose | Status |
+|------|---------|--------|
+| `eval_functional_judge.py` | Canonical eval — vLLM + Qwen3.5-4B judges | ✅ Production ready |
+| `eval_functional_vllm.py` | Alt eval — vLLM + cosine sim (weaker) | ✅ Works but deprecated |
+| `eval_clean.py` | Old eval — contaminated held-out, eval_v3 scorer | ❌ Retire |
+| `eval_judge_test.py` | Small model judge research | Archive |
+| `.venv-vllm` | Canonical Python env (transformers 5.3.0 + vllm 0.18.0) | ✅ Use this |
+| `results/functional_judge_base_vs_step100_v2.json` | Latest eval results | ✅ Ground truth |
+---
+*Previous state: earlier version of this file (2026-03-29 17:00 UTC)*
+*Created: 2026-03-29 21:30 UTC*

docs/CURRENT_STATE_2026-03-30-evening.md ADDED

	@@ -0,0 +1,81 @@

+# Current State — Lex Fridman Interviewer Project
+*Updated: 2026-03-30 19:19 UTC*
+---
+## Eval Leaderboard (functional judge — canonical)
+| Rank | Model | Score | on_topic | uses_guest | probing | Notes |
+|------|-------|-------|----------|------------|---------|-------|
+| 🥇 | **LoRA v1** (r=64, 1ep, original data) | **0.733** | 72% | 56% | 92% | Best — first to beat base |
+| 2 | SFT v5 (LoRA r≈16, 3ep) | 0.667 | 76% | 60% | 64% | Probing damaged |
+| 3 | Base Nemotron 4B | 0.653 | 68% | 48% | 80% | Pretrained baseline |
+| 4 | LoRA v2 (filtered+upsampled) | 0.640 | 64% | 48% | 80% | No improvement; flat |
+| 5 | GRPO v11 step_100 | 0.613 | 68% | 40% | 76% | reward_v11 anti-correlated |
+**Bottleneck:** uses_guest at 56% (need 70%+). Data interventions have failed.
+---
+## What We've Tried Today (2026-03-30)
+| Experiment | Hypothesis | Result | Lesson |
+|---|---|---|---|
+| SFT v5 → full fine-tune | More params = better | 0.667, probing -16pp | Unsloth fell back to LoRA r≈16 silently |
+| LoRA v1 (r=64, LR=2e-4, 1ep) | Correct LoRA config | **0.733 ✅** | First win over base |
+| Echo-targeted prompt gen | "MUST reference words" prompt | uses_guest -8pp | Base model template bias can't be prompted away |
+| reward_v11 correlation test | Info-gain targets uses_guest | -0.098 anti-correlation | reward_v11 rewards genericity, not specificity |
+| LoRA v2 (filtered+upsampled) | Template contamination is root cause | 0.640, no change | Data-side interventions can't fix weight-level priors |
+---
+## Root Cause Understanding
+The uses_guest gap (48%→56%→stuck) is a **weight-level prior in the Mamba SSM layers**:
+1. The template prior (`P("How do you"|context)`) lives in frozen Mamba-2 layers (38/42 layers)
+2. LoRA can only modify 4 attention layers (1.01% of params)
+3. SFT can only ADD positive signal — cannot SUBTRACT the template prior
+4. Data filtering removes positive template examples but the prior persists
+5. **Only RL (GRPO) can directly suppress the prior** via negative advantage signal
+## Next Step: GRPO with reward_v12 from LoRA v1
+### Why GRPO now
+- LoRA v1 at 0.733 gives a stronger starting point than base (0.653)
+- reward_v12 is validated ✅ (HIGH=1.000 > LOW=0.606, gates work)
+- GRPO gradient flows through ALL 42 layers — can suppress Mamba template prior
+- The model already generates non-template openers ~8% of the time — seeds to amplify
+### reward_v12 design
+```
+reward = ug^0.67 × pr^0.33 + lexical_bonus
+```
+- `ug` = `log(P(YES)/P(NO))` from uses_guest judge (continuous)
+- `pr` = `log(P(YES)/P(NO))` from probing judge (continuous)
+- `lexical_bonus` = vocab overlap fraction (fast, no model needed)
+- Hard gate: `passes_structural_check` (question mark, min length, no collapse patterns)
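A rough sketch of how the reward formula above could be computed. Several pieces are assumptions that the notes do not specify: the log-odds are squashed to (0, 1) with a sigmoid before the fractional powers (a raw negative log-odds cannot be raised to 0.67), a failed gate returns 0, and the gate here only checks the question mark and a hypothetical minimum length. This is not the contents of `reward_v12.py`.

```python
import math

def squash(log_odds: float) -> float:
    """Map log-odds to (0, 1); equals P(YES) / (P(YES) + P(NO)). Assumed step."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def passes_structural_check(question: str, min_words: int = 5) -> bool:
    """Hypothetical gate: question mark and minimum length (collapse checks omitted)."""
    return question.strip().endswith("?") and len(question.split()) >= min_words

def lexical_bonus(guest: str, question: str) -> float:
    """Fraction of distinct guest words that reappear in the question."""
    guest_words = set(guest.lower().split())
    if not guest_words:
        return 0.0
    return len(guest_words & set(question.lower().split())) / len(guest_words)

def reward_v12_sketch(guest: str, question: str,
                      ug_log_odds: float, pr_log_odds: float) -> float:
    """reward = ug^0.67 * pr^0.33 + lexical_bonus, behind a hard structural gate."""
    if not passes_structural_check(question):
        return 0.0  # assumed gate behavior
    ug, pr = squash(ug_log_odds), squash(pr_log_odds)
    return ug ** 0.67 * pr ** 0.33 + lexical_bonus(guest, question)
```

The exponents weight `uses_guest` about twice as heavily as `probing`, and the product form means a near-zero score on either judge pulls the main term toward zero.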
+### GRPO stack
+- **Model loading**: Unsloth `FastLanguageModel` (kernel patching, chunk_size=64 for GB10)
+- **LoRA adapter**: `lora/sft-lora-v1` as starting checkpoint
+- **Training**: trl `GRPOTrainer` (patched for transformers 5.x compat)
+- **Reward**: `reward_v12.py` (Qwen3.5-4B judge, batched)
+- **Infrastructure**: systemd-run --user (survives gateway restarts)
+### Files
+- `reward_v12.py` — validated reward function
+- `lora/sft-lora-v1/` — starting checkpoint (0.733)
+- `data/sft_v6_train.jsonl` — filtered dataset (prompts only, for GRPO rollouts)
+---
+## Infrastructure Lessons (today)
+| Issue | Fix | Rule |
+|---|---|---|
+| Gateway OOM kills training | `systemd-run --user` | Always launch training as systemd service |
+| `torch_empty_cache_steps=250` causes VRAM spike | Set to 10 | Always set in TrainingArguments |
+| `use_gradient_checkpointing=False` → VRAM leak | `"unsloth"` GC | Always enable for LoRA training |
+| Training not resumable | `resume_from_checkpoint=latest_ckpt` | Always checkpoint + auto-resume |
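A small sketch of the auto-resume rule in the last row, assuming checkpoints use the Hugging Face Trainer's default `checkpoint-<step>` directory naming. The helper name `latest_checkpoint` is hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

def latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the checkpoint-N subdirectory with the largest N, or None if absent."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for child in root.iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare by step number so checkpoint-100 beats checkpoint-50.
    return str(max(candidates, key=lambda c: c[0])[1])
```

The returned path (or `None` on a fresh run) can be passed as `trainer.train(resume_from_checkpoint=...)`, so a restarted systemd service picks up where the killed run stopped.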

docs/CURRENT_STATE_2026-03-30.md ADDED

	@@ -0,0 +1,159 @@

+# Current State — Lex Fridman Interviewer Project
+Updated: 2026-03-30 14:10 UTC
+Project root: `/home/bobber/lex-ft`
+HF docs target: `bobber/lex-fridman-interviewer-project`
+---
+## Today's Work (2026-03-30) — Full Summary
+### 1. SFT Retrospective — 4 Attempts (v1–v4)
+All used 201 curated pairs. Results: base=0.653, SFT=0.467 (worse).
+Root causes identified:
+- v1: mamba_ssm mock breaks backprop (grad_norm 200, loss stuck at 34)
+- v2: Native transformers 5.3.0 — no compiled SSM kernels (180s/step naive fallback)
+- v3: Triton patch + mock — wrong starting loss (56 vs 28), crashed at step 25 (`_get_tied_weight_keys` bug)
+- v4: Real mamba_ssm via LD_LIBRARY_PATH + Triton patch — completed but 201 pairs too few (underfitting, loss 21, model degrades)
+Key lesson: **3.97B params / 201 pairs ≈ 20M params per example, which is wildly unstable**. Need 1,000-5,000+ pairs.
+---
+### 2. Data Expansion
+**Crawled all 225 Lex transcript URLs → 114 unique episodes** (the other 111 URLs were duplicates from pagination). No new episodes available.
+**Augmented with base model generation:**
+- 3,480 unique real Lex pairs (from 114 episodes, structural filter)
+- Generated 3 completions × 3,480 guests = 10,440 candidates
+- Structural filter → 9,364 total pairs saved to `data/lex_pairs_10k.jsonl`
+**Judged with vLLM Qwen3.5-4B (batch mode, ~34 min for 9,364 × 3 = 28,092 queries):**
+- score=1.0 (3/3 judges): 4,772 pairs (51%)
+- score=0.67 (2/3 judges): 1,768 pairs (19%)
+- score=0.33: 1,497 pairs (16%)
+- score=0.00: 1,327 pairs (14%)
+**Training set: `data/sft_v5_train.jsonl` — 4,772 perfect-score pairs**
+- Real Lex: 697 | Generated: 4,075
+- Avg score: 1.0 (by definition)
+---
+### 3. Venv Infrastructure Fix
+**Problem:** `.venv-train` had torch CPU-only → Unsloth couldn't initialize GPU.
+**Fix sequence:**
+1. Install torch 2.10.0+cu130 (CUDA build) into `.venv-train`
+2. Recompile `mamba_ssm` from source against new torch (`.so` had ABI mismatch)
+3. Install `unsloth` 2026.3.17
+**Result: `.venv-train` now has:**
+- `torch 2.10.0+cu130` — CUDA enabled
+- `unsloth 2026.3.17` — memory optimizations, 2x faster training
+- `mamba_ssm 2.3.1` — **real compiled Triton kernels** (no mock needed)
+- Requires: `LD_LIBRARY_PATH` pointing at `.venv-train/lib/python3.12/site-packages/torch/lib` (set in launch script)
+**Key insight:** No mock needed with real mamba_ssm. No `patch_mamba_layers` needed. Pure HF Trainer + Unsloth + real kernels.
+---
+### 4. SFT v5 — LoRA r≈16 (COMPLETED, 2026-03-30)
+> ⚠️ **Naming correction:** Despite `full_finetuning=True` being set, Unsloth silently fell back
+> to LoRA for NemotronH. Training log showed **10.1M / 2.66B (0.38%) trainable**, equivalent to
+> **LoRA r≈16**. This is NOT a full fine-tune. See `docs/LORA_V1_ANALYSIS.md` for details.
+**Script:** `scripts/train_sft_v5.py`
+**W&B run:** `lex-sft-v5-4k-bnb8bit` → https://wandb.ai/bobber-cheng/lex-interviewer/runs/udqlwz88
+**Config (actual):**
+| Parameter | Value |
+|-----------|-------|
+| Data | `data/sft_v5_train.jsonl` (4,772 pairs) |
+| Architecture | **LoRA r≈16** (Unsloth fallback — NOT full fine-tune) |
+| Trainable params | **10.1M / 2.66B (0.38%)** |
+| Framework | Unsloth + HF Trainer |
+| Epochs | 3 |
+| LR | 1e-5 ⚠️ (too low for LoRA — designed for full fine-tune) |
+| Batch | 2 × 8 = 16 effective (BNB 8-bit Adam) |
+| Max seq | 512 |
+| Steps | 897 total |
+| Final loss | 1.86 |
+**Result: 0.667 functional score — marginal vs base 0.653.**
+Probing DAMAGED: 80% → 64% (memorized question surface format over 3 epochs).
+**Why it underperformed:**
+1. Low rank (r≈16): insufficient capacity for nuanced task
+2. LR=1e-5 too low for LoRA (correct for full fine-tune, not for adapters)
+3. 3 epochs: memorized surface pattern, destroyed depth
+---
+## Venv Reference
+| Venv | Python | Torch | CUDA | mamba_ssm | Unsloth | Use for |
+|------|--------|-------|------|-----------|---------|---------|
+| `.venv-train` | 3.12 | 2.10+cu130 | ✅ | ✅ Real (compiled) | ✅ 2026.3.17 | **SFT training** |
+| `.venv-vllm` | 3.12 | 2.9+cu130 | ✅ | ❌ x86 .so | ❌ | **vLLM inference/eval** |
+| `routangseng/.venv` | 3.13 | 2.9+cu130 | ✅ | ❌ broken .so | ✅ | Legacy/Qwen training |
+**Critical:** `.venv-train` requires `LD_LIBRARY_PATH` set before launch:
+```bash
+TORCH_LIB=/home/bobber/lex-ft/.venv-train/lib/python3.12/site-packages/torch/lib
+export LD_LIBRARY_PATH="$TORCH_LIB:${LD_LIBRARY_PATH:-}"
+source /home/bobber/lex-ft/.venv-train/bin/activate
+```
+---
+## Pipeline Reference
+```
+Transcripts (114 eps)
+    ↓ scripts/crawl_transcripts.py
+data/transcripts/*.json
+    ↓ scripts/augment_with_base_model.py (vLLM .venv-vllm)
+data/lex_pairs_10k.jsonl  (9,364 pairs)
+    ↓ scripts/judge_vllm.py (vLLM + Qwen3.5-4B, .venv-vllm)
+data/lex_pairs_10k_judged.jsonl  (scored)
+    ↓ filter judge_score==1.0
+data/sft_v5_train.jsonl  (4,772 perfect pairs)
+    ↓ scripts/train_sft_v5.py (.venv-train + Unsloth)
+models/sft-v5/
+    ↓ eval_functional_judge.py (vLLM + Qwen3.5-4B, .venv-vllm)
+results/
+```
+---
+## Eval Infrastructure
+| File | Purpose | Venv |
+|------|---------|------|
+| `eval_functional_judge.py` | Canonical eval — vLLM + Qwen3.5-4B 3-judge | .venv-vllm |
+| `scripts/judge_vllm.py` | Batch judge for dataset scoring | .venv-vllm |
+| `scripts/augment_with_base_model.py` | Generate + filter training data | .venv-vllm |
+| `scripts/train_sft_v5.py` | SFT training | .venv-train |
+| `run_sft_v5.sh` | Launch script (sets LD_LIBRARY_PATH) | .venv-train |
+---
+## Key Results So Far
+| Model | Judge Score | on_topic | uses_guest | probing | Notes |
+|-------|-------------|----------|------------|---------|-------|
+| Base Nemotron 4B | 0.653 ± 0.333 | 68% | 48% | 80% | Baseline |
+| GRPO step_100 | 0.613 | — | — | — | log-ratio reward, slightly worse |
+| SFT curated v4 (201 pairs) | 0.467 | — | — | — | Too few pairs, catastrophic forgetting |
+| SFT v5 **(LoRA r≈16, 3ep, LR=1e-5)** | 0.667 | 76% | 60% | **64%** ⚠️ | Unsloth fallback; probing damaged |
+| **LoRA v1 (r=64, 1ep, LR=2e-4)** | **0.733** | 72% | 56% | **92%** ✅ | First to beat base |
+---
+*Created: 2026-03-30 04:35 UTC*

docs/DATA_CURATION_PLAN.md ADDED

	@@ -0,0 +1,131 @@

+# Data Curation Plan — Sharp Lex Questions
+Created: 2026-03-29
+Status: Next step (not yet started)
+---
+## Motivation
+Training data quality analysis revealed:
+- 17,778 total (guest → Lex question) pairs
+- **6,460 (36%) "lazy"**: <8 words or specificity <0.02 ("How do you approach that?")
+- **1,287 (7%) "sharp"**: >15 words, specificity >0.1 (reference guest's specific content)
+- Ratio: 5:1 lazy to sharp — we've been drowning signal in noise
+Every fine-tuning attempt to date trained on the full mixed dataset. The model learned "average Lex" (which is worse than the base model's contextual generation).
+**Key insight:** SFT on curated *Lex at his best* has a higher ceiling than SFT on base model outputs. Lex's best questions are real human ground truth — they can exceed what the base model generates spontaneously.
+---
+## Why This Can Beat Base
+- SFT on base model outputs → ceiling is base model
+- SFT on reward-filtered base outputs → still ceiling is base
+- SFT on curated ground truth (sharp Lex questions) → ceiling is Lex's best
+Validated: real sharp Lex questions score ~0.37 mean info-gain, vs base model 0.36. The margin is small but real, and with proper curation we can select the top quintile.
+---
+## Step 1: Filter Training Data → Sharp Questions
+```python
+# Criteria for "sharp" question:
+# 1. Ends in '?'
+# 2. 10-50 words (not too short = generic, not too long = rambling)
+# 3. Specificity score > 0.10 (shares >10% of guest's 5+ char words)
+# 4. Not a statement ("Right.", "Exactly.", "So...")
+# 5. Not a filler ("How does that make you feel?", "Tell me more...")
+# Expected yield: ~1,287 pairs from 7,580 training segments
+# (~17% of segments, or ~7% of the 17,778 pairs, since many segments have multiple turns)
+```
+**Script:** `scripts/curate_sharp_questions.py` (to build)
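A minimal sketch of the filter criteria listed in Step 1, not the planned `scripts/curate_sharp_questions.py`. The tokenization, the `FILLERS` set, and the function names are assumptions made for illustration; only the thresholds (ends in `?`, 10-50 words, specificity > 0.10 over the guest's 5+ character words) come from the plan.

```python
import re

# Illustrative filler examples only; a real list would be longer.
FILLERS = {"how does that make you feel?", "tell me more."}

def content_words(text: str, min_len: int = 5) -> set:
    """Lowercased alphabetic words with at least min_len characters."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if len(w) >= min_len}

def specificity(guest: str, question: str) -> float:
    """Share of the guest's 5+ character words that the question reuses."""
    guest_words = content_words(guest)
    if not guest_words:
        return 0.0
    return len(guest_words & content_words(question)) / len(guest_words)

def is_sharp(guest: str, question: str) -> bool:
    """Apply the Step 1 criteria to one (guest, question) pair."""
    q = question.strip()
    n_words = len(q.split())
    return (
        q.endswith("?")              # a question, not a statement
        and 10 <= n_words <= 50      # neither generic-short nor rambling
        and q.lower() not in FILLERS
        and specificity(guest, q) > 0.10
    )
```

Normalizing by the guest's vocabulary (rather than the question's) means long guest turns need more overlapping words to clear the same threshold.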
+---
+## Step 2: Score with eval_functional_judge
+Run the 3-judge eval on the curated pairs to rank them:
+```python
+# For each (guest, sharp_question) pair:
+# - Run on_topic, uses_guest, probing judges
+# - Keep top 500 by score
+# - This gives us the "Lex at his best" dataset
+# Expected: ~500-700 pairs scoring 3/3 judges
+```
+**Note:** This is different from eval — we're scoring HUMAN questions (Lex's), not model-generated ones. These are the gold examples we want the model to learn from.
+---
+## Step 3: SFT on Curated Data
+```bash
+# Use existing full-SFT pipeline (train_full_sft_v3_optimized.py)
+# Dataset: ~500 top-scored (guest, Lex_question) pairs
+# Epochs: 3-5 (small dataset needs more passes)
+# LR: 1e-5 (conservative — small dataset)
+# Max seq len: 512 (questions are short)
+```
+Key difference from all previous SFT runs:
+- Previous: all 7,580 segments (5:1 noise ratio)
+- Now: ~500 curated sharp pairs (near-100% signal)
+---
+## Step 4: Eval with eval_functional_judge
+```bash
+python -u eval_functional_judge.py \
+  --model checkpoints/sft_curated/checkpoint-best \
+  --model2 base \
+  --n 25
+```
+**Success criterion:** Judge score > 0.653 (base model baseline)
+**If successful:** First fine-tuned model to beat base in project history
+---
+## Step 5 (if Step 4 succeeds): Build Clean Eval Set
+The contamination problem makes absolute scores meaningless. Once we have a model that beats base on the contaminated set, we need to verify it generalizes:
+```python
+# Crawl 10 recent Lex episodes not in training (post-2024)
+# Extract ~50 guest utterances
+# Use as held-out eval going forward
+```
+---
+## Timeline Estimate
+| Step | Time | Notes |
+|------|------|-------|
+| Build curation script | 1h | Filter + specificity scoring |
+| Run curation + judge scoring | 30 min | Judge 1,287 pairs |
+| SFT training | 2-4h | ~500 pairs, 3 epochs |
+| Eval | 15 min | eval_functional_judge |
+| Total | ~4-6h | |
+---
+## Files to Build
+| File | Purpose |
+|------|---------|
+| `scripts/curate_sharp_questions.py` | Filter + score training data |
+| `data/sharp_questions_curated.jsonl` | Output: top ~500 pairs |
+| `run_sft_curated.sh` | Training launch script |
+---
+*Created: 2026-03-29 21:30 UTC*

docs/EVAL_FRAMEWORK_2026-03-29.md ADDED

	@@ -0,0 +1,148 @@

+# Eval Framework — Lex Fridman Interviewer
+Updated: 2026-03-29
+Status: Complete rewrite after contamination discovery + judge validation
+---
+## TL;DR
+**Use `eval_functional_judge.py` for all future evals.**
+Old evals (eval_v2, eval_v3, eval_clean) are retired — contaminated data + circular signal.
+---
+## What Went Wrong With Previous Evals
+### Problem 1: Eval Contamination
+`data/held_out_eval.jsonl` was built from the same 113-episode crawl as training data.
+Result: 50/50 held-out prompts appear verbatim in `interview_segments_v2.jsonl`.
+All eval scores before 2026-03-29 evening are invalid as generalization measures.
+### Problem 2: Lex Circularity
+`eval_v3` (Claude Opus judge, Lex-style dimensions) measures:
+- Philosophical depth in Lex's style
+- Curiosity in Lex's style
+- Specificity in Lex's style
+Both the training reward (log-ratio: "sounds like Lex") and the eval measure the same proxy.
+The base model's 7.39/10 is likely explained in part by pretraining on Lex's public transcripts.
+Fine-tuning toward this eval cannot exceed pretraining — it just reinforces the same surface patterns.
+### Problem 3: Qwen 0.5B Simulator Fails on Niche Topics
+`eval_functional.py` used Qwen 0.5B to simulate guest responses, then cosine similarity.
+Works for common topics (politics, philosophy). Fails for technical content (SSM/BF16/Mamba):
+- Qwen 0.5B doesn't know what "exp(cumsum(A)) underflow" means
+- Generates plausible-sounding but semantically wrong responses
+- Expert question scores LOWER than generic question (inverted signal)
+---
+## The New Canonical Eval: eval_functional_judge.py
+### Architecture
+```
+held_out_eval.jsonl
+       │
+       ▼
+[vLLM: Nemotron 4B] ──batch 25 prompts──► questions (10s)
+       │
+       ▼ (del llm, empty_cache)
+[Qwen3.5-4B Judge] ──3 binary judges per question──► scores
+       │
+       ▼
+score = mean(on_topic, uses_guest, probing) ∈ [0, 1]
+```
+### The 3 Judges
+| Judge | Prompt summary | What it catches |
+|-------|---------------|-----------------|
+| `on_topic` | Is question about same subject as guest? | Off-topic tangents |
+| `uses_guest` | Does it reference guest's specific words/concepts? | Generic questions that ignore what was said |
+| `probing` | Does it probe deeper, not just ask for repetition? | "Can you say more about that?" questions |
+### Why These 3 Judges
+From per-prompt analysis of base vs GRPO step_100:
+- The biggest quality difference was in `uses_guest` (-8pp) and `probing` (-4-8pp)
+- `on_topic` was stable — both models stay on subject
+- These 3 together correctly rank: SHARP > RESTATE > GENERIC > OFFTOPIC
+### Validation Results
+| Question type | Score | on_topic | uses_guest | probing |
+|---|---|---|---|---|
+| SHARP (probing, specific) | 3/4 | Y | Y | Y |
+| GENERIC (on topic, not specific) | 2/4 | Y | N | N |
+| RESTATE (asks for repetition) | 2/4 | Y | Y | N |
+| OFFTOPIC | 1/4 | N | N | N |
+Tested across: general AI/LM topics, niche technical (SSM/BF16), political/historical.
+4B handles all domains correctly. 0.8B fails (inverted judgments).
+---
+## Running Evals
+### Standard comparison
+```bash
+cd /home/bobber/lex-ft && source .venv-vllm/bin/activate
+python -u eval_functional_judge.py \
+  --model base \
+  --model2 results/grpo_v8/ckpt_step_100 \
+  --n 25 \
+  --output results/my_eval.json
+```
+### Single model
+```bash
+python -u eval_functional_judge.py --model base --n 25
+python -u eval_functional_judge.py --model results/grpo_v8/ckpt_step_100 --n 25
+```
+### Runtime
+- Base model only: ~8 min (2 min vLLM load + 10s gen + 5 min judging)
+- Two model comparison: ~15 min total (vLLM reloads for step_100 + LoRA)
+---
+## Known Limitations
+1. **Contaminated held-out set** — all 50 prompts are in training data. Use for comparison only; absolute scores don't reflect true generalization.
+2. **Stochastic generation** — temperature=0.7 means re-runs vary slightly. Use n≥25 for signal.
+3. **Judge agreement** — 3 binary votes is coarse. High variance (std ~0.33). Need n≥50 for statistically significant deltas.
+4. **No clean eval set yet** — need to crawl recent episodes not in training data.
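A short sketch of why small deltas are hard to trust at this sample size. It treats the two models' scores as independent samples, which is an assumption: both models are scored on the same prompts, so a paired test would be tighter. The function names are hypothetical; the means and standard deviations are the reported base and GRPO step_100 judge scores.

```python
import math

def standard_error(std: float, n: int) -> float:
    """Standard error of a mean estimated from n prompts."""
    return std / math.sqrt(n)

def delta_z(mean_a: float, std_a: float,
            mean_b: float, std_b: float, n: int) -> float:
    """z-score of the difference between two independent means, n prompts each."""
    se = math.sqrt(std_a ** 2 / n + std_b ** 2 / n)
    return (mean_a - mean_b) / se

# Base (0.653 ± 0.333) vs GRPO step_100 (0.613 ± 0.349).
z_25 = delta_z(0.653, 0.333, 0.613, 0.349, 25)
z_50 = delta_z(0.653, 0.333, 0.613, 0.349, 50)
```

At n=25 the z-score for that 0.04 gap is roughly 0.4, well inside the noise, so comparisons of this size are best read as directional.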
+---
+## TODO: Build Clean Eval Set
+```python
+# Target: 50 guest utterances from episodes NOT in interview_segments_v2.jsonl
+# Criteria:
+#   - Recent episodes (post 2024, likely post Nemotron pretraining cutoff)
+#   - Guest domains: mix of technical, political, philosophical, creative
+#   - Guest utterances: 50-200 words, substantive statements
+# Episodes crawled: 113 (all in training)
+# Need: ~10 new episodes → ~50 new eval prompts
+```
+---
+## Deprecated Evals
+| File | Why deprecated |
+|------|---------------|
+| `eval_clean.py` | Contaminated held-out set; eval_v3 scorer (Lex circularity) |
+| `eval_functional.py` | Qwen 0.5B sim fails on niche topics |
+| `eval_functional_vllm.py` | Cosine sim still weak; replaced by judge |
+| `scripts/eval_v2.py` | Lex-style dimensions; circular |
+| `scripts/eval_v3.py` | Same |
+| `eval_judge_test.py` | Research script; not for production eval |
+---
+*Created: 2026-03-29 21:30 UTC*

docs/EVAL_RESULTS.md ADDED

	@@ -0,0 +1,319 @@

+> **Status:** ✅ UPDATED 2026-04-05 — Group judge (Qwen 3.5 27B + Gemma 4 31B majority vote) leaderboard added. Cloud model comparison.
+# Eval Results — Lex Fridman AI Interviewer
+---
+## ⚠️ Eval Framework History
+| Period | Eval method | Bias risk | Status |
+|---|---|---|---|
+| Pre-2026-03-29 | 5-score / 10-score (Claude Opus, Lex-style) | High circularity | Legacy only |
+| 2026-03-29 | info-gain functional eval | anti-correlation with uses_guest | Abandoned |
+| 2026-03-30+ | 3-judge functional (on_topic × uses_guest × probing) | Low — no Lex style | **Canonical** |
+| 2026-04-03+ | Same 3-judge, **thinking-enabled** (`enable_thinking=True` + `reasoning_parser=nemotron_v3`) | Low | **Current canonical** |
+| 2026-04-05+ | **Group judge**: Qwen 3.5 27B + Gemma 4 31B majority vote per dimension | Lowest — multi-model | **Current canonical (cross-model)** |
+---
+## ═══════════════════════════════════════════════════════════════
+## CURRENT BEST: GRPO v21 — 0.867 (thinking-enabled)
+## ═══════════════════════════════════════════════════════════════
+Adapter: `/home/bobber/lex-ft/lora/grpo-v21`
+ONNX: `bobber/lex-interviewer-nemotron-4b-grpo-v21`
+Space: `bobber/lex-interviewer-chat`
+---
+## Thinking-Enabled Functional Eval Leaderboard (canonical, 2026-04-03)
+*Eval: `eval_functional_judge.py --enable-thinking`, 25 held-out prompts, `enable_thinking=True`, `reasoning_parser=nemotron_v3`*
+| Rank | Model | Score | on_topic | uses_guest | probing | Avg words | Notes |
+|------|-------|-------|----------|------------|---------|-----------|-------|
+| 🥇 | **GRPO v21** | **0.867 ± 0.231** | 84% | 80% | **96%** | ~13 | **Best ever** |
+| 2 | GRPO v22 | 0.813 ± 0.314 | 84% | 72% | 88% | ~10 | Less clipping, more generic |
+| 3 | Base Nemotron 4B | 0.760 ± 0.371 | — | — | — | — | Strong baseline |
+| 3 | GRPO v23 | 0.760 ± 0.371 | 84% | 68% | 76% | ~10 | reward_v13 from v21, tied base |
+| 5 | LoRA v2 native | 0.707 ± 0.331 | 72% | 60% | 84% | ~15 | Best pure SFT |
+| 6 | GRPO v24 | 0.693 ± 0.399 | 64% | 68% | 76% | ~16 | reward_v13 from v2-native |
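The Score column appears to be the mean of the three binary judge dimensions, averaged over prompts. This is inferred from the table arithmetic (for GRPO v21, (84% + 80% + 96%) / 3 ≈ 0.867), not from the source of `eval_functional_judge.py`. A minimal sketch under that assumption, with hypothetical helper names and toy votes:

```python
from statistics import mean, pstdev

def prompt_score(on_topic: bool, uses_guest: bool, probing: bool) -> float:
    """Score for one prompt: fraction of the three binary dimensions that passed."""
    return (int(on_topic) + int(uses_guest) + int(probing)) / 3

def summarize(votes):
    """Return (mean, std) of per-prompt scores. Population std is an assumption here."""
    scores = [prompt_score(*v) for v in votes]
    return mean(scores), pstdev(scores)

# Toy votes for four prompts (hypothetical data, not taken from any eval run)
votes = [
    (True, True, True),
    (True, False, True),
    (False, False, True),
    (True, True, True),
]
score, spread = summarize(votes)
print(round(score, 3), round(spread, 3))
```

With only four possible per-prompt values (0, 1/3, 2/3, 1), the spread is large by construction, which is consistent with the wide ± ranges in the leaderboard.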
+---
+## Group Judge Leaderboard — Cloud + Local (2026-04-05)
+*Eval: `eval_cloud_models.py` + `eval_local_group_judge.py`, 25 held-out prompts, majority vote of Qwen 3.5 27B + Gemma 4 31B per dimension*
+| Rank | Model | Score | on_topic | uses_guest | probing | Avg words | Notes |
+|------|-------|-------|----------|------------|---------|-----------|-------|
+| 🥇 | **GPT-5.4** | **0.867 ± 0.211** | 92% | 68% | 100% | ~29 | Best cloud model |
+| 2 | Gemini 3.1 Pro | 0.840 ± 0.341 | 84% | 80% | 88% | ~25 | |
+| 3 | **GRPO v21 (4B)** | **0.787 ± 0.376** | 80% | 72% | 84% | ~16 | **Tied Opus — 4B model** |
+| 3 | Claude Opus 4.6 | 0.787 ± 0.364 | 76% | 72% | 88% | ~56 | Verbose (3.5× more words) |
+### Single-Judge Comparison (Gemma 4 31B only)
+| Rank | Model | Score | on_topic | uses_guest | probing | Avg words |
+|------|-------|-------|----------|------------|---------|-----------|
+| 1 | GPT-5.4 | 0.973 ± 0.131 | 96% | 96% | 100% | ~29 |
+| 2 | Gemini 3.1 Pro | 0.893 ± 0.244 | 92% | 88% | 88% | ~25 |
+| 3 | Claude Opus 4.6 | 0.880 ± 0.281 | 84% | 84% | 96% | ~56 |
+*Note: Single Gemma judge is more lenient than group judge. Group judge (majority vote) is the canonical cross-model eval.*
+---
+## Non-Thinking Functional Eval Leaderboard (historical, 2026-03-30 – 2026-04-02)
+*Eval: `eval_functional_judge.py`, 25 prompts, no thinking, legacy comparison*
+| Rank | Model | Score | on_topic | uses_guest | probing | Notes |
+|------|-------|-------|----------|------------|---------|-------|
+| 1 | GRPO v13 | 0.773 | 88% | 52% | 92% | Best before native path |
+| 2 | **LoRA v2 native** | **0.760** | 76% | 68% | 80% | Correct GB10 kernel path |
+| 2 | GRPO v12 | 0.760 | 72% | 60% | 96% | |
+| 4 | Base Nemotron 4B | 0.753 | — | — | — | |
+| 5 | LoRA v1 (r=64, 1ep) | 0.733 | 72% | 56% | 92% | First to beat base |
+| 6 | GRPO v20/v21 (non-thinking eval) | 0.720 ± 0.336 | — | — | — | Eval had thinking disabled |
+| 7 | GRPO v14 | 0.707 | — | 52% | — | Reward misaligned |
+| 8 | SFT v5 (LoRA r≈16) | 0.667 | 76% | 60% | 64% ⚠️ | Probing damaged |
+| 9 | Base Nemotron 4B | 0.653 | 68% | 48% | 80% | Older run |
+| 10 | LoRA v2 (filtered data) | 0.640 | — | — | — | No gain from filtering |
+| 11 | GRPO v11 step_100 | 0.613 | — | — | — | info-gain reward failed |
+---
+## Training & Model Lineage — Knowledge Graph
+```
+BASE MODEL
+└── nvidia/NVIDIA-Nemotron-3-Nano-4B
+    Architecture: 38 Mamba-2 + 4 Attention layers
+    Path: models/NVIDIA-Nemotron-3-Nano-4B
+    Config: config_native.json (native transformers 5.3 path, not trust_remote_code)
+    │
+    ├── [SFT Phase 1 — OLD BROKEN PATH — pre-2026-04-02]
+    │   │
+    │   ├── SFT v1 (Unsloth, LoRA r≈16, LR=1e-5, 3ep)
+    │   │   Score: 2.10/5  ← catastrophic
+    │   │
+    │   ├── SFT v5 "full-ft" → actually LoRA r≈16 (Unsloth silently fell back)
+    │   │   Score: 0.667  uses_guest=60% probing=64%⚠️  ← probing damaged
+    │   │
+    │   ├── LoRA v1 (r=64, alpha=128, LR=2e-4, 1ep, 299 steps)
+    │   │   Score: 0.733  uses_guest=56% probing=92%  ← first to beat base
+    │   │   Adapter: lora/sft-lora-v1
+    │   │   │
+    │   │   └── [GRPO Phase 1 — OLD PATH]
+    │   │       ├── GRPO v12 (LR=5e-6, reward_v12, 200 steps)
+    │   │       │   Score: 0.760  uses_guest=60% probing=96%
+    │   │       │
+    │   │       ├── GRPO v13 (LR=2e-5, constant, 300 steps)
+    │   │       │   Score: 0.773  on_topic=88% uses_guest=52%⚠️ probing=92%
+    │   │       │   ← LR too high, uses_guest regressed
+    │   │       │
+    │   │       └── GRPO v14 (reward_v13 geomean, LR=1e-5)
+    │   │           Score: 0.707  ← reward misaligned
+    │   │
+    │   └── LoRA v2 (filtered data, 0% generic openers, 60% real Lex ×6)
+    │       Score: 0.640  uses_guest=48%  ← no gain from data filtering
+    │
+    ├── [SFT Phase 2 — NATIVE PATH — 2026-04-02+]
+    │   │  Fix: use native transformers 5.3 NemotronH (cuda_kernels_forward)
+    │   │  Patches: config validator, MIXER_TYPES, block_type_to_mask for "mlp"
+    │   │  Train PPL: ~1.29 (vs ~21.9 on broken torch_forward path)
+    │   │
+    │   └── LoRA v2 native (r=64, alpha=128, LR=2e-4, 1ep, 12 min)
+    │       Score: 0.760  uses_guest=68% probing=80%   ← best SFT ever
+    │       Adapter: lora/sft-lora-v2-native
+    │       Dataset: data/sft_v5_train.jsonl (4,772 pairs, 697 real + 4,075 generated)
+    │       │
+    │       └── [GRPO Phase 2 — NATIVE + THINKING — 2026-04-02+]
+    │           │  Framework: TRL GRPOTrainer + vLLM colocate
+    │           │  Thinking: enable_thinking=True + reasoning_parser=nemotron_v3
+    │           │  Reward: reward_v12 (uses_guest×probing geomean + lexical bonus)
+    │           │
+    │           ├── GRPO v19 (smoke test, 50 steps, reward_v12)
+    │           │   reward mean: 0.39  IS ratio: 0.025  ← infrastructure verified
+    │           │
+    │           ├── GRPO v20 (200 steps, MAX_NEW_TOKENS=800)
+    │           │   Score (non-thinking eval): 0.720 ± 0.336
+    │           │   Clipping: ~50% of steps had ≥1 clipped completion
+    │           │   ← budget too small, thinking truncated
+    │           │
+    │           ├── ★ GRPO v21 (200 steps, MAX_NEW_TOKENS=1600, MAX_SEQ=3072)
+    │           │   Score (thinking-enabled): 0.867 ± 0.231  ← BEST EVER
+    │           │   on_topic=84% uses_guest=80% probing=96% avg_words=13
+    │           │   Clipping: 32.5% of steps, avg_group_std=0.228  ← Goldilocks
+    │           │   Reward delta: +0.083 (first20=0.631 → last20=0.715)
+    │           │   W&B: lex-grpo-v21-think-colocate-long / z8lcuut7
+    │           │   Adapter: lora/grpo-v21
+    │           │   ONNX: bobber/lex-interviewer-nemotron-4b-grpo-v21  ← deployed
+    │           │
+    │           ├── GRPO v22 (200 steps, MAX_NEW_TOKENS=2560, MAX_SEQ=4096, CLIP_PENALTY=0.10)
+    │           │   Score (thinking-enabled): 0.813 ± 0.314
+    │           │   on_topic=84% uses_guest=72% probing=88% avg_words=10
+    │           │   Clipping: 9.5% of steps  ← less clipping BUT less contrast
+    │           │   Reward delta: +0.013  ← much weaker learning signal
+    │           │   ← larger budget eliminated contrast generators
+    │           │
+    │           ├── [GRPO Phase 3 — reward_v13 — 2026-04-03]
+    │           │   reward_v13: adds meta-spill penalty, generic-opener penalty,
+    │           │               soft overthinking penalty; no reward for long thinking
+    │           │
+    │           ├── GRPO v23 (reward_v13, from grpo-v21, 1600 tok)
+    │           │   Score (thinking-enabled): 0.760 ± 0.371  ← tied base
+    │           │   Reward delta: -0.019  ← reward_v13 too strict from strong start
+    │           │   ← started at local optimum, reward_v13 compressed variance
+    │           │
+    │           └── GRPO v24 (reward_v13, from sft-lora-v2-native, 1600 tok)
+    │               Score (thinking-enabled): 0.693 ± 0.399  ← below base (0.760)
+    │               Reward delta: -0.018
+    │               ← reward_v13 not generating sufficient learning signal
+    │
+    └── [GRPO — OLD FRAMEWORK — pre-native, 2026-03-21–29]
+        GRPO v3 (off-policy, llama.cpp gen + HF train) → gibberish
+        GRPO v7/v8/v11 → various failures, documented in memory/2026-03-*.md
+```
+---
+## Key Technical Details
+### Architecture
+- **Model:** Nemotron-3-Nano-4B (hybrid Mamba-2 + Attention)
+- **Layers:** 38 Mamba-2 + 4 Attention (42 total)
+- **LoRA targets:** q/k/v/o_proj, up/down/gate_proj → only touches the 4 attention layers (standard LoRA)
+- **LoRA config:** r=64, alpha=128, dropout=0, trainable=40.5M/4.01B (1.01%)
+### The GB10 Kernel Fix (2026-04-02)
+- NVIDIA's HF `modeling_nemotron_h.py` had `is_fast_path_available = False` hardcoded
+- This forced naive `torch_forward` SSM scan → PPL ~2126
+- Fix: use native transformers 5.3.0 built-in `NemotronHForCausalLM` with 3 patches:
+  - `configuration_nemotron_h.py`: config validator accepts `"mlp"` block type
+  - `modeling_nemotron_h.py`: add `MIXER_TYPES["mlp"]` and `block_type_to_mask["mlp"]`
+  - `config_native.json`: use `layers_block_type` list instead of `hybrid_override_pattern`
+- Result: train PPL 1.29 avg (vs 21.86 on broken path)
+### Thinking-Enabled Inference Stack
+- vLLM 0.18.0 + `structured_outputs_config={'reasoning_parser': 'nemotron_v3'}`
+- Chat template: `enable_thinking=True` → prompt ends with `<|im_start|>assistant\n<think>\n`
+- Output: `reasoning_content` (thinking) + `content` (answer) parsed by vLLM
+- Eval uses `max_model_len=3072`, `max_tokens=1600`, `gpu_memory_utilization=0.45`
+### GRPO Framework
+- TRL `GRPOTrainer` + vLLM `vllm_mode="colocate"` (single GPU)
+- Group size: 4 completions per prompt
+- KL coefficient: β=0.001
+- Importance sampling correction enabled
+- LR: 5e-6 cosine with linear warmup
+---
+## Why GRPO v21 Succeeded — Formal Summary
+The GRPO learning quality score `first50_std × sign(Δreward) × |Δreward|^0.5` tracks the eval ranking across these four runs:
+| run | eval | first50_std | reward_Δ | GRPO_score |
+|---|---|---|---|---|
+| **v21** | **0.867** | **0.268** | **+0.083** | **0.078** |
+| v22 | 0.813 | 0.156 | +0.013 | 0.018 |
+| v23 | 0.760 | 0.187 | -0.019 | -0.026 |
+| v24 | 0.693 | 0.195 | -0.018 | -0.026 |
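The score can be recomputed directly from the formula stated above. This is a small sketch with a hypothetical function name; the inputs are the rounded `first50_std` and `reward_Δ` values from the table, so recomputed values may not match the tabulated column exactly.

```python
import math

def grpo_learning_score(first50_std: float, reward_delta: float) -> float:
    """first50_std × sign(Δreward) × |Δreward|^0.5, as written in the text above."""
    sign = math.copysign(1.0, reward_delta)
    return first50_std * sign * math.sqrt(abs(reward_delta))

# (first50_std, reward_delta) per run, copied from the table
runs = {
    "v21": (0.268, 0.083),
    "v22": (0.156, 0.013),
    "v23": (0.187, -0.019),
    "v24": (0.195, -0.018),
}

for name, (std, delta) in runs.items():
    print(name, round(grpo_learning_score(std, delta), 3))
```

The square root compresses large reward deltas, so the early group std contributes as much to the ranking as the size of the reward gain.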
+**Root cause:** GRPO learns from intra-group *contrast*, not from correctness.
+```
+GRPO_success = P(≥1 zero per group) ≈ 0.25–0.35
+             × hard_binary_gate (0.0 fail vs 0.7+ pass)
+             × starting_below_optimum
+```
+v21 hit the Goldilocks zone:
+- 1600-token budget → 32.5% clipping rate → intra-group std = 0.228 (high)
+- reward_v12 hard gate: clipped/meta → exactly 0.0; clean questions → 0.7–1.0
+- Starting from `sft-lora-v2-native` (mean reward 0.631 over the first 20 steps) with room to climb to 0.715
+- corr(q_tok, group_std) = +0.70: long-answer steps (meta-spill/clipped) = highest contrast
+Full analysis: `docs/GRPO_V21_SUCCESS_ANALYSIS.md`
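The claim that GRPO learns from intra-group contrast can be illustrated with the standard group-relative advantage, where each reward is normalized against its own group. This is a minimal sketch of the textbook normalization and is not necessarily the exact computation inside TRL's `GRPOTrainer`; the `eps` value and function name are assumptions, and the group size of 4 follows the configuration described above.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-4):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four equally good completions: no contrast, so every advantage is zero
uniform = group_advantages([0.8, 0.8, 0.8, 0.8])

# One clipped completion scored 0.0 next to three clean ones: strong contrast
contrast = group_advantages([0.0, 0.8, 0.9, 0.7])

print(uniform)
print([round(a, 2) for a in contrast])
```

A group with identical rewards contributes no gradient signal regardless of how good the completions are, which is one way to read why a moderate clipping rate helped v21 and a low clipping rate weakened v22.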
+---
+## Reward Function History
+| Version | Signal | Key feature | Result |
+|---|---|---|---|
+| reward_v9 | NLI entailment depth | Hard gates + NLI depth score | Baseline structure |
+| reward_v10 | log-ratio (SFT likelihood) | Style-matching | Circular — rewards Lex-sounding output |
+| reward_v11 | info-gain (novelty × relevance via Qwen-0.5B simulator) | Functional quality | Anti-correlated with uses_guest (-0.098) |
+| **reward_v12** | uses_guest^0.67 × probing^0.33 + lexical_bonus | Batch logit-gap judges | **Best** — used in GRPO v12–v22 |
+| reward_v13 | v12 + meta-spill penalty + generic-opener penalty + soft overthinking penalty | Stricter gates | Compressed contrast, underperformed |
+---
+## Dataset History
+| Dataset | Size | Quality | Used in | Notes |
+|---|---|---|---|---|
+| interview_segments_v2.jsonl | 7,580 | Mixed (36% lazy, 7% sharp) | GRPO v8–v11 | 5:1 lazy:sharp ratio |
+| sft_v5_train.jsonl | 4,772 | All judge_score=1.0 | SFT v5, LoRA v1/v2 | 697 real Lex + 4,075 generated |
+| sft_v6_train.jsonl | 6,933 | Filtered, 0% generic openers | LoRA v2 filtered | No improvement over v5 |
+| held_out_eval.jsonl | 50→25 | ⚠️ contaminated with training | Evals | First 25 only — 50/50 overlap |
+---
+## Eval Infrastructure
+| Script | Purpose | Status |
+|---|---|---|
+| `eval_functional_judge.py` | 3-judge batch eval (three dimensions scored by a single judge model; local models) | Active |
+| `scripts/eval_cloud_models.py` | Cloud model eval with single/group judge | Active |
+| `scripts/eval_local_group_judge.py` | Local model eval with sequential group judge | Active |
+| `eval_functional.py` | Info-gain functional eval | Deprecated |
+| `scripts/eval_v2.py` | 10-score Lex-style eval | Legacy |
+| `scripts/eval_via_server.py` | 5-score via llama.cpp server | Legacy |
+### Running current eval
+```bash
+cd /home/bobber/lex-ft
+source .venv-vllm/bin/activate
+python3 eval_functional_judge.py --enable-thinking \
+  --model base \
+  --model2 /home/bobber/lex-ft/lora/sft-lora-v2-native \
+  --model3 /home/bobber/lex-ft/lora/grpo-v21
+```
+---
+## Legacy Eval Results (pre-functional, for reference only)
+### 10-Score Leaderboard (2026-03-29, ⚠️ circular)
+| Rank | Model | 10-score | Avg Words |
+|---|---|---|---|
+| 🥇 | Base Nemotron 4B | 7.39/10 | 13 |
+| 2 | GRPO v11 step_100 | 7.38/10 | 13 |
+| 3 | SFT v4 Triton (ck200) | 5.36/10 | 50 |
+| 4 | SFT v2 (ck100) | 5.08/10 | 54 |
+### 5-Score Leaderboard (legacy)
+| Rank | Model | 5-score | Words |
+|---|---|---|---|
+| 🥇 | Base Nemotron 4B | 4.35/5 | 59 |
+| 🥈 | GPT-5.4 | 4.30/5 | 52 |
+| 🥉 | Nemotron 30B-A3B Q8 | 4.25/5 | 62 |
+| 4 | SFT v4 Triton (ck200) | 3.80/5 | 50 |
+| 5 | Gemini 3.1 Pro / Claude Opus 4.6 | 3.70/5 | 82–121 |
+| 6 | SFT v2 (ck100) | 3.70/5 | 54 |
+| 7 | Qwen3.5-35B-A3B Q8 | 3.55/5 | 51 |
+| 8 | Qwen3.5-27B Q8 | 3.25/5 | 75 |
+| 9 | Gemini 2.5 Pro | 3.00/5 | 103 |
+| 10 | Nemotron SFT v1 LoRA | 2.10/5 | 25 |
+| 11 | Nemotron SFT v2 LoRA | 2.00/5 | 292 |
+| — | GRPO v3 (off-policy) | N/A | gibberish |
+---
+*Created: 2026-03-19 | Updated: 2026-04-03 21:40 UTC*

docs/FULL_FINETUNE_PLAN_2026-03-20.md ADDED

	@@ -0,0 +1,76 @@

+# Full Fine-Tune Plan — 2026-03-20
+## Why this run
+- The new v2 dataset passed validation.
+- The current bad v2 model came from a **LoRA** run, not a full fine-tune.
+- So the next clean experiment is: **same good-ish data, different adaptation method**.
+## Recommendation
+Run a **full SFT** on Nemotron 4B with Unsloth using the validated dataset:
+- dataset: `data/interview_segments_v2.jsonl`
+- max seq: `1792`
+- bf16 full fine-tune
+- save checkpoint every `50` steps
+- do not trust loss alone
+- evaluate checkpoint-50 before committing to the whole run
+## Conservative starting hyperparameters
+- epochs: `2`
+- per-device batch: `2`
+- grad accumulation: `8`
+- effective batch: `16`
+- learning rate: `1e-5`
+- warmup steps: `25`
+- optimizer: `adamw_torch`
+## Why these numbers
+- `1792` matches the earlier data-driven sequence choice already used in the repo.
+- `1e-5` is intentionally lower than the LoRA run's `2e-4`; full fine-tune should start less aggressively.
+- `2 x 8` is conservative for a first full-model run even on a 128 GB machine.
+- Once step-50 is stable, batch can be increased if utilization is low.
+## Success gate at step 50
+Checkpoint `checkpoint-50` must be evaluated before we trust the run.
+### Minimum bar
+The step-50 model should:
+- clearly beat the bad v2 LoRA behavior
+- ask short questions instead of monologues
+- ideally approach or exceed the 4.35/5 Nemotron 4B base baseline on the existing eval
+### If step-50 is bad
+Stop and adjust one of:
+- lower LR further (for example `5e-6`)
+- shorten target format / tighten generation template
+- reduce epochs
+- inspect whether training strings should exclude some assistant-heavy tails
+## Launch command
+```bash
+cd /home/bobber/lex-ft && \
+WANDB_RUN_NAME=lex-interviewer-full-sft-v1 \
+OUTPUT_DIR=/home/bobber/lex-ft/checkpoints/lex-interviewer-full-sft-v1 \
+FINAL_DIR=/home/bobber/lex-ft/models/lex-interviewer-full-sft-v1 \
+MAX_SEQ_LENGTH=1792 \
+BATCH_SIZE=2 \
+GRAD_ACCUM=8 \
+EPOCHS=2 \
+LR=1e-5 \
+SAVE_STEPS=50 \
+python3 scripts/train_full_sft.py |& tee logs/train_full_sft_v1.log
+```
+## Step-50 eval procedure
+1. wait for `checkpoints/lex-interviewer-full-sft-v1/checkpoint-50`
+2. export / serve that checkpoint with the same inference path used for leaderboard evals
+3. run:
+```bash
+cd /home/bobber/lex-ft && python3 scripts/eval_via_server.py full-sft-step50
+```
+4. if promising, optionally also run v2 eval:
+```bash
+cd /home/bobber/lex-ft && python3 scripts/eval_v2.py full-sft-step50
+```
+## Important caution
+The current `eval_via_server.py` assumes a llama.cpp-compatible server is already running on `127.0.0.1:30000`. So the missing piece at eval time is not the scorer; it is the checkpoint-serving step.

docs/FUNCTIONAL_EVAL_DESIGN.md ADDED

	@@ -0,0 +1,134 @@

+# Functional Eval Design — Info-Gain Based Evaluation
+> **Created:** 2026-03-29
+> **File:** `eval_functional.py`
+> **Motivation:** Replace Lex-style eval (circular) with functional quality measurement
+---
+## The Problem with eval_v3
+`eval_v3` (and v2, v1) score interview questions on Lex Fridman style dimensions:
+- Philosophical depth, curiosity, specificity in Lex's style
+- Judged by Claude Opus calibrated on real Lex transcripts
+**The circularity problem:**
+1. Base Nemotron 4B already knows Lex Fridman from internet pretraining
+2. Base model scores **7.39/10** with zero fine-tuning
+3. Training with log-ratio reward also teaches Lex-like outputs
+4. Both training signal and eval signal measure the same thing: "sounds like Lex"
+5. Any improvement in functional quality is invisible to the eval
+**Evidence:** GRPO v11 step_100 scores 7.38/10 — essentially identical to base. Either nothing was learned, or the eval can't see it. We can't tell which.
+---
+## The Fix: Functional Evaluation
+**Core question:** Does the question unlock new relevant information from the guest?
+```
+info_gain = novelty(sim_response vs guest) × relevance(sim_response vs question)
+```
+### Pipeline
+```
+guest_statement
+     │
+     ▼
+[Policy Model] ──generates──► question
+                                   │
+                                   ▼
+                        [Guest Simulator (Qwen 0.5B)]
+                        prompted: "You just said: {guest}
+                                   Follow-up: {question}
+                                   Answer:"
+                                   │
+                                   ▼
+                               sim_response
+                                   │
+                    ┌──────────────┼──────────────┐
+                    ▼              │              ▼
+              embed(guest)   embed(question)  embed(sim_resp)
+                    │              │              │
+                    └──────────────┼──────────────┘
+                                   │
+                    novelty = 1 - cos(sim_resp, guest)
+                    relevance = cos(sim_resp, question)
+                    info_gain = novelty × relevance
+```
+### Why This Works
+- **Novelty:** If the question just paraphrases what the guest said, sim_response will be similar to guest statement → low novelty → penalized
+- **Relevance:** If the question is off-topic/random, sim_response won't be about it → low relevance → penalized
+- **Product:** Only high if question opens a new relevant angle
+- **No Lex advantage:** Score cares about function, not style. Base model's Lex knowledge is irrelevant.
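The novelty × relevance product can be sketched with plain vectors standing in for sentence embeddings. In the real pipeline the vectors would come from `all-MiniLM-L6-v2`; the toy 3-dimensional vectors and function names below are illustrative assumptions, and whether `eval_functional.py` clamps negative cosines is not stated in this document.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def info_gain(guest, question, sim_response):
    """novelty(sim_response vs guest) × relevance(sim_response vs question)."""
    novelty = 1.0 - cosine(sim_response, guest)
    relevance = cosine(sim_response, question)
    return novelty * relevance

guest = [1.0, 0.0, 0.0]

# Question paraphrases the guest, so the simulated answer repeats the guest
paraphrase = info_gain(guest, question=[1.0, 0.0, 0.0], sim_response=[1.0, 0.0, 0.0])

# Question opens a new angle and the simulated answer follows it
new_angle = info_gain(guest, question=[0.0, 1.0, 0.0], sim_response=[0.0, 1.0, 0.0])

# Simulated answer drifts away from both the guest and the question
off_topic = info_gain(guest, question=[0.0, 1.0, 0.0], sim_response=[0.0, 0.0, 1.0])

print(paraphrase, new_angle, off_topic)
```

Only the new-angle case scores above zero: the paraphrase fails on novelty and the off-topic case fails on relevance, which is the behavior the bullets above describe.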
+### Validation (2026-03-29)
+Real Lex questions vs generic LLM questions on same guest statements:
+- Real Lex mean info_gain: **0.368**
+- Generic mean info_gain: **0.069**
+- Diff: **+0.300**, p≈0
+This is the strongest discrimination signal of all reward variants tested (log-ratio: +1.09 nats, NLI: +0.005).
+---
+## Implementation
+```python
+# eval_functional.py
+BASE_MODEL  = 'models/NVIDIA-Nemotron-3-Nano-4B'   # policy model
+SIM_MODEL   = 'Qwen/Qwen2.5-0.5B-Instruct'          # guest simulator (frozen)
+EMBED_MODEL = 'all-MiniLM-L6-v2'                     # sentence embedder
+HELD_OUT    = 'data/held_out_eval.jsonl'             # 50 held-out prompts
+# Usage:
+# python eval_functional.py --model base
+# python eval_functional.py --model results/grpo_v8/ckpt_step_100
+# python eval_functional.py --model base --model2 results/grpo_v8/ckpt_step_100
+```
+### Running with mamba_ssm Mock
+Requires the wrapper script due to compiled extension issues:
+```bash
+HF_MODULES_CACHE=/tmp/hf_modules \
+python /tmp/run_functional_eval.py \
+  --model base \
+  --model2 results/grpo_v8/ckpt_step_100 \
+  --n 25 \
+  --output results/functional_eval_base_vs_step100.json
+```
+See `CURRENT_STATE_2026-03-29.md` for setup details.
+---
+## Interpreting Results
+| info_gain range | Interpretation |
+|---|---|
+| > 0.20 | Strong — question opens genuinely new, relevant territory |
+| 0.10–0.20 | Moderate — some new ground, partially on-topic |
+| < 0.10 | Weak — either repetitive or tangential |
+Expected baseline (base model): ~0.15–0.25 based on real Lex benchmark (0.368).
+---
+## Limitations
+1. **Guest simulator quality:** Qwen 0.5B is a weak simulator. Might miss subtle angles.
+2. **Embedding space:** MiniLM may not capture deep semantic differences.
+3. **Single simulation:** One sim_response per question — stochastic. Could average over 3-5.
+4. **Naive SSM inference:** Without causal_conv1d, generation is slower and slightly different from trained distribution.
+These are known weaknesses. The metric is still far better than the circular Lex-style eval.
+---
+*Created: 2026-03-29 17:00 UTC*

docs/GRPO_V11_DESIGN.md ADDED

	@@ -0,0 +1,141 @@

+# GRPO v11 Design — Info-Gain Reward + On-Policy vLLM
+> **Status:** 🔄 In progress — 2 runs completed, functional eval pending
+> **Date:** 2026-03-29
+> **Training script:** `grpo_v8_train.py` with `reward_v11.py`
+---
+## Motivation
+Every prior GRPO run (v4–v10) showed positive training rewards but flat or degraded eval scores. Root cause analysis: **the reward was measuring a proxy, not the actual goal.**
+| Reward version | What it measured | Problem |
+|---|---|---|
+| v8 (heuristic) | Structural patterns (?, length, no filler) | Gamed by step 50 |
+| v9 (NLI) | Whether question adds beyond guest | Failed discrimination (p=0.48) |
+| v10 (log-ratio) | How much question sounds like Lex | Stochastic parrot — rewards style not function |
+| **v11 (info-gain)** | **Whether question unlocks new guest info** | **First functional signal** |
+---
+## Reward v11: Info-Gain via Simulated Response
+```
+reward = hard_gates × (info_gain + brevity_bonus + specificity_bonus + diversity_bonus)
+info_gain = novelty(sim_response vs guest) × relevance(sim_response vs question)
+```
+**Hard gates** (reward = 0 if any fires):
+- `?` count > 4 (multi-question dump)
+- Starts with "As Lex Fridman..." (sycophantic framing)
+- Contains stage directions `*(..)*`
+- Starts with "I" (interviewer-centric)
+- < 5 words or > 200 words
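The hard gates listed above can be sketched as a single predicate. This is an illustration, not the code in `reward_v11.py`: the function name, the regex for stage directions, and the reading of "starts with I" as the first-person pronoun (rather than any word beginning with the letter I) are all assumptions.

```python
import re

# Matches stage directions of the form *(...)*
STAGE_DIRECTION = re.compile(r"\*\(.*?\)\*")

def passes_hard_gates(question: str) -> bool:
    """Return False if any hard gate fires; the reward would then be 0."""
    text = question.strip()
    words = text.split()
    first = words[0] if words else ""

    if text.count("?") > 4:                    # multi-question dump
        return False
    if text.startswith("As Lex Fridman"):      # sycophantic framing
        return False
    if STAGE_DIRECTION.search(text):           # stage directions
        return False
    if first == "I" or first.startswith("I'"):  # interviewer-centric opener (assumed reading)
        return False
    if len(words) < 5 or len(words) > 200:     # length bounds
        return False
    return True

print(passes_hard_gates("What made you change your mind about scaling laws?"))
print(passes_hard_gates("I think that is interesting, can you say more?"))
```

Because the gates multiply the rest of the reward, a single failed check zeroes the score regardless of info gain.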
+**Info-gain computation:**
+1. Freeze Qwen2.5-0.5B as guest simulator
+2. Prompt it: "You said: {guest}. Follow-up: {question}. Answer:"
+3. Embed {guest}, {question}, {sim_response} with all-MiniLM-L6-v2
+4. `novelty = 1 - cosine(sim_response, guest)` — did response say something new?
+5. `relevance = cosine(sim_response, question)` — was response on-topic?
+6. `info_gain = novelty × relevance`
+**Key property:** base model's knowledge of Lex Fridman gives **zero advantage**. Scorer measures whether the question *works*, not whether it *sounds right*.
+### Validation Experiments
+| Experiment | Result |
+|---|---|
+| Hard gate calibration on 200 real Lex questions | 0.5% false positive (PASS) |
+| NLI discrimination (real Lex vs generic) | Failed (DROPPED from v11) |
+| Log-ratio signal strength | +1.09 nats diff, p≈0 (but parrot problem) |
+| Info-gain discrimination (real Lex vs generic) | +0.300 diff, p≈0 ✅ **STRONGEST signal** |
+| Guest simulator variance across questions | std=31.7 words ✅ responds meaningfully |
+---
+## Training Runs
+### Run 1 (grpo-v11-run1)
+- **Date:** 2026-03-29 04:57–12:48 UTC (~7.8h)
+- **Base:** `results/grpo_v8/ckpt_step_100` (GRPO v8 best checkpoint)
+- **Steps:** 148
+- **Checkpoints:** saved at step 50 and step 100 → `results/grpo_v8/ckpt_step_50/100`
+- **Reward progression:**
+| Step range | Reward mean | n_pos/32 | Overlong/32 |
+|---|---|---|---|
+| 0–10 | 0.81–1.48 | 10–17 | 6–12 |
+| ~50 | ~1.1–1.3 | 13–16 | 10–14 |
+| ~100 | ~1.1–1.4 | 13–17 | 10–16 |
+| 140–147 | ~1.0–1.4 | 13–18 | 7–14 |
+- Rewards consistently positive throughout
+- No collapse (unlike v8 which collapsed at step 50)
+- `n_pos` trend: slow improvement from ~10 → ~15+ over 148 steps
+- **Killed:** gateway restart at step 148 (between step reward and backward pass)
+### Run 2 (grpo-v11-run2)
+- **Date:** 2026-03-29 12:49–13:17 UTC
+- **Base:** `results/grpo_v8/ckpt_step_100` (same starting point)
+- **Steps:** 108 (steps 106–108 logged)
+- **Reward mean at step 108:** 1.241 (healthy)
+- Best question: *"When you describe the guest's 'age' metaphor as something that can be adjusted, what does that mean"*
+- **Killed:** gateway restart mid-generation (no checkpoint saved)
+---
+## Eval Results (v3 scorer, held-out)
+| Model | Score | Avg Words |
+|---|---|---|
+| Base Nemotron 4B | 7.39/10 | 13 |
+| GRPO v11 step_100 | 7.38/10 | 13 |
+**Interpretation:** These scores are likely misleading due to Lex circularity. See `EVAL_RESULTS.md` for full analysis. Functional eval (info-gain) is the correct measurement.
+---
+## Functional Eval (Running 2026-03-29)
+`eval_functional.py` comparing base vs step_100 on 25 held-out prompts.
+Expected outcomes:
+- **step_100 > base:** GRPO worked, v3 eval was blind to it → launch 500-step full run
+- **base ≈ step_100:** Reward didn't teach functional quality → investigate or try full-weight
+- **Both near 0:** Naive SSM inference degraded quality → need llama.cpp path for eval
+Results: **PENDING**
+---
+## Technical Notes
+### mamba_ssm Mock for Local Inference
+Running Nemotron with `transformers.AutoModelForCausalLM` (vs vLLM) requires mocking `mamba_ssm` because compiled extensions don't build for Python 3.13 or torch 2.9+CUDA13:
+```python
+# /tmp/run_functional_eval.py
+# 1. Mock mamba_ssm in sys.modules before any imports
+# 2. Patch modeling_nemotron_h.py in HF_MODULES_CACHE to use try/except for causal_conv1d
+# 3. Use .venv-vllm (Python 3.12) which has peft + sentence_transformers
+```
+Model falls back to naive SSM (slower but correct). Generation validated with test scripts.
+---
+## Next Steps
+1. Read functional eval results (base vs step_100 info-gain)
+2. Based on results:
+   - If training works → 500-step full run with reward_v11
+   - If training doesn't work → SFT from reward_v11 completions (`gen_sft_data.py`)
+   - If inference is degraded → use llama.cpp for eval (merge LoRA → GGUF)
+3. Consider full-weight training (`--lora-r 0`) — LoRA only touches 4 attention layers out of 42 total
+---
+*Created: 2026-03-29 17:00 UTC*

docs/GRPO_V11_POSTMORTEM.md ADDED

	@@ -0,0 +1,120 @@

+# GRPO v11 Postmortem
+Date: 2026-03-29
+Status: Training failed to beat base on clean eval. Root causes identified.
+---
+## Results Summary
+| Model | Clean eval (held-out) | Leaky eval | Δ vs base |
+|-------|-----------------------|-----------|-----------|
+| Base Nemotron 4B | **7.39/10** | 6.46/10 | — |
+| Step 100 ckpt | **7.38/10** | 6.54/10 | −0.01 |
+**Net result: 100 steps of GRPO v11 = zero measurable improvement on unseen data.**
+The previous "improvements" (6.70, 7.26, 7.28) were eval artifacts.
+---
+## Root Cause 1: Eval Leakage (Masked Everything)
+All 25 eval scenarios were also training prompts (lines 147-155 of `grpo_v8_train.py`).
+The val score was measuring memorization, not generalization.
+Fix: `data/held_out_eval.jsonl` — 50 real transcript prompts, zero overlap with training.
+---
+## Root Cause 2: Off-Policy Drift (Primary Training Failure)
+IS clipped ratio grew from 0.8% (step 0) to 22.1% (step 145).
+By step 100, log_ratio std was ±0.38 and clipped=4.5%.
+| Steps | IS clipped | Effect |
+|-------|-----------|--------|
+| 0–20 | <2% | Gradients valid |
+| 50–80 | 3–5% | Mildly stale |
+| 100–145 | **5–22%** | Gradients increasingly corrupt |
+Cause: vLLM generates from the weights at step N. Training then takes 4 gradient steps
+(micro_batch=8 on 32 completions), so by the fourth gradient step the policy has drifted
+from the generation distribution. Each training step accumulates more drift.
+KL coef 0.15 will slow drift rate but not fix the fundamental off-policy problem.
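The IS clipped ratio tracked above can be sketched as the fraction of samples whose importance ratio falls outside a clip range. The clip width `eps=0.2` and the function name are assumptions for illustration; the actual range used in `grpo_v8_train.py` is not stated in this document.

```python
import math

def clipped_fraction(logp_new, logp_old, eps=0.2):
    """Fraction of samples whose ratio exp(logp_new - logp_old) leaves [1 - eps, 1 + eps]."""
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    clipped = [r < 1.0 - eps or r > 1.0 + eps for r in ratios]
    return sum(clipped) / len(clipped)

# On-policy: training weights equal generation weights, so every ratio is 1.0
on_policy = clipped_fraction([-1.0, -2.0], [-1.0, -2.0])

# Drifted: the second sample's log-prob moved by 0.5 nats, giving a ratio of about 1.65
drifted = clipped_fraction([-1.0, -0.5], [-1.0, -1.0])

print(on_policy, drifted)
```

A log-ratio spread of ±0.38 corresponds to ratios between roughly 0.68 and 1.46, so a growing share of samples leaves a ±0.2 band as the policy moves away from the weights that generated the completions.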
+---
+## Root Cause 3: LoRA Capacity Too Small
+Weight analysis shows delta norms of 0.01–0.03 after 100 steps.
+KL divergence between base and trained first-token distributions: **0.0013** (near zero).
+LoRA r=32 on a hybrid 38-Mamba + 4-Attn architecture:
+- Attention (4 layers): most weight-updated modules, but fewest layers
+- Mamba (38 layers): most changed by norm, but still tiny deltas
+- Effective trainable parameters: ~41M / 4B = 1.03% — insufficient for style transfer
+Semantic similarity between base and trained outputs: **0.541** (meaningful divergence
+in surface form), but KL=0.0013 means the underlying probability distributions are
+virtually identical. The model is sampling differently but from the same distribution.
+---
+## Root Cause 4: Reward Noise
+Each question scored against one sim response from Qwen 0.5B.
+Single-sample info gain has high variance — same question can score very differently
+across runs depending on the sim's stochastic output.
+This adds noise to the reward signal that the gradient cannot overcome.
+---
+## What Did Work
+- **Reward v11 design**: info gain via simulated response genuinely separates
+  good from bad questions (Exp B: diff=+0.300, p≈0)
+- **Hard gates**: calibrated correctly (0.5% FP on real Lex questions)
+- **vLLM sleep/wake**: solves the OOM issue cleanly
+- **Clean eval infrastructure**: `eval_clean.py` + `held_out_eval.jsonl`
+---
+## Three Paths Forward
+### Option A: SFT on Best GRPO Completions (Recommended First)
+- Collect all completions with reward ≥ 3.0 from run1 logs
+- Build SFT dataset: (guest, high-reward question) pairs
+- Train SFT 1 epoch — no RL, no off-policy issues
+- **Tests: is the reward signal capturing real quality?**
+- If SFT on high-reward completions beats base → reward is good, RL loop is the problem
+- If SFT also fails → reward is not capturing what we want
+### Option B: Full-Weight Training
+- `--lora-r 0` (already implemented in grpo_v8_train.py)
+- All 4B params trainable — Mamba layers get full gradient
+- Needs micro_batch=32 (single gradient step per generation batch)
+- Needs true on-policy: generate and immediately train on same weights
+- Risk: memory, instability, slower
+### Option C: On-Policy GRPO (No vLLM)
+- Generate with HF model directly (no vLLM intermediary)
+- IS ratio = 1.0 always — no off-policy drift possible
+- 10-20× slower generation
+- But clean gradient signal throughout
+**Decision**: Run Option A first (fast, cheap, diagnostic).
+If SFT on high-reward data beats base on clean eval → confirms reward is valid,
+proceed to Option B/C for RL. If not → fundamentally rethink the reward.
+---
+## Lessons
+1. **Always use held-out eval** — never train on eval prompts
+2. **Monitor IS clipped ratio** — >5% is a warning, >10% is broken
+3. **KL coef alone doesn't fix off-policy drift** — need structural fix (micro_batch=32 or true on-policy)
+4. **LoRA r=32 is insufficient for this task** — need r≥128 or full-weight for style transfer on 4B hybrid
+5. **Clean eval before any training decision** — the leaky eval wasted ~10 training runs
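Lesson 2 can be turned into an automated check. The following is a minimal sketch, assuming per-token log-probs from the generator and from the learner are available as plain lists, and that "clipped" means the importance ratio falls outside `[1 - eps, 1 + eps]`. The function names and `clip_eps=0.2` are illustrative and are not taken from `grpo_v8_train.py`.

```python
import math


def is_clipped_ratio(learner_logprobs, generator_logprobs, clip_eps=0.2):
    """Fraction of tokens whose importance ratio falls outside [1-eps, 1+eps]."""
    if len(learner_logprobs) != len(generator_logprobs):
        raise ValueError("log-prob sequences must have equal length")
    if not learner_logprobs:
        return 0.0
    clipped = 0
    for lp_new, lp_old in zip(learner_logprobs, generator_logprobs):
        ratio = math.exp(lp_new - lp_old)  # pi_learner / pi_generator for this token
        if ratio < 1.0 - clip_eps or ratio > 1.0 + clip_eps:
            clipped += 1
    return clipped / len(learner_logprobs)


def drift_status(clipped_ratio):
    """Map a clipped ratio onto the thresholds from lesson 2 (>5% warning, >10% broken)."""
    if clipped_ratio > 0.10:
        return "broken"
    if clipped_ratio > 0.05:
        return "warning"
    return "ok"
```

Logging `drift_status(is_clipped_ratio(...))` once per step would surface off-policy drift before it shows up in eval.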

docs/GRPO_V21_PLAN.md ADDED

	@@ -0,0 +1,78 @@

+# GRPO v21 Plan — Longer Thinking Budget + Better Diagnostics (2026-04-03)
+## Why v20 underperformed
+GRPO v20 (`lora/grpo-v20-think2`) finished successfully but evaluated worse than both the base model and LoRA v2 native:
+- Base: **0.753**
+- LoRA v2 native: **0.760**
+- GRPO v20 think2: **0.720**
+The main failure mode was not the reasoning parser anymore — that was fixed. The problem was **truncation**.
+### Verified findings
+- Thinking was correctly enabled via `chat_template_kwargs={"enable_thinking": True}`
+- vLLM needed `structured_outputs_config={"reasoning_parser": "nemotron_v3"}`
+- Nemotron reasoning output often arrives as:
+  - `reasoning text ... </think> final answer`
+  - because the prompt already contains the opening `<think>` token
+- Standalone verification confirmed Nemotron does emit `</think>` and a clean final answer when generation is long enough
+### Truncation analysis (v20)
+Using `logs/grpo_v20_think2.log`:
+- Steps analyzed: **135**
+- Steps where `thinking/token_len_min == 0`: **67** (**49.63%**)
+- Steps with `completions/clipped_ratio > 0`: **68** (**50.37%**)
+- Steps with both conditions: **67**
+- Therefore **100% of zero-thinking steps were clipped steps**
+Conclusion: the 800-token generation cap was too small. Roughly half the GRPO groups had at least one completion clipped before closing the thinking block.
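The counts above can be reproduced from per-step metrics. This is a sketch only: it assumes the log has already been parsed into one dict per step, keyed by the W&B metric names quoted above. The parsing step and the `truncation_summary` helper are hypothetical and not part of the repo.

```python
def truncation_summary(steps):
    """Count zero-thinking and clipped steps in a list of per-step metric dicts."""
    total = len(steps)
    zero_thinking = [s for s in steps if s["thinking/token_len_min"] == 0]
    clipped = [s for s in steps if s["completions/clipped_ratio"] > 0]
    both = [s for s in zero_thinking if s["completions/clipped_ratio"] > 0]

    def pct(part, whole):
        return round(100 * part / whole, 2) if whole else 0.0

    return {
        "steps": total,
        "zero_thinking": len(zero_thinking),
        "clipped": len(clipped),
        "both": len(both),
        "zero_thinking_pct": pct(len(zero_thinking), total),
        "clipped_pct": pct(len(clipped), total),
        "zero_thinking_also_clipped_pct": pct(len(both), len(zero_thinking)),
    }


# Tiny synthetic example (not real log data).
example_steps = [
    {"thinking/token_len_min": 0, "completions/clipped_ratio": 0.25},
    {"thinking/token_len_min": 120, "completions/clipped_ratio": 0.0},
    {"thinking/token_len_min": 0, "completions/clipped_ratio": 0.5},
    {"thinking/token_len_min": 80, "completions/clipped_ratio": 0.25},
]
example_summary = truncation_summary(example_steps)
```

The same percentage arithmetic gives 67 / 135 = 49.63% and 68 / 135 = 50.37%, matching the figures listed above.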
+## v21 changes
+### Generation budget
+- `MAX_NEW_TOKENS`: **1600** (up from 800)
+- `MAX_SEQ`: **3072** (up from 1536)
+- `VLLM_GPU_MEM_UTIL`: **0.35** (up from 0.30)
+### Starting checkpoint
+- Restart from **LoRA v2 native**:
+  - `/home/bobber/lex-ft/lora/sft-lora-v2-native`
+- Do **not** continue from degraded GRPO v20 checkpoint
+### Diagnostics added to W&B
+Keep existing:
+- `thinking/token_len_mean|min|max`
+- `thinking/char_len_mean|min|max`
+- `thinking/present_ratio`
+- question lengths
+New failure-mode-specific metrics:
+- `thinking/nonempty_count`
+- `thinking/closed_tag_ratio`
+- `thinking/missing_ratio`
+- `thinking/unclosed_ratio`
+- `thinking/no_answer_after_close_ratio`
+- `thinking/clipped_and_unclosed_ratio`
+- `answer/nonempty_count`
+- `answer/questionmark_ratio`
+These distinguish:
+1. no thinking at all
+2. thinking started but never closed
+3. think block closed but no final answer
+4. clipped + unclosed specifically
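The four failure modes can be separated with a small classifier. The sketch below assumes, as noted under "Verified findings", that the prompt already contains the opening `<think>` token, so a completion looks like `reasoning ... </think> final answer`. The function name, the label strings, and the choice to treat an empty thinking span as "missing" are assumptions for illustration, not the logic in `scripts/train_grpo_v20.py`.

```python
CLOSE_TAG = "</think>"


def classify_completion(text, clipped=False):
    """Assign one failure-mode label to a completion that follows an open <think>."""
    if CLOSE_TAG not in text:
        if not text.strip():
            return "missing"  # 1. no thinking at all
        # 2. thinking started but never closed (4. if the sample was also clipped)
        return "clipped_and_unclosed" if clipped else "unclosed"
    thinking, answer = text.split(CLOSE_TAG, 1)
    if not thinking.strip():
        return "missing"  # tag closed immediately, nothing was reasoned
    if not answer.strip():
        return "no_answer_after_close"  # 3. think block closed but no final answer
    return "ok"
```

Per-step ratios such as `thinking/unclosed_ratio` would then be the share of completions in a batch that receive the corresponding label.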
+## Files
+- Training script: `scripts/train_grpo_v20.py` (repurposed as v21 launcher)
+- Launcher: `run_grpo_v20.sh`
+- Planned output: `checkpoints/grpo-v21/`, `lora/grpo-v21/`
+## Launch goal
+Run a fresh GRPO job from LoRA v2 native with verified Nemotron reasoning parsing, enough token budget for the think block to close, and W&B metrics that expose the exact failure mode if it regresses.

docs/GRPO_V21_SUCCESS_ANALYSIS.md ADDED

	@@ -0,0 +1,152 @@

+# Why GRPO v21 Succeeded — Formal Analysis
+**Date:** 2026-04-03
+**By:** post-hoc analysis after v21, v22, v23, v24 experiments
+---
+## Summary
+GRPO v21 scored **0.867 ± 0.231** on thinking-enabled functional eval — the best result across all training runs. This document explains precisely why, backed by measured metrics from training logs across all four runs.
+---
+## The Key Metric: GRPO Learning Quality Score
+We define a compound metric that captures what GRPO actually needs to learn:
+```
+GRPO_score = first50_std × sign(reward_delta) × |reward_delta|^0.5
+```
+Where:
+- `first50_std` = average intra-group reward std across the first 50 training steps
+- `reward_delta` = mean(last 20 steps reward) − mean(first 20 steps reward)
+| run | eval | avg_group_std | first50_std | reward_delta | high_var% | n_big_q | GRPO_score |
+|---|---|---|---|---|---|---|---|
+| **v21** | **0.867** | **0.228** | **0.268** | **+0.083** | **48.5** | **63** | **0.0778** |
+| v22 | 0.813 | 0.151 | 0.156 | +0.013 | 27.5 | 18 | 0.0182 |
+| v23 | 0.760 | 0.186 | 0.187 | -0.019 | 36.0 | 45 | -0.0036 |
+| v24 | 0.693 | 0.208 | 0.195 | -0.018 | 43.0 | 62 | -0.0035 |
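+**GRPO_score tracks final eval score closely across these four runs:** it ranks them in the same order as eval, except that v23 and v24, whose scores are nearly tied, swap places.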
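The compound metric is easy to recompute from the table columns. The sketch below implements the formula exactly as written above. The function name is illustrative, and the inputs are the rounded table values, so the last digits can differ from the table.

```python
import math


def grpo_score(first50_std, reward_delta):
    """first50_std * sign(reward_delta) * |reward_delta| ** 0.5"""
    sign = (reward_delta > 0) - (reward_delta < 0)  # -1, 0, or +1
    return first50_std * sign * math.sqrt(abs(reward_delta))


# Rounded inputs copied from the table: run -> (first50_std, reward_delta).
table_inputs = {
    "v21": (0.268, 0.083),
    "v22": (0.156, 0.013),
    "v23": (0.187, -0.019),
    "v24": (0.195, -0.018),
}
scores = {run: grpo_score(std, delta) for run, (std, delta) in table_inputs.items()}
```

With these rounded inputs the v21 and v22 scores land close to the table values. For the two negative-delta runs the formula gives magnitudes near 0.026, while the table entries are closer to `first50_std × reward_delta` without the square root. Which variant was used for those rows is not stated here.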
+---
+## The Underlying Mechanism
+### What generates high group variance?
+The key measurement: `corr(q_tok, group_std) = +0.70` across all runs.
+- When `q_tok > 100` (meta-spill / clipped text leaks into the "answer" region): `avg_std ≈ 0.41`
+- When `q_tok ≤ 30` (clean short questions): `avg_std ≈ 0.13`
+High-variance "learning steps" are exactly the ones where some completions produce meta-spill or clipped garbage (`reward = 0`) while others produce clean interviewer questions (`reward = 0.7–1.0`).
+Pattern example: `[1.0, 0.0, 1.0, 1.0]` → std = 0.500
+This is not a problem. **This is the GRPO learning signal.** Steps with high intra-group contrast produce the largest advantage estimates, which drive the strongest policy gradient updates.
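To make the contrast argument concrete, here is a sketch of group-relative advantages for the pattern above. It assumes advantages are computed as `(reward - group_mean) / (group_std + eps)` with the sample standard deviation (ddof = 1), which is the convention that yields std = 0.500 for `[1.0, 0.0, 1.0, 1.0]`. Whether the training code uses exactly this normalization is an assumption.

```python
import statistics


def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: center on the group mean, scale by the group std."""
    mean = statistics.mean(group_rewards)
    std = statistics.stdev(group_rewards)  # sample std (ddof=1); needs >= 2 rewards
    return [(r - mean) / (std + eps) for r in group_rewards]


contrast_group = [1.0, 0.0, 1.0, 1.0]  # one failed completion, three clean ones
uniform_group = [0.8, 0.8, 0.8, 0.8]   # no contrast at all

contrast_adv = grpo_advantages(contrast_group)
uniform_adv = grpo_advantages(uniform_group)
```

In the contrast group the failed completion gets an advantage of about -1.5 and each clean one about +0.5. In the uniform group every advantage is zero, so that step contributes no policy gradient.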
+---
+## Why v22 Failed Despite Fixing Clipping
+v22 reduced clipping from 32.5% → 9.5%. This seemed like an improvement, but it:
+- Reduced `n_big_q` (steps with high q_tok variance): **63 → 18**
+- Reduced `avg_group_std`: **0.228 → 0.151**
+- Reduced `high_var%`: **48.5% → 27.5%**
+- Reduced `reward_delta`: **+0.083 → +0.013**
+By eliminating natural "contrast generators" (steps where one clipped/failed completion scores zero while three good ones score 0.7–1.0), v22 starved GRPO of learning signal. The run converged, but with a much weaker gradient push.
+**The right amount of clipping was generating useful contrast. Less clipping ≠ better learning.**
+---
+## Why v23 and v24 Failed Despite Similar Group Structure
+v24 had `n_big_q = 62` — essentially the same as v21's 63. The structural conditions looked similar. But:
+| | v21 | v24 |
+|---|---|---|
+| big-q avg_std | **0.410** | 0.386 |
+| big-q avg_mean reward | 0.563 | 0.511 |
+| small-q avg_mean reward | 0.757 | 0.766 |
+| reward_delta | **+0.083** | -0.018 |
+reward_v13's multiplicative soft penalties depressed the top end of the reward distribution:
+- partially-bad completions that reward_v12 zeros → reward_v13 gives 0.6 (reduced contrast)
+- clean good completions get penalized for generic openers → upper bound compressed
+The result: less variance in the high-q_tok steps, and overall reward trending downward. GRPO had no uphill direction to optimize toward.
+---
+## The Formal Model of v21's Success
+v21 succeeded because **all three** of these conditions were simultaneously satisfied:
+### Condition 1: Token budget → right natural failure rate
+- 1600 tokens → ~32% of steps had at least one clipped/failed completion
+- P(at least 1 zero per group of 4) ≈ 0.30 → avg group std ≈ 0.228
+**The Goldilocks zone:**
+- Budget too large (v22): fewer zeros → std collapses → no signal
+- Budget too small (v20): too many zeros → all groups degenerate → no signal
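The step-level failure rate follows from the per-completion failure rate. The sketch below assumes completions in a group fail independently and that the group size is 4. Both are simplifying assumptions, and the helper names are illustrative.

```python
def p_any_failure(per_completion_rate, group_size=4):
    """Probability that at least one completion in a group fails."""
    return 1.0 - (1.0 - per_completion_rate) ** group_size


def per_completion_rate_for(target_step_rate, group_size=4):
    """Per-completion failure rate needed to hit a target step-level rate."""
    return 1.0 - (1.0 - target_step_rate) ** (1.0 / group_size)


# v21's average clipped ratio across all steps is reported as 0.092 in the v22 plan.
v21_step_rate = p_any_failure(0.092)
# Per-completion rates corresponding to the 0.25-0.35 step-level band.
band = (per_completion_rate_for(0.25), per_completion_rate_for(0.35))
```

Under these assumptions a per-completion clip rate of 0.092 implies about 32% of steps with at least one failure, which is consistent with the ~32% reported for v21. The 0.25–0.35 band corresponds to roughly 7–10% of individual completions failing.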
+### Condition 2: reward_v12 creates maximal binary contrast
+- Hard gate: clipped text fails `ends with ?` check → reward = 0.0 exactly
+- Clean questions: reward = 0.7–1.0
+- This binary gap maximizes intra-group advantage
+reward_v13 added partial penalties (×0.10, ×0.35, ×0.70), which smoothed rather than sharpened the contrast. A penalized bad completion at ~0.6 reward creates far less learning signal than a hard-zero.
+### Condition 3: Starting policy at the right distance from the optimum
+- `sft-lora-v2-native` started at reward_first20 = 0.631
+- Clear room to climb to 0.715 → delta = +0.083
+- The reward landscape was still uphill from that starting point
+v23 started from v21 which had already climbed most of the hill. With reward_v13's stricter objective, the starting point was at or above the new optimum → reward went downhill.
+---
+## The Formula
+```
+GRPO_success = Prob(at least 1 zero in group) ≈ 0.25–0.35
+             × hard_binary_reward_gate (zeros are truly zero, goods are 0.7+)
+             × starting_below_optimum (reward can still increase)
+```
+This is not luck. v21 hit a **Goldilocks combination** that maximized GRPO's learning efficiency: enough zeros to create contrast (but not too many), a reward that makes zeros hard and goods strong, and a starting point with room to improve.
+---
+## Implications for Future Runs
+This is fully replicable. To exceed v21:
+1. **Keep a hard binary gate** — zeros when clearly wrong (no partial credit for failures)
+2. **Keep a budget where ~25–33% of steps naturally produce at least one zero**
+3. **Ensure the starting policy has room to improve** (don't start from the current best checkpoint under the same reward)
+4. **Make the reward more discriminative at the top end** — push from 0.7 → 0.95 for genuinely excellent questions, rather than adding penalties at the bottom
+The insight is that GRPO learns best from **contrast**, not from **correctness**. A step where three completions are excellent and one is terrible teaches more than a step where all four are mediocre.
+---
+## Measured Data
+All data extracted from training logs:
+- `/home/bobber/lex-ft/logs/grpo_v21.log`
+- `/home/bobber/lex-ft/logs/grpo_v22.log`
+- `/home/bobber/lex-ft/logs/grpo_v23.log`
+- `/home/bobber/lex-ft/logs/grpo_v24.log`
+Final eval scores from thinking-enabled functional eval using Qwen3.5-4B judges (on_topic × uses_guest × probing) on 25 held-out prompts.

docs/GRPO_V22_PLAN.md ADDED

	@@ -0,0 +1,61 @@

+# GRPO v22 Plan — 2560 Token Budget + Clip Penalty (2026-04-03)
+## Motivation
+GRPO v21 proved that the thinking-enabled path works and that GRPO can beat the base model under thinking-enabled eval:
+- base (thinking-enabled): **0.760**
+- LoRA v2 native (thinking-enabled): **0.707**
+- GRPO v21 (thinking-enabled): **0.867**
+However, v21 still had clipped long-run samples during training.
+## Measured v21 clipping rate
+From the full 200-step training log:
+- Steps with any clipped completion: **65 / 200 = 32.5%**
+- Steps with `thinking/present_ratio < 1.0`: **63 / 200 = 31.5%**
+- Steps with `thinking/closed_tag_ratio < 1.0`: **63 / 200 = 31.5%**
+- 100% of low-present / low-closed steps overlapped with clipping
+- Average clipped ratio across all steps: **0.092**
+- Steps with 50% or more completions clipped: **9 / 200 = 4.5%**
+So v21 improved substantially over v20, but clipping still polluted a meaningful fraction of training steps.
+## v22 changes
+### Larger generation budget
+- `MAX_NEW_TOKENS = 2560`
+- `MAX_SEQ = 4096`
+This is a step up from v21's 1600/3072, without going all the way to 4096 new tokens.
+### Explicit clipped-completion penalty
+- `CLIP_PENALTY = 0.10`
+Current implementation:
+- detect a locally clipped completion when generated token count is within ~4 tokens of `MAX_NEW_TOKENS`
+- subtract **0.10** from the reward for that completion
+- floor at **0.0** (no negative rewards from clipping alone)
+This is intentionally small:
+- enough to discourage runaway / truncated outputs
+- not so large that the penalty swamps the reward of mildly clipped but otherwise strong samples
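The penalty rule described above can be sketched as follows. The constants mirror the values in this plan. The exact margin ("within ~4 tokens") and the function names are assumptions, not the code in `scripts/train_grpo_v20.py`.

```python
MAX_NEW_TOKENS = 2560
CLIP_PENALTY = 0.10
CLIP_MARGIN_TOKENS = 4  # "within ~4 tokens"; the exact margin is an assumption


def is_locally_clipped(num_generated_tokens,
                       max_new_tokens=MAX_NEW_TOKENS,
                       margin=CLIP_MARGIN_TOKENS):
    """Treat a completion as clipped if it ran up against the generation cap."""
    return num_generated_tokens >= max_new_tokens - margin


def apply_clip_penalty(reward, num_generated_tokens):
    """Subtract CLIP_PENALTY from clipped completions, flooring the result at 0.0."""
    if is_locally_clipped(num_generated_tokens):
        return max(0.0, reward - CLIP_PENALTY)
    return reward
```

A clipped completion that would otherwise score 0.8 ends up near 0.7, so it still ranks above a hard-zero failure in its group.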
+### Diagnostics retained / extended
+- `thinking/nonempty_count`
+- `thinking/closed_tag_ratio`
+- `thinking/missing_ratio`
+- `thinking/unclosed_ratio`
+- `thinking/no_answer_after_close_ratio`
+- `thinking/clipped_and_unclosed_ratio`
+- `answer/nonempty_count`
+- `answer/questionmark_ratio`
+- `completion/clipped_ratio_local`
+- `reward/clip_penalty`
+## Files
+- Training script: `scripts/train_grpo_v20.py`
+- Launcher: `run_grpo_v20.sh`
+- Planned output: `checkpoints/grpo-v22/`, `lora/grpo-v22/`

docs/GRPO_V23_PLAN.md ADDED

	@@ -0,0 +1,68 @@

+# GRPO v23 Plan — reward_v13 as a Clean Reward Ablation from GRPO v21
+## Goal
+Test whether `reward_v13` improves the policy **without changing the generation regime** from the current best run.
+This is intentionally a cleaner experiment than v22:
+- **same generation budget as GRPO v21**
+- **start from GRPO v21**
+- **change reward only**
+## Why this design
+GRPO v21 is still the best checkpoint on thinking-enabled eval:
+- GRPO v21: **0.867 ± 0.231**
+GRPO v22 reduced clipping substantially but underperformed v21:
+- GRPO v22: **0.813 ± 0.314**
+So the next question is not "does a larger budget help?" — v22 already answered that imperfectly.
+The next question is:
+> Can a better reward improve on v21 while keeping the same generation regime?
+## Config
+### Start checkpoint
+- `/home/bobber/lex-ft/lora/grpo-v21`
+### Reward
+- `reward_v13`
+### Generation / sequence length
+- `MAX_NEW_TOKENS=1600`
+- `MAX_SEQ=3072`
+### Other key settings
+- `CLIP_PENALTY=0.10`
+- `VLLM_GPU_MEM_UTIL=0.35`
+- `NUM_PROMPTS=1000`
+- `MAX_STEPS=200`
+## What reward_v13 changes
+Relative to v12:
+- keeps `uses_guest` and `probing`
+- adds explicit penalties for:
+  - meta spill (`the user is asking...`, etc.)
+  - generic opener patterns with weak guest anchoring
+  - obvious drift patterns
+  - excessively long hidden thinking
+- does **not** reward longer thinking directly
+## Success criterion
+Primary:
+- beat GRPO v21 on thinking-enabled functional eval
+Secondary:
+- reduce explicit meta-spill frequency
+- avoid the generic-short-question drift seen in v22
+## Evaluation protocol
+After training, run the same thinking-enabled leaderboard:
+- `base`
+- `/home/bobber/lex-ft/lora/sft-lora-v2-native`
+- `/home/bobber/lex-ft/lora/grpo-v23`

docs/GRPO_V24_PLAN.md ADDED

	@@ -0,0 +1,33 @@

+# GRPO v24 Plan — reward_v13, reset to LoRA v2 native start
+## Motivation
+v23 showed that starting from grpo-v21 (already a strong policy) + stricter reward_v13
+compressed reward variance → GRPO had less signal to learn from → eval regressed to base.
+The v21 high score (0.867) came from starting at a *weaker* policy (LoRA v2 native) under
+a *simpler* reward (v12), giving wide exploration space with steep advantage gradients.
+Key insight:
+- v21 was v12's local optimum but is NOT guaranteed to be v13's best starting point
+- Resetting to LoRA v2 native restores exploration room under the new reward geometry
+## Config
+- START_ADAPTER: `/home/bobber/lex-ft/lora/sft-lora-v2-native`
+- REWARD_MODULE: `reward_v13`
+- MAX_NEW_TOKENS: 1600
+- MAX_SEQ: 3072
+- CLIP_PENALTY: 0.10
+- NUM_PROMPTS: 1000
+- MAX_STEPS: 200
+## Hypothesis
+reward_v13 will find a different (and hopefully better) local optimum when given a fresh
+exploration budget from a weaker starting point, rather than inheriting v21's policy, which
+was already locally optimal under the simpler v12 reward.
+## Success criterion
+Beat grpo-v21 (0.867) on thinking-enabled functional eval.

docs/GRPO_V3_POSTMORTEM.md ADDED

	@@ -0,0 +1,221 @@

+# GRPO v3 Postmortem — Off-Policy RL on Hybrid Mamba Architecture
+> **Status:** ❌ FAILED — LoRA merged model generates gibberish despite positive training rewards
+> **Date:** 2026-03-22 to 2026-03-23
+> **Duration:** 9.72 hours (583 min), 125 steps
+> **wandb:** [`lex-interviewer-grpo-v3`](https://wandb.ai/bobber-cheng/lex-interviewer/runs/zm1khost)
+---
+## Architecture
+```
+┌──────────────┐    completions     ┌──────────────┐
+│  llama.cpp   │ ──────────────────>│   Reward v3  │──── rewards
+│  Q4_K_M base │                    │  (heuristic) │
+│  (no LoRA)   │                    └──────────────┘
+└──────────────┘                            │
+       ▲                                    │
+       │ generation                         ▼
+       │ (off-policy)              ┌──────────────┐
+       │                           │  GRPO Loss   │
+       │                           │  advantages  │
+       │                           └──────┬───────┘
+       │                                  │
+       │                                  ▼
+       │                           ┌──────────────┐
+       │                           │  HF Model    │
+       │                           │  + LoRA      │──── gradient update
+       │                           │  (forward    │
+       │                           │   pass only) │
+       └───────── NOT connected ───┘──────────────┘
+```
+**The fatal flaw:** llama.cpp generates completions from the **base model** (no LoRA). The reward scores those base-model completions. But gradient updates go to the **LoRA model**. The LoRA never generates its own text — it only assigns log-probabilities to text produced by a different model.
+---
+## Training Config
+| Parameter | Value |
+|-----------|-------|
+| Base model | Nemotron-3-Nano-4B (hybrid: 38 Mamba-2 + 4 Attention) |
+| Generation model | Q4_K_M GGUF via llama.cpp (2.9 GB) |
+| Training model | HF + LoRA (rank 32, 0.38% trainable params) |
+| LoRA targets | All layers: q/k/v/o_proj (attention), in/out_proj (Mamba), up/down_proj (MLP) |
+| NUM_PROMPTS | 500 |
+| NUM_GENERATIONS | 8 per prompt |
+| MAX_COMPLETION_TOKENS | 800 |
+| BATCH_SIZE | 4 |
+| GRAD_ACCUM | 4 (effective batch = 16) |
+| Learning rate | 5e-5 |
+| Beta (KL coefficient) | 0.04 |
+| Thinking mode | Enabled |
+---
+## Results
+### Training Metrics (Misleading)
+| Steps | Avg Reward | % Positive |
+|-------|-----------|------------|
+| 26–35 | -0.201 | 0% (cold start) |
+| 36–45 | +0.115 | 90% |
+| 46–55 | +0.179 | 90% |
+| 56–65 | +0.174 | 100% |
+| 66–75 | +0.220 | 100% |
+| 100–125 | +0.087 to +0.359 | mixed |
+Training rewards looked healthy. **But these rewards measured the base model's generation quality, not the LoRA's.**
+### Actual Generation Quality (Ground Truth)
+**Base model (Q4_K_M, llama.cpp):**
+> *"When you say 'failed three times,' what did it feel like in your body? Was it the weight of the third crash?"*
+**LoRA merged model (step 50, GGUF):**
+> `"I think and to "I" "I" "you are, "I" "i ", I, "i. "i" to "i"`
+The LoRA produces complete gibberish. Every checkpoint tested (step 25, 50) showed the same pattern.
+---
+## Root Cause Analysis: 6 Critical Gaps
+### 🔴 Gap 1: Off-Policy Generation (Fatal)
+llama.cpp generates from the **base model** — no LoRA weights applied. The LoRA model is told "make this text more likely" for text it would never produce itself. Over 125 steps, the LoRA drifts into a completely different distribution.
+**In standard GRPO:** The same model that generates completions is the one that gets updated. Policy improvement is on-policy — the model improves at generating text it actually produces.
+**In our setup:** The generator (llama.cpp base) and the learner (HF + LoRA) are two completely different models. The LoRA has no way to self-correct because it never sees what its own generations look like.
+### 🔴 Gap 2: No Reference Policy / KL Divergence (Fatal)
+Real GRPO computes `KL(π_θ || π_ref)` — the divergence between the current policy and a frozen reference copy. This acts as an anchor, preventing the policy from drifting too far.
+The script uses `-β * mean(log_probs)` as a "KL proxy." This is just a confidence regularizer — it penalizes the model for being too certain about anything, but it does NOT measure how far the LoRA has drifted from the base. Without a proper reference, there's no upper bound on divergence.
+**What should have been done:** Store the initial LoRA log-probs (or base model log-probs) as a frozen reference and compute `log_probs_current - log_probs_ref` as the KL penalty.
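The difference between the proxy and a reference-anchored penalty shows up in a toy comparison. The sketch below uses plain lists of per-token log-probs. The function names are illustrative, and `beta=0.04` is the coefficient from the training config. The mean of `log π_θ − log π_ref` is a sample estimate of the KL only when the tokens were sampled from the current policy, which is itself an assumption that Gap 1 violates.

```python
def confidence_proxy(logprobs, beta=0.04):
    """The v3 'KL proxy': -beta * mean(log_probs). Uses only the current policy."""
    return -beta * sum(logprobs) / len(logprobs)


def reference_kl_penalty(current_logprobs, ref_logprobs, beta=0.04):
    """beta * mean(log pi_theta - log pi_ref) over the sampled tokens."""
    if len(current_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have equal length")
    diffs = [c - r for c, r in zip(current_logprobs, ref_logprobs)]
    return beta * sum(diffs) / len(diffs)


ref = [-1.2, -0.7, -2.0, -0.3]       # frozen reference log-probs (toy values)
no_drift = list(ref)                  # policy identical to the reference
drifted = [-0.2, -0.1, -0.5, -0.05]   # policy now far more confident on these tokens
```

With `no_drift` the reference penalty is exactly zero, while the proxy is still nonzero. The proxy cannot tell an undrifted policy from a drifted one, because it never looks at the reference.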
+### 🔴 Gap 3: Token Truncation to 512 (Severe)
+```python
+inputs = tokenizer(full_text, return_tensors='pt', truncation=True, max_length=512)
+prompt_ids = tokenizer(pt, return_tensors='pt', truncation=True, max_length=384)
+```
+Generation produces up to 800 tokens, but the log-prob forward pass truncates to 512 total (with prompt eating ~384). That leaves ~128 tokens of completion visible to the gradient — but the reward was computed on the full 800-token completion.
+**Effect:** The LoRA optimizes the first ~128 tokens while the reward evaluates the full response. The end of each completion is a gradient-free zone. This creates a systematic mismatch between what's rewarded and what's learned.
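The size of the gradient-free zone is simple arithmetic. The sketch below assumes right-side truncation of the concatenated prompt and completion, with lengths measured in tokens. The helper name is hypothetical.

```python
def gradient_coverage(prompt_tokens, completion_tokens, max_total=512):
    """Return (visible completion tokens, fraction of the completion they cover)."""
    visible = max(0, min(completion_tokens, max_total - prompt_tokens))
    fraction = (visible / completion_tokens) if completion_tokens else 0.0
    return visible, fraction


# Worst case from the text: a 384-token prompt and a full 800-token completion.
visible, fraction = gradient_coverage(prompt_tokens=384, completion_tokens=800)
```

In that worst case only 128 of 800 completion tokens (16%) receive any gradient, while the reward still reflects all 800.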
+### 🟡 Gap 4: Architecture Mismatch — Mamba Layers Unchanged (Fundamental)
+NemotronH has 42 layers: 38 Mamba-2 + 4 Attention. LoRA can add adapters to linear projections (`in_proj`, `out_proj` in Mamba; `q/k/v/o_proj` in Attention), but:
+- Mamba's core behavior comes from its **SSM recurrence** (`A_log`, `D`, `conv1d`, `dt_bias`) — these are NOT touched by LoRA
+- The `in_proj`/`out_proj` adapters on Mamba layers only change the input/output projections, not the state-space dynamics
+- The 4 attention layers are the only layers where LoRA meaningfully alters the computation
+So LoRA modifies the periphery of 38 layers and the core of 4 layers. In generation, the 38 untouched Mamba layers still dominate the sequence modeling. The 4 LoRA-modified attention layers can't override what the Mamba layers decide.
+### 🟡 Gap 5: Token-Level vs Sequence-Level Credit Assignment
+The reward function scores the complete visible response (sequence-level: "is this a good Lex question?"). But the loss distributes this reward equally across all tokens:
+```python
+token_lp = torch.gather(log_probs, 1, completion_ids.unsqueeze(1)).squeeze(1)
+log_probs_list.append(token_lp.mean())  # <-- equal weight per token
+```
+In a +0.7 completion, every token gets the same +0.7 advantage — including "The", "a", filler words, punctuation. No credit assignment to the tokens that actually made the response good (the question itself, specific word choices).
+This is standard in sequence-level GRPO, but combined with the other gaps, it means the gradient signal is diffuse and noisy.
+### 🟡 Gap 6: Thinking Content in Training, Not in Reward
+```python
+completion_texts.append(c['raw'])  # includes <think>...</think>
+```
+The LoRA computes log-probs over the full completion including `<think>` tags. But the reward function calls `strip_think()` and scores only the visible content. This means:
+- 400+ tokens of thinking content get gradient signal proportional to the visible-content reward
+- The LoRA is optimizing thinking patterns based on a reward that doesn't evaluate thinking
+- A completion with brilliant thinking but weak visible output would train the LoRA to produce that exact thinking — even though the reward says it's bad
+---
+## Compound Effect
+These gaps compound multiplicatively:
+1. **Gaps 1+2:** Off-policy generation + no KL anchor = unconstrained divergence. The LoRA wanders freely.
+2. **Gap 3:** Truncation means the gradient operates on different tokens than the reward evaluates.
+3. **Gap 6:** Thinking tokens dilute the gradient signal — most of the sequence is thinking, but the reward ignores it.
+4. **Gap 4:** Even if the gradients were correct, LoRA can barely influence generation on this architecture.
+5. **Gap 5:** Even the correct tokens get uniform gradient weight, with no credit assignment.
+The result: gradients that are wrong (off-policy), on the wrong tokens (truncated), for the wrong content (thinking), with the wrong weights (uniform), updating the wrong layers (LoRA on Mamba periphery). The only surprise is that it took 50 steps to produce gibberish rather than happening immediately.
+---
+## GGUF Converter Fix (Side Quest)
+During evaluation, we discovered a bug in llama.cpp's `convert_hf_to_gguf.py` for NemotronH non-MoE models:
+**Root cause:** `NemotronHConfig` in HuggingFace transformers defines MoE default values (`num_experts_per_tok=2`, `moe_intermediate_size=7688`). The converter uses `AutoConfig.from_pretrained()` which loads these defaults even for non-MoE models. The converter then detects `"num_experts_per_tok" in hparams` and sets `architecture = nemotron_h_moe` instead of `nemotron_h`.
+**Fix applied:**
+1. Override MoE defaults in `config.json`: set `num_experts_per_tok=0`
+2. Patch converter to check `hparams.get("num_experts_per_tok", 0) > 0` instead of `"num_experts_per_tok" in hparams`
+3. Guard the MoE metadata section in `set_gguf_parameters` with `if "num_experts_per_tok" in self.hparams:`
+**Files modified:** `/home/bobber/llama.cpp/convert_hf_to_gguf.py` (backup at `.bak`)
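The difference between the two detection checks can be shown on plain dicts. This is an illustration only: the helper names are hypothetical, the dicts stand in for the loaded `hparams`, and none of this is the actual code in `convert_hf_to_gguf.py`.

```python
def moe_key_present(hparams):
    """Original check: treats the model as MoE whenever the key exists."""
    return "num_experts_per_tok" in hparams


def moe_active(hparams):
    """Patched check: treats the model as MoE only if experts are actually used."""
    return hparams.get("num_experts_per_tok", 0) > 0


# A dense model whose config picked up the class-level MoE defaults.
dense_with_defaults = {"num_experts_per_tok": 2, "moe_intermediate_size": 7688}
# The same model after overriding the default in config.json (fix step 1).
dense_overridden = {"num_experts_per_tok": 0, "moe_intermediate_size": 7688}
```

The patched check alone does not help while the inherited default is still 2, which is why the `config.json` override and the converter patch are applied together.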
+---
+## What Would Fix This
+### Option A: On-Policy GRPO
+Periodically merge LoRA → GGUF → use merged model for generation. Expensive (merge + convert every N steps) but fixes Gap 1.
+### Option B: SFT on Curated Completions
+Use the ~1000 high-reward completions from training as supervised data. No off-policy gap, no reward mismatch. The base model generated excellent Lex-style questions — just teach the LoRA to produce them via SFT.
+### Option C: Ship the Base Model
+The base Nemotron 4B already scores **4.35/5** — better than GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro on this task. System prompt alone achieves top-tier interviewer behavior. Fine-tuning may not be necessary.
+---
+## Files
+| File | Purpose |
+|------|---------|
+| `scripts/train_grpo_v3.py` | Training script (the one with all the gaps) |
+| `scripts/run_grpo_v3.sh` | Detached wrapper (setsid+nohup) |
+| `scripts/reward_v3.py` | Reward function (heuristic scoring) |
+| `logs/grpo_v3_completions.jsonl` | All logged completions + rewards |
+| `logs/run_grpo_v3.log` | Training log |
+| `models/lex-interviewer-grpo-lora-v3` | Final LoRA adapter (125 steps) |
+| `models/lex-interviewer-grpo-lora-v3-step{25,50,75,100,125}` | Checkpoints |
+---
+## Lessons for Future RL Projects
+1. **Off-policy RL requires explicit policy constraints.** Without proper KL divergence from a reference, the policy will diverge. The "KL proxy" shortcut doesn't work.
+2. **Validate generation quality mid-training.** We didn't merge and test the LoRA until step 50. If we had tested at step 10, we'd have caught the gibberish 40 steps earlier.
+3. **Training rewards ≠ model quality** when the generator and learner are different models. Always eval the actual learner, not just the reward signal.
+4. **Hybrid architectures (Mamba + Attention) are hostile to LoRA-based RL.** The SSM dynamics that control generation are not reachable by LoRA adapters. Full parameter updates or architecture-aware training is needed.
+5. **Token truncation in log-prob computation is a silent killer.** If max_length in training doesn't match max_tokens in generation, the gradient literally operates on the wrong tokens.
+6. **Thinking mode adds complexity to RL.** When the reward ignores thinking content but the gradient doesn't, you're training the model to optimize a signal it can't see.
+---
+*Created: 2026-03-23*

docs/GRPO_V4_DESIGN.md ADDED

	@@ -0,0 +1,423 @@

+# GRPO v4 Design — llama.cpp Generation + PyTorch LoRA Training
+> Date: 2026-03-24
+> Status: Design (paper version)
+## Overview
+On-policy GRPO using llama.cpp for generation and PyTorch for LoRA training. This avoids the SM 12.1 toolchain gaps that blocked NeMo RL, vLLM-based approaches, and custom CUDA kernels on GB10.
+## Why This Works on GB10
+| Component | Framework | SM 12.1 Status |
+|-----------|-----------|----------------|
+| Generation | llama.cpp | ✅ Native support |
+| Log-probs | llama.cpp | ✅ `n_probs` request field on `/completion` |
+| Model loading | llama.cpp | ✅ GGUF + LoRA adapter |
+| Training forward | PyTorch (torch_forward) | ✅ Pure PyTorch, no custom CUDA |
+| Training backward | PyTorch autograd | ✅ Standard |
+| LoRA conversion | Python (tensor I/O) | ✅ No GPU needed |
+## Architecture
+```
+┌─────────────────────────────────────────────────────┐
+│                    GRPO Training Loop                │
+│                                                      │
+│  ┌──────────────┐    ┌──────────────┐               │
+│  │  llama.cpp   │    │   PyTorch    │               │
+│  │  server      │    │   Training   │               │
+│  │              │    │              │               │
+│  │ base.gguf    │    │ HF model     │               │
+│  │ + lora.gguf  │    │ + LoRA       │               │
+│  │              │    │              │               │
+│  │ Generate     │    │ Forward      │               │
+│  │ completions  │───▶│ Compute loss │               │
+│  │ + log-probs  │    │ Backward     │               │
+│  │              │    │ Update LoRA  │               │
+│  │              │◀───│ Write GGUF   │               │
+│  │ Hot-swap     │    │              │               │
+│  │ /lora-adapt  │    │              │               │
+│  └──────────────┘    └──────────────┘               │
+│         GPU                  GPU                     │
+│    (unified memory)    (unified memory)              │
+└─────────────────────────────────────────────────────┘
+```
+## Training Loop (Pseudocode)
+```python
+# === INITIALIZATION ===
+# 1. Start llama.cpp server with base model + initial LoRA (identity)
+server = start_llama_server(
+    model="models/nemotron-f16.gguf",  # BF16 GGUF — matches PyTorch training precision
+    lora="lora/current.gguf",   # starts as identity (zeros)
+    port=8080,
+    ctx_size=1024,
+    n_gpu_layers=-1,            # all layers on GPU
+)
+# 2. Load HF model for training (torch_forward, no CUDA kernels)
+hf_model = AutoModelForCausalLM.from_pretrained(
+    "models/NVIDIA-Nemotron-3-Nano-4B",
+    torch_dtype=torch.bfloat16,
+    trust_remote_code=True,
+    device_map="cuda",
+)
+# 3. Apply LoRA to HF model
+lora_model = apply_lora(hf_model, rank=64, alpha=256, target="all_linear")
+# 4. Load reference model log-probs (frozen copy for KL penalty)
+# Option A: Use llama.cpp with base model only (no LoRA) for reference
+# Option B: Cache reference log-probs per prompt (cheaper)
+# 5. Reward function
+def reward_fn(prompt, completion):
+    """Score interviewer quality: question relevance, brevity, not lecturing."""
+    score = 0.0
+    # ... (custom scoring logic)
+    return score
+# 6. Optimizer
+optimizer = torch.optim.AdamW(lora_model.lora_parameters(), lr=5e-6)
+# === TRAINING LOOP ===
+for step in range(num_steps):
+    log.info(f"=== Step {step} ===")
+    # ── Phase 1: Sample prompts ──
+    prompts = sample_prompts(dataset, batch_size=num_prompts)
+    log.info(f"Sampled {len(prompts)} prompts")
+    # ── Phase 2: Generate completions (llama.cpp, on-policy) ──
+    completions = []
+    gen_logprobs = []
+    for prompt in prompts:
+        for _ in range(num_generations_per_prompt):
+            result = llama_generate(
+                server,
+                prompt=prompt,
+                max_tokens=800,
+                temperature=1.0,
+                logprobs=True,
+            )
+            completions.append(result.text)
+            gen_logprobs.append(result.logprobs)
+    log.info(f"Generated {len(completions)} completions, "
+             f"avg length: {mean_tokens(completions)}")
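+    # ── Phase 3: Compute rewards ──
+    # Repeat each prompt once per generation so it lines up with `completions`.
+    repeated_prompts = [p for p in prompts for _ in range(num_generations_per_prompt)]
+    rewards = [reward_fn(p, c) for p, c in zip(repeated_prompts, completions)]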
+    log.info(f"Rewards: mean={mean(rewards):.3f}, std={std(rewards):.3f}")
+    # ── Phase 4: Compute advantages (GRPO) ──
+    # Group by prompt, normalize within group
+    advantages = compute_grpo_advantages(
+        rewards,
+        num_prompts=num_prompts,
+        num_generations=num_generations_per_prompt,
+    )
+    # ── Phase 5: Compute training log-probs (PyTorch, current policy) ──
+    # Gradients must be tracked here: Phase 7 backpropagates through these log-probs.
+    train_logprobs = []
+    for prompt, completion in zip(repeated_prompts, completions):
+        tokens = tokenizer.encode(prompt + completion)
+        logits = lora_model(tokens)
+        lp = compute_token_logprobs(logits, tokens)
+        train_logprobs.append(lp)
+    # ── Phase 6: Compute reference log-probs (base model, no LoRA) ──
+    ref_logprobs = []
+    for prompt, completion in zip(repeated_prompts, completions):
+        result = llama_logprobs_only(
+            server_ref,  # separate server or base model without LoRA
+            prompt=prompt,
+            completion=completion,
+        )
+        ref_logprobs.append(result)
+    # ── Phase 7: GRPO policy gradient loss ──
+    optimizer.zero_grad()
+    total_loss = 0.0
+    for i in range(len(completions)):
+        # Policy ratio
+        ratio = torch.exp(train_logprobs[i] - gen_logprobs[i].detach())
+        # Clipped ratio
+        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
+        # Policy loss (PPO-style with GRPO advantages)
+        pg_loss = -torch.min(ratio * advantages[i], clipped * advantages[i])
+        # KL penalty (against reference policy)
+        kl = train_logprobs[i] - ref_logprobs[i]
+        kl_loss = kl_coef * kl
+        loss = (pg_loss + kl_loss).mean()
+        total_loss += loss
+    total_loss = total_loss / len(completions)
+    total_loss.backward()
+    grad_norm = torch.nn.utils.clip_grad_norm_(
+        lora_model.lora_parameters(), max_norm=1.0
+    )
+    optimizer.step()
+    log.info(f"Loss: {total_loss.item():.4f}, Grad norm: {grad_norm:.4f}")
+    # ── Phase 8: Sync LoRA to llama.cpp ──
+    write_lora_gguf(lora_model, "lora/current.gguf")
+    hot_swap_lora(server, "lora/current.gguf")
+    log.info(f"LoRA synced to llama.cpp server")
+    # ── Phase 9: Periodic eval ──
+    if step % eval_every == 0:
+        eval_score = run_eval(server, eval_prompts)
+        log.info(f"Eval score: {eval_score:.2f}/5")
+```
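The loop above calls `compute_grpo_advantages` without defining it. A minimal sketch of group-relative normalization follows. It assumes rewards are ordered prompt-major (all generations for prompt 0, then prompt 1, and so on) and uses the population standard deviation; the `eps` value and the plain-list interface are illustrative choices, not the project's implementation.

```python
from statistics import mean, pstdev


def compute_grpo_advantages(rewards, num_prompts, num_generations, eps=1e-8):
    """Normalize each reward against the other completions of the same prompt."""
    assert len(rewards) == num_prompts * num_generations
    advantages = []
    for p in range(num_prompts):
        group = rewards[p * num_generations:(p + 1) * num_generations]
        mu = mean(group)
        sigma = pstdev(group)  # population std of this prompt's group
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages


# Two prompts, two generations each. The second group has identical rewards,
# so its advantages are zero and it contributes no policy gradient.
adv = compute_grpo_advantages([1.0, 3.0, 2.0, 2.0], num_prompts=2, num_generations=2)
print(adv)
```

A group whose completions all score the same yields zero advantages, which is why a coarse reward with few generations per prompt can produce steps with no learning signal.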
+## Key Design Decisions
+### 1. Log-prob Source Consistency
+**Problem:** GRPO needs log-probs from the *generating* policy. If llama.cpp generates and PyTorch computes training log-probs, they must be consistent enough for the ratio `π_new/π_old` to be meaningful.
+**Approach:**
+- Generation log-probs: from llama.cpp (during generation, free)
+- Training log-probs: from PyTorch (needs gradient, must use PyTorch)
+- Reference log-probs: from llama.cpp base model (no LoRA)
+**Precision:** Use a BF16 GGUF for generation (not Q8) to remove the base-model precision mismatch. llama.cpp and PyTorch then operate on identical BF16 weights, so log-probs should be numerically close. Minor differences may still arise from different attention implementations (llama.cpp's custom kernels vs PyTorch's eager implementation), but these should be small enough for stable training.
+**Validation:** Gap test #1 measures the actual divergence between llama.cpp BF16 and PyTorch BF16 log-probs.
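One way the divergence measurement could look, as a sketch: compare per-token log-probs from the two backends for the same token sequence and report the implied importance ratios. The function name, the returned keys, and the sample values are hypothetical; the actual gap test may measure something different.

```python
import math


def logprob_divergence(gen_logprobs, train_logprobs):
    """Compare per-token log-probs from two backends for the same token sequence."""
    assert len(gen_logprobs) == len(train_logprobs) and gen_logprobs
    diffs = [t - g for g, t in zip(gen_logprobs, train_logprobs)]
    ratios = [math.exp(d) for d in diffs]  # per-token importance ratio
    return {
        "mean_abs_diff": sum(abs(d) for d in diffs) / len(diffs),
        "max_abs_diff": max(abs(d) for d in diffs),
        "min_ratio": min(ratios),
        "max_ratio": max(ratios),
    }


# Hypothetical values, for illustration only.
stats = logprob_divergence([-1.20, -0.50, -2.10], [-1.25, -0.48, -2.10])
print(stats)
```

If `min_ratio` and `max_ratio` sit well inside the clipping range `[1 - clip_eps, 1 + clip_eps]` before any training step, backend mismatch alone is unlikely to trigger clipping.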
+### 2. Reference Policy
+**Options:**
+- **Option A (simple):** Start a second llama.cpp server with just the base model (no LoRA). Compute reference log-probs via API. Costs ~8 GB extra VRAM.
+- **Option B (cheaper):** Use PyTorch base model (before LoRA) for reference. Run once at start, cache reference log-probs per training sample.
+- **Option C (NeMo RL approach):** Don't use a separate reference model. Compute KL from the ratio of current vs generation-time log-probs.
+Recommend **Option A** — simplest, 8 GB is affordable on 130 GB.
+### 3. Generation Speed vs On-Policy Correctness
+llama.cpp BF16 generates at ~30 tok/s (slower than Q8's ~60 tok/s due to larger model). For a batch of 16 prompts × 8 generations × 800 tokens:
+- Total tokens: 102,400
+- Time: ~57 minutes per step
+This is slow but **correct** (on-policy, precision-matched). The GRPO v3 failure was fundamentally about off-policy, not speed.
+**Speedup options (if needed later):**
+- Reduce `num_generations_per_prompt` from 8 to 4 (~28 min/step)
+- Reduce `max_tokens` from 800 to 400 (~28 min/step)
+- Batch multiple prompts via llama.cpp server (concurrent requests)
+- Use Q8 if log-prob divergence test shows it's acceptable (2x faster)
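The step-time figures above follow from simple arithmetic. The sketch below reproduces them, assuming strictly sequential generation and that every completion uses its full token budget (a worst case); the function name is illustrative.

```python
def generation_minutes(num_prompts, num_generations, max_tokens, tokens_per_second):
    """Worst-case sequential generation time per step, in minutes."""
    total_tokens = num_prompts * num_generations * max_tokens
    return total_tokens / tokens_per_second / 60


baseline = generation_minutes(16, 8, 800, 30)    # BF16, full budget
fewer_gens = generation_minutes(16, 4, 800, 30)  # halve generations per prompt
shorter = generation_minutes(16, 8, 400, 30)     # halve max_tokens
q8 = generation_minutes(16, 8, 800, 60)          # Q8 throughput
print(f"{baseline:.1f} {fewer_gens:.1f} {shorter:.1f} {q8:.1f}")  # 56.9 28.4 28.4 28.4
```

Each of the three single-factor changes halves the step time; concurrent requests would reduce it further but are not modeled here.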
+### 4. LoRA Architecture
+Based on NemotronH architecture (42 layers: 38 Mamba-2 + 4 Attention):
+```yaml
+LoRA config:
+  rank: 64
+  alpha: 256  # scaling = 4x
+  target_modules: all linear layers
+  # On GB10, torch_forward gives gradients through ALL layers
+  # Unlike CUDA kernel path, no need to exclude out_proj
+  exclude_modules: []
+```
+**Memory estimate:**
+- LoRA params (rank 64, all linear): ~100 MB
+- Optimizer states (AdamW): ~200 MB
+- HF model (BF16): ~8 GB
+- llama.cpp base (BF16 GGUF): ~8 GB
+- llama.cpp reference (BF16 GGUF): ~8 GB (if Option A)
+- Activations/gradients: ~5-10 GB
+- **Total: ~30-35 GB for the items above (budget ~35-40 GB with headroom) | Free: ~90 GB** ← very comfortable
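Summing the itemized estimates is a quick way to sanity-check the budget; any gap between the sum and a quoted total can be read as headroom. The dictionary keys below are illustrative labels, and `UNIFIED_MEMORY_GB` is the 130 GB figure quoted earlier in this document.

```python
# Itemized estimates from the list above, in GB, as (low, high) bounds.
BUDGET_GB = {
    "lora_params": (0.1, 0.1),
    "optimizer_states": (0.2, 0.2),
    "hf_model_bf16": (8.0, 8.0),
    "llama_cpp_policy": (8.0, 8.0),
    "llama_cpp_reference": (8.0, 8.0),  # only if Option A is used
    "activations_gradients": (5.0, 10.0),
}
UNIFIED_MEMORY_GB = 130


def total_range(budget):
    """Return (low, high) totals across all budget items."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


low, high = total_range(BUDGET_GB)
print(f"items: {low:.1f}-{high:.1f} GB, free: at least {UNIFIED_MEMORY_GB - high:.1f} GB")
```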
+### 5. Reward Function (Lex Interviewer)
+```python
+def interviewer_reward(prompt: str, completion: str) -> float:
+    """
+    Score how well the completion acts as an interviewer.
+    Criteria:
+    1. Asks a question (not lectures) → binary check
+    2. Question relevance to conversation → semantic similarity
+    3. Brevity (good interviewers are concise) → length penalty
+    4. Follow-up quality (builds on previous answer) → coherence
+    5. Not repetitive → novelty check
+    """
+    score = 0.0
+    # Must contain a question
+    if "?" in completion:
+        score += 1.0
+    # Brevity bonus (under 100 words is good)
+    words = len(completion.split())
+    if words < 50:
+        score += 1.0
+    elif words < 100:
+        score += 0.5
+    elif words > 200:
+        score -= 1.0  # lecturing penalty
+    # Not starting with a template pattern ("用户问" is Chinese for "the user asks")
+    if not completion.strip().startswith(("用户问", "User asks", "The user")):
+        score += 1.0
+    # Ends with a question (interviewer should be prompting, not concluding)
+    if completion.strip().endswith("?"):
+        score += 1.0
+    # Quality check via heuristic (or LLM judge, more expensive)
+    score += heuristic_quality_score(prompt, completion)  # 0-1
+    # The lecturing penalty can push the sum below zero, so clamp to 0-5
+    return max(0.0, min(score, 5.0))
+```
+### 6. Logging & Tracing
+Every step logs:
+```
+Step | Gen Time | Train Time | Sync Time | Loss | Grad Norm | Reward (mean/std) | Eval Score
+  0  |  28m     |    45s     |    2s     | 2.31 |   0.42    |   2.1 / 0.8       |   4.35
+  1  |  28m     |    45s     |    2s     | 2.15 |   0.38    |   2.4 / 0.7       |   --
+  ...
+```
+W&B integration for:
+- Reward distribution per step
+- Loss curves
+- Generation samples (text)
+- Log-prob divergence (llama.cpp vs PyTorch)
+- LoRA weight norms per layer
+- Eval score trend
+## Gaps to Fill Before Implementation
+1. **Log-prob consistency test:** Generate with llama.cpp BF16 GGUF, compute log-probs in both llama.cpp and PyTorch BF16. Measure divergence. Both use identical precision — divergence should be minimal.
+2. **LoRA ↔ GGUF conversion:** Write `write_lora_gguf()` function. Verify llama.cpp loads the adapter and output changes.
+3. **HF model loading with torch_forward:** Confirm model loads and trains without causal_conv1d (it should fall back, but need to verify loss is reasonable and gradients flow through all 42 layers).
+4. **Reward function tuning:** The heuristic reward above is a starting point. May need LLM-as-judge for quality scoring.
+## Comparison with Previous Attempts
+| | GRPO v3 (failed) | NeMo RL (blocked) | GRPO v4 (this) |
+|---|---|---|---|
+| On-policy | ❌ llama.cpp ≠ HF | ✅ vLLM = HF | ✅ llama.cpp + LoRA = HF + LoRA |
+| KL reference | ❌ None | ✅ Built-in | ✅ Base model via llama.cpp |
+| LoRA coverage | 4/42 layers | All layers | All layers (torch_forward) |
+| SM 12.1 | Partial | ❌ Blocked | ✅ All components work |
+| Gen speed | ~60 tok/s (Q8) | ~3.5 tok/s (vLLM) | ~30 tok/s (BF16 GGUF) |
+| Complexity | Custom script | Full framework | Custom script (simpler) |
+## Design Q&A
+### Q: Can llama.cpp read HF model + LoRA directly without GGUF conversion?
+**No.** llama.cpp only reads GGUF format. But the base model is already converted (once, never changes). Only the LoRA adapter needs converting each step — ~100 MB of tensors, takes seconds. No restart needed (hot-swap via `/lora-adapters` endpoint or file overwrite).
+### Q: llama.cpp uses Q8 — should PyTorch training use Q8 too?
+**No — and this is why we use BF16 GGUF for generation.** Training requires full precision (BF16) for gradient computation. You can't backprop through quantized weights. If we generated with Q8 but trained with BF16, we'd have a base model precision mismatch: the Q8 and BF16 models produce different log-probs for the same input, making the policy ratio `π_new/π_old` noisy. This is a softer version of the same off-policy problem that killed v3.
+**Fix:** Use **BF16 GGUF** (`nemotron-f16.gguf`, ~8 GB) for generation instead of Q8 (~4.5 GB). We have 130 GB unified memory — 8 GB is nothing. Generation is slower (~30 tok/s vs 60 tok/s) but eliminates the precision mismatch entirely. Both llama.cpp generation and PyTorch training see identical BF16 weights + LoRA.
+### Q: llama.cpp supports LoRA fine-tuning (`llama-finetune`). Why do we need PyTorch?
+Two reasons:
+1. **llama-finetune crashes on NemotronH** — buffer size computation bug for Mamba-2 architecture (tested, produces near-max uint64 allocation). This is a llama.cpp bug, not fundamental.
+2. **llama-finetune only does SFT, not GRPO.** There's no RL algorithm in llama.cpp. GRPO requires computing advantages, policy ratios, KL penalties — that's non-trivial math that PyTorch autograd gives us for free.
+If someone fixed llama-finetune for NemotronH AND implemented GRPO in C++, we wouldn't need PyTorch at all. But that's significant development effort vs using PyTorch's existing autograd.
+## Risk Assessment
+| Risk | Severity | Mitigation |
+|------|----------|------------|
+| Log-prob divergence (llama.cpp vs PyTorch) | Low | Both BF16; test measures actual gap |
+| torch_forward training is slow | Low | Only for backward pass, not generation |
+| Reward function too noisy | Medium | Start with simple heuristics, iterate |
+| LoRA GGUF hot-swap has bugs | Low | Test with llama.cpp first |
+| 28 min/step too slow for iteration | Low | Reduce batch size for early experiments |
+---
+## Gap Test Results (2026-03-24)
+**All 3 tests pass ✅**
+### Critical Fix: mamba_ssm Mock
+NVIDIA's custom `modeling_nemotron_h.py` hard-requires `mamba_ssm` at import time. Since the CUDA kernels don't build on SM 12.1, we mock `mamba_ssm` at import time with:
+1. Fake module hierarchy (`mamba_ssm.ops.triton.layernorm_gated`, etc.)
+2. Real `rmsnorm_fn` implementation copied from [mamba_ssm reference code](https://github.com/state-spaces/mamba/blob/main/mamba_ssm/ops/triton/layernorm_gated.py) — pure PyTorch, uses `einops`
+3. `selective_state_update = None` → model falls back to `torch_forward`
+Also uninstalled broken `causal_conv1d` and `mamba_ssm` packages that had non-functional `.so` files.
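The import-time mocking mechanics can be sketched with the standard library alone. This shows only how a fake module hierarchy is registered; it does not supply a correct `rmsnorm_fn`. The helper name `install_fake_package` and the module path used for `selective_state_update` are assumptions for illustration.

```python
import sys
import types


def install_fake_package(dotted_name, **attrs):
    """Register dotted_name and all its parents in sys.modules; set attrs on the leaf."""
    parts = dotted_name.split(".")
    for i in range(1, len(parts) + 1):
        name = ".".join(parts[:i])
        if name not in sys.modules:
            module = types.ModuleType(name)
            module.__path__ = []  # mark as a package so submodule imports resolve
            sys.modules[name] = module
            if i > 1:
                setattr(sys.modules[".".join(parts[:i - 1])], parts[i - 1], module)
    leaf = sys.modules[dotted_name]
    for key, value in attrs.items():
        setattr(leaf, key, value)
    return leaf


def placeholder_rmsnorm_fn(*args, **kwargs):
    # Placeholder only: a real mock needs a numerically verified implementation.
    raise NotImplementedError("supply a verified gated RMSNorm here")


install_fake_package("mamba_ssm.ops.triton.layernorm_gated", rmsnorm_fn=placeholder_rmsnorm_fn)
# Assumed module path; setting the kernel to None is what triggers the fallback path.
install_fake_package("mamba_ssm.ops.triton.selective_state_update", selective_state_update=None)

from mamba_ssm.ops.triton.layernorm_gated import rmsnorm_fn  # resolves to the placeholder
```

A mock like this only makes the import succeed. Whether the model then computes the right numbers has to be checked separately, for example by comparing generations against a known-good backend.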
+### Test Results
+| Test | Result | Key Metric |
+|------|--------|------------|
+| Log-prob consistency | ✅ PASS | Perplexity = 4,595 (realistic) |
+| torch_forward training | ✅ PASS | Loss = 8.45, 92 LoRA layers with gradients |
+| LoRA GGUF tooling | ✅ PASS | llama.cpp `--lora` + `convert_lora_to_gguf.py` ready |
+### Memory Usage (Test 2)
+- Model load: 15.08 GB (BF16, all 42 layers)
+- After LoRA: 8.29 GB (PEFT wraps efficiently)
+- Peak (forward): 11.69 GB
+- After optimizer step: 8.98 GB
+- LoRA params: 82.7M / 4.06B total (2.04%)
+## Smoke Test — PASSED ✅ (2026-03-24 05:32 UTC)
+Full end-to-end pipeline validated with minimal parameters:
+```
+Config: 1 step, 1 prompt, 2 generations, 32 max tokens
+```
+### Pipeline Execution
+| Phase | Time | Status |
+|-------|------|--------|
+| Model load (HF + LoRA) | 72s | ✅ 82.7M trainable params, 8.29 GB GPU |
+| llama.cpp servers start (policy + reference) | 12s | ✅ Both on ports 8090/8091 |
+| Generate 2 completions | 5.3s | ✅ avg 28 words |
+| Compute rewards | <1s | ✅ mean=2.0/5 |
+| Forward + backward | 21.4s | ✅ |
+| LoRA save + GGUF convert | 8.0s | ✅ adapter saved, GGUF written |
+| Server restart with new LoRA | 6s | ✅ new adapter loaded after restart |
+| **Total step** | **41s** | ✅ |
+### Sample Output
+```
+Prompt: "I wouldn't say that I do know. In normal social circumstances,
+         we have evolved mechanisms to keep pe..."
+Completion: "And also because the internet is an asynchronous medium, so
+            there's no way to see what I'm going to do..."
+Reward: 2.0/5
+```
+### Notes
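+- Loss = 0.0 because both completions received the same reward, so the GRPO advantage is 0 and there is no gradient. This is expected with only 2 generations for a single prompt and a coarse heuristic reward. Real training with more generations and diverse prompts should produce non-zero gradients.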
+- LoRA GGUF conversion uses `convert_lora_to_gguf.py` from llama.cpp — works correctly.
+- Server restart cycle (stop → start with new LoRA) takes ~6s. Could optimize with hot-swap API later.
+---
+*Next: Real training run with proper batch sizes.*

docs/GRPO_V4_POSTMORTEM.md ADDED

	@@ -0,0 +1,129 @@

+# GRPO v4 Postmortem — PyTorch Mock Training is Broken
+> Date: 2026-03-24
+> Status: FAILED — pivoting to Option 2 (SFT Distillation)
+## What Happened
+GRPO v4 ran 24 steps over 6 hours. Loss oscillated wildly (-6.5 to +8.5), reward stayed flat (~1.5-2.0), grad norms exploded (up to 1692). Diagnostic testing revealed the root cause.
+## Root Cause: The rmsnorm_fn Mock Produces a Broken Model
+### The Dependency Chain
+NVIDIA's custom `modeling_nemotron_h.py` imports at module level:
+```python
+from mamba_ssm.ops.triton.layernorm_gated import rmsnorm_fn  # line 63
+from causal_conv1d import causal_conv1d_fn, causal_conv1d_update  # line 68
+```
+Both `mamba_ssm` and `causal_conv1d` contain CUDA/Triton kernels that don't compile on GB10 (SM 12.1). So we mocked them.
+### What rmsnorm_fn Does
+`rmsnorm_fn` is a **gated RMSNorm** — not standard RMSNorm. It's called inside every Mamba-2 mixer block:
+```python
+# Inside NemotronHMamba2Mixer.torch_forward():
+scan_output = self.norm(y, gate)  # calls rmsnorm_fn(x=y, z=gate, ...)
+```
+Parameters:
+- `x`: hidden states to normalize
+- `weight`: learnable scale
+- `z`: gate tensor (multiplied via SiLU activation)
+- `group_size`: sub-group normalization (key for Mamba-2)
+- `norm_before_gate`: whether to norm then gate, or gate then norm
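To make the parameter semantics concrete, here is a plain-Python sketch of a gated RMSNorm over a single hidden vector. It follows the ordering and grouping rules described above as I understand them; it is not the Triton kernel or the project's mock, and it ignores batching, dtype handling, and bias.

```python
import math


def silu(v):
    return v / (1.0 + math.exp(-v))


def gated_rmsnorm(x, weight, z=None, eps=1e-6, group_size=None, norm_before_gate=True):
    """Gated RMSNorm over one hidden vector (plain floats, no batching)."""
    n = len(x)
    group_size = group_size or n
    assert n % group_size == 0 and len(weight) == n
    if z is not None and not norm_before_gate:
        x = [xi * silu(zi) for xi, zi in zip(x, z)]  # gate first, then normalize
    out = []
    for start in range(0, n, group_size):
        group = x[start:start + group_size]
        rstd = 1.0 / math.sqrt(sum(v * v for v in group) / group_size + eps)
        out.extend(v * rstd for v in group)  # each group gets its own RMS
    out = [o * w for o, w in zip(out, weight)]
    if z is not None and norm_before_gate:
        out = [o * silu(zi) for o, zi in zip(out, z)]  # normalize first, then gate
    return out


x, w, z = [1.0, 2.0, 3.0, 4.0], [1.0] * 4, [0.5, -0.5, 1.0, 2.0]
per_group = gated_rmsnorm(x, w, z, group_size=2, norm_before_gate=False)
whole = gated_rmsnorm(x, w, z, group_size=None, norm_before_gate=False)
print(per_group != whole)  # grouping changes the result
```

Both `group_size` and `norm_before_gate` change the output, so a mock that gets either one wrong would still run without errors while producing different hidden states.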
+### Why the Mock Failed
+Our mock used the reference implementation from `mamba_ssm/ops/triton/layernorm_gated.py`:
+```python
+def rmsnorm_fn(x, weight, z=None, eps=1e-6, group_size=None, norm_before_gate=True):
+    # ... pure PyTorch reimplementation
+```
+**Diagnostic results:**
+- Model generates **50 spaces** instead of text (greedy decoding)
+- Per-token log-probs: -4 to -12 (expected: -1 to -3)
+- Top-5 predictions at every position: punctuation (`?`, `,`, `:`)
+- "Consciousness" ranked **2833rd** when it should be top-10
+- CE loss: 8.51 (expected: ~3-4 for a working 4B model)
+The mock produces outputs that are **structurally different** from the real Triton kernel. Possible causes:
+1. **Numerical precision**: Triton kernel uses fused FP32 accumulation; our mock does sequential PyTorch ops in BF16→FP32→BF16
+2. **Group normalization details**: The `group_size` parameter interacts with head dimensions in Mamba-2; slight mishandling corrupts the hidden state
+3. **Stateful Mamba dependencies**: The Mamba-2 scan operation accumulates state across positions. Small norm errors compound across the 38 Mamba layers
+### Why This Matters
+The mock produced a model that **looks like it loads correctly** (all 263 weights, 42 layers, no errors) but **behaves like a random model**. This is worse than a crash — it silently trains on garbage signal for hours.
+The GRPO training loop was architecturally correct (on-policy generation via llama.cpp, LoRA update, GGUF sync). But the PyTorch training model couldn't produce meaningful log-probs, so the policy gradient signal was noise.
+## What We Tried (Chronological)
+| Attempt | Blocker |
+|---------|---------|
+| NeMo RL bare metal | `deep_ep` won't compile on aarch64 |
+| NeMo RL Docker | Triton JIT fails on SM 12.1 |
+| PyTorch + trust_remote_code | `causal_conv1d` broken .so |
+| Uninstall broken packages | `mamba_ssm` hard-required at import |
+| Native transformers (no trust_remote_code) | `-` in pattern = `mlp`, but `mlp` not a valid block type |
+| Mock mamba_ssm | Model loads but outputs garbage |
+| GRPO v4 with mock | 24 steps, loss oscillates, no convergence |
+**Every path to "train Nemotron-4B in PyTorch on GB10" is blocked by the same root cause: the Mamba-2 CUDA kernels don't work on SM 12.1, and there's no correct pure-PyTorch fallback.**
+## Options Going Forward
+### Option 1: Different Model Entirely
+Pick a standard transformer that works on GB10 (Qwen3.5-4B, Llama, Gemma). No Mamba-2, no custom kernels. Full GRPO via PyTorch + vLLM (which works on the routangseng venv with torch 2.10 + vLLM 0.18.0).
+- **Pro**: Clean slate, everything works natively
+- **Con**: Lose the Nemotron-4B base quality (4.35/5 eval score)
+- **Effort**: Medium (rewrite reward function, set up training)
+### Option 2: SFT Distillation via llama.cpp ⭐ RECOMMENDED
+Use Nemotron-4B as a **data generator** (via llama.cpp, which works perfectly). Generate thousands of high-quality interviewer completions. SFT a standard transformer on those completions.
+- **Pro**: Uses Nemotron's strength (generation) without fighting its training
+- **Pro**: No mocks, no CUDA kernel issues
+- **Pro**: SFT is proven on GB10 (we've done it 5 times before)
+- **Con**: Still need a trainable student model
+- **Effort**: Low-Medium
+Pipeline:
+```
+Nemotron-4B (llama.cpp) → Generate 10K+ interview completions
+    → Filter by reward function (keep score ≥ 4)
+    → SFT train a standard model (Qwen3.5-4B or similar)
+    → Evaluate → iterate
+```
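The filtering stage of the pipeline above could be as small as the following sketch. The `toy_reward` scorer and the sample data are stand-ins for illustration; the threshold of 4 comes from the pipeline description.

```python
def filter_by_reward(samples, reward_fn, threshold=4.0):
    """Keep (prompt, completion) pairs whose reward meets the threshold."""
    kept = []
    for prompt, completion in samples:
        score = reward_fn(prompt, completion)
        if score >= threshold:
            kept.append({"prompt": prompt, "completion": completion, "reward": score})
    return kept


def toy_reward(prompt, completion):
    # Stand-in scorer for illustration; the real reward function is defined elsewhere.
    return 4.5 if completion.strip().endswith("?") else 1.0


samples = [
    ("Guest talks about sleep.", "What changed when you started tracking it?"),
    ("Guest talks about sleep.", "Sleep is important for many reasons."),
]
sft_data = filter_by_reward(samples, toy_reward)
print(len(sft_data))
```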
+### Option 3: Cloud GPU for Training
+Rent A100/H100 ($1-3/hr), train with NeMo RL properly, deploy on GB10.
+- **Pro**: All CUDA kernels work on SM 8.x/9.x
+- **Pro**: NeMo RL designed for this exact use case
+- **Con**: Extra cost, setup overhead
+- **Effort**: Medium
+### Option 4: Ship Base Model
+Nemotron-4B base already scores 4.35/5. Ship it.
+- **Pro**: Zero effort, already best-in-class
+- **Con**: No fine-tuning, no persona customization
+- **Effort**: None
+## Key Lessons
+1. **A model that loads without errors can still be completely broken.** Weight loading success ≠ correct inference. Always validate generation quality before training.
+2. **Mocking CUDA kernels is dangerous.** The mock passed all unit tests (forward works, backward works, gradients flow) but produced garbage outputs. Numerical correctness requires exact implementation matching.
+3. **Know when to stop fighting the hardware.** We spent 24+ hours across multiple approaches trying to make Nemotron-4B trainable on GB10. The hardware (SM 12.1) is simply too new for the Mamba-2 CUDA ecosystem. Use tools that work.
+4. **Separate generation from training.** llama.cpp handles Nemotron-4B inference perfectly. PyTorch can't. Use each tool for what it's good at.
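Lesson 1 can be turned into a cheap pre-training gate. The sketch below computes cross-entropy loss and perplexity from per-token log-probs on known-good text and flags models outside a plausible range. The threshold of 4.0 is an assumption based on the "~3-4" expectation stated above, and the sample log-probs are hypothetical.

```python
import math


def check_model_sanity(token_logprobs, max_ce_loss=4.0):
    """Return (ok, ce_loss, perplexity) from per-token log-probs on known-good text."""
    assert token_logprobs, "need at least one token"
    ce_loss = -sum(token_logprobs) / len(token_logprobs)
    perplexity = math.exp(ce_loss)
    return ce_loss <= max_ce_loss, ce_loss, perplexity


# Hypothetical log-probs: one set in the range expected of a healthy model,
# one in the range reported for the broken mock.
healthy = check_model_sanity([-1.5, -2.0, -2.5, -3.0])
broken = check_model_sanity([-4.0, -8.0, -10.0, -12.0])
print(healthy[0], broken[0])
```

Running a check of this kind before the first training step, together with a short greedy generation, would have surfaced the broken forward pass within minutes.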

docs/GRPO_V7_DESIGN.md ADDED

	@@ -0,0 +1,164 @@

+# GRPO v7 Design
+Created: 2026-03-26
+Status: Run10 live (W&B `wejcyyj5`)
+---
+## Problem Statement
+GRPO v3–v6 all failed due to the same root bug: **gen_max_tokens was far too small**.
+Nemotron 4B thinks before answering. A typical generation:
+- Thinking phase: 600–1100 tokens
+- Answer (interviewer question): 20–100 tokens
+- **Minimum needed: ~700–1200 tokens total**
+Previous runs used 300–800 tokens. Result: model was always cut off mid-think, `</think>` never appeared, `strip_thinking()` returned empty string, reward=0 on every completion. GRPO had nothing to learn from.
+---
+## v7 Changes vs v6
+| Aspect | v6 | v7 |
+|---|---|---|
+| Architecture | Full fine-tune (42 layers) | LoRA r=32 (4 attn layers, 1.03% params) |
+| gen_max_tokens | 300–800 | 4000 |
+| Generation | HF model (broken P(</think>)) | llama.cpp server |
+| strip_thinking | regex remove | extract after </think> |
+| LR schedule | flat 1e-5 | linear warmup 30 steps → cosine |
+| kl_coef | 0.02–0.1 | 0.05 |
+| generations | 4–8 | 4 |
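The v7 learning-rate schedule (linear warmup for 30 steps, then cosine decay) could be expressed as below. The 500-step horizon matches the run length mentioned later in this document; `peak_lr`, `min_lr`, and the exact warmup formula are assumptions, since the table does not state them.

```python
import math


def lr_at_step(step, peak_lr, warmup_steps=30, total_steps=500, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


peak = 1e-5  # placeholder value, not the run's actual peak
print([lr_at_step(s, peak) for s in (0, 29, 30, 265, 500)])
```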
+---
+## Generation: llama.cpp Server
+The HF model in Python has P(`</think>`) ≈ 0 from the first decode token. This is a kernel problem, not a sampling problem: NVIDIA trained the model with the `selective_state_update` CUDA kernel, and without the real kernel the Python fallback produces wrong SSM states.
+llama.cpp with the GGUF model works correctly via `reasoning_format: "deepseek"` + `thinking_forced_open: true`. The server returns:
+- `content`: the visible answer (the interviewer question)
+- `reasoning_content`: the thinking chain
+No stripping needed — llama.cpp handles the `</think>` boundary.
+**Off-policy gap:** GGUF model generates, HF model trains. Same off-policy issue as v3. Acceptable for now; importance sampling correction is planned for v8.
+---
+## strip_thinking() — Correct Implementation
+```python
+def strip_thinking(text: str) -> str:
+    """Extract only the visible answer after </think>."""
+    if '</think>' in text:
+        idx = text.index('</think>') + len('</think>')
+        return text[idx:].strip()
+    elif '<think>' in text:
+        # Truncated mid-think — discard entirely
+        return ''
+    else:
+        # No thinking block — use as-is
+        return text.strip()
+```
+---
+## Reward Function
+```python
+def reward_fn(response: str, prompt_context: str) -> float:
+    """
+    5-component heuristic matching the eval_v2 scorer.
+    Scores are clamped to 0–5 (the components below sum to at most 4.5),
+    targeting Lex Fridman interviewer style.
+    """
+    score = 0.0
+    # 1. Is it a question? (required)
+    if not response.strip().endswith('?'):
+        return 0.0
+    score += 1.0
+    # 2. Single question only (no multi-part)
+    if response.count('?') > 2:
+        score -= 0.5
+    elif response.count('?') == 1:
+        score += 0.5
+    # 3. Length (20–60 words ideal)
+    words = len(response.split())
+    if 20 <= words <= 60:
+        score += 1.5
+    elif words < 20 or words > 120:
+        score += 0.0
+    else:
+        score += 0.75
+    # 4. Topical relevance (overlap with guest answer)
+    guest_words = set(prompt_context.lower().split()) - STOPWORDS
+    resp_words = set(response.lower().split())
+    overlap = len(guest_words & resp_words) / max(len(guest_words), 1)
+    score += min(overlap * 2.0, 1.5)
+    # 5. No filler openers ("That's fascinating", "Great point")
+    filler = ['that\'s fascinating', 'great point', 'interesting', 'that\'s a great',
+              'what a', 'absolutely', 'certainly', 'of course']
+    if any(f in response.lower()[:50] for f in filler):
+        score -= 0.5
+    return max(0.0, min(score, 5.0))
+```
+---
+## GRPO Loss
+```python
+# Advantages: normalize within completion group
+rewards = torch.tensor([reward_fn(strip_thinking(g), context) for g in generations])
+mean_r, std_r = rewards.mean(), rewards.std()
+advantages = (rewards - mean_r) / (std_r + 1e-8)
+# Skip step if all rewards identical (zero advantage = zero gradient = mode drift)
+if std_r < 0.01:
+    continue
+# Policy log-probs on visible tokens only (after </think>)
+log_probs = compute_log_probs(model, tokenized_completion, visible_mask)
+ref_log_probs = compute_log_probs(ref_model, tokenized_completion, visible_mask)
+# GRPO + KL
+policy_loss = -(advantages * log_probs).mean()
+kl_loss = (log_probs - ref_log_probs).mean()
+loss = policy_loss + kl_coef * kl_loss
+```
+**Reference model:** LoRA disabled via `model.disable_adapter_layers()` — same memory footprint as policy, no second copy needed.
+---
+## run10 Results (first 2 steps)
+```
+Step 0: reward mean=1.928 std=1.932
+  [gen 1] "When you say the self disappears during meditation,
+           how does that experience feel different from ordinary states of mind?" → 4.04/5
+Step 1: reward mean=0.980 std=1.697
+  [gen 3] "What do you think the most profound consequence of unregulated genetic
+           selection for intelligence might be, beyond the obvious?" → 3.92/5
+```
+Step times: ~89s (loss) + ~73s (generation) = ~162s/step.
+ETA for 500 steps: ~22h from launch (completes ~20:00 UTC Mar 27).
+---
+## Planned: GRPO v8 (on-policy)
+With `mamba-ssm` now installed (real CUDA kernel), the HF model should produce correct P(`</think>`). v8 will:
+1. Generate with HF model directly (fully on-policy)
+2. Drop llama.cpp server dependency
+3. Add importance sampling correction for any remaining distribution gap
+4. Mix 10% `enable_thinking=False` samples (NVIDIA recipe)

docs/GRPO_V8_CHANGES.md ADDED

	@@ -0,0 +1,55 @@

+# GRPO v8 — Speed Optimizations (2026-03-27)
+## Problem
+v8 train phase averaging **183s/step** due to 64 sequential forward passes.
+## Root Cause
+Both sweep 1 (ref logprobs) and sweep 2 (policy logprobs) called
+`compute_logprobs_batched(model, tok, [single_item], ...)` in a loop —
+wasting 32 kernel launch cycles each sweep.
+## Fixes Applied
+### 1. Sweep 1 — micro-batched (was 1-at-a-time)
+```python
+# Before: 32 sequential single-item calls
+for (pt, tids, _, _) in flat_items:
+    rlp = compute_logprobs_batched(model, tok, [pt], [tids], ...)
+# After: 4 micro-batches of 8
+for mb_start in range(0, n_total, cfg.micro_batch_size):
+    mb = flat_items[mb_start : mb_start + cfg.micro_batch_size]
+    rlps = compute_logprobs_batched(model, tok, [it[0] for it in mb], ...)
+```
+**Why not full batch=32?** Padding to max_seq_len=5000 would need 41GB activations
+(vs 81GB free). At typical lengths (~800 tokens), batch=8 → 1.7GB only.
+### 2. Sweep 2 — micro-batched (was 1-at-a-time)
+Same pattern. Gradients accumulate correctly across micro-batches since
+each item's loss is divided by n_total before `.backward()`.
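The claim that accumulation stays correct can be checked with plain numbers. In this sketch each per-item loss is scaled by `1 / n_total` before being summed per micro-batch, which is the scaling described above; the helper names and stand-in losses are illustrative.

```python
def micro_batches(items, micro_batch_size):
    """Yield consecutive slices of at most micro_batch_size items."""
    for start in range(0, len(items), micro_batch_size):
        yield items[start:start + micro_batch_size]


def accumulated_loss(per_item_losses, micro_batch_size):
    """Sum of per-micro-batch contributions, each item scaled by 1 / n_total."""
    n_total = len(per_item_losses)
    total = 0.0
    for mb in micro_batches(per_item_losses, micro_batch_size):
        # In training, .backward() would be called on this partial sum,
        # and gradients would add up across micro-batches.
        total += sum(loss / n_total for loss in mb)
    return total


losses = [float(i) for i in range(32)]  # stand-in per-item losses
full_batch_mean = sum(losses) / len(losses)
print(abs(accumulated_loss(losses, 8) - full_batch_mean) < 1e-9)
print(len(list(micro_batches(losses, 8))))
```

Dividing by the micro-batch size instead of `n_total` would overweight each micro-batch by a factor of 4 in this configuration.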
+### 3. vLLM GPU utilization: 0.30 → 0.45
+Gives vLLM 58GB (up from 38GB) for better KV cache — faster long-seq generation.
+### 4. LoRA targets: +in_proj, +out_proj
+Adds 38 Mamba SSM layers to the trainable set (was only 4 attention layers + MLP).
+## Expected Impact
+| Phase | Before | After |
+|-------|--------|-------|
+| Train (sweep 1+2) | 183s | ~80s |
+| Generation | 152s | ~120s |
+| **Total/step** | **335s** | **~200s** |
+## Validation Trend (v8 run5, before optimization)
+- Step 80: 3.42/10
+- Step 90: 3.48/10
+- Step 100: 3.56/10
+- Step 110: 4.02/10 ← accelerating
+## Launch Command
+```bash
+cd /home/bobber/lex-ft
+tmux new-session -d -s grpo_v8 './launch_grpo_v8.sh grpo-v8-run6 2>&1'
+tmux attach -t grpo_v8
+```

docs/GRPO_V8_ONPOLICY_PLAN.md ADDED

	@@ -0,0 +1,320 @@

+# GRPO v8 — On-Policy Training Plan with vLLM
+Created: 2026-03-27
+Reference: [Nemotron 3 Nano RL Recipe](https://github.com/NVIDIA-NeMo/Nemotron/tree/nano-3-training/docs/nemotron/nano3)
+Status: Design phase (v7 run10 terminated — was off-policy, blocked)
+---
+## Why v7 Failed (Off-Policy)
+GRPO v7 used llama.cpp GGUF model for generation and HF BF16 model for training.
+These are different model representations with different probability distributions.
+Result: training rewards looked positive but were measuring the wrong model's distribution.
+This is the same off-policy gap that broke GRPO v3 in March. We knew about it. v7 was a stepping stone.
+**v8 closes this gap completely**: vLLM generates from the exact same HF weights being trained.
+---
+## NVIDIA's Approach (Nemotron 3 Nano Cookbook)
+From the [NeMo-RL GRPO recipe](https://docs.nvidia.com/nemo/rl/latest/guides/grpo.html):
+```
+Generate responses from the current policy using vLLM
+→ Evaluate using NeMo-Gym reward environments
+→ Compute group-relative advantages per prompt
+→ Update policy with clipped gradients (PPO-style)
+```
+**Key parameters from NVIDIA's production run:**
+- `num_prompts_per_step`: 128
+- `num_generations_per_prompt`: 16
+- `max_total_sequence_length`: 49152 (~49K)
+- `ratio_clip_min`: 0.2, `ratio_clip_max`: 0.28 (asymmetric)
+- `use_on_policy_kl_approximation`: true
+- `use_importance_sampling_correction`: true
+- `lr`: 3e-6 (lower than our lr=2e-5 — model is post-SFT when they run RL)
+- `normalize_rewards`: true
+- `use_leave_one_out_baseline`: true (variance reduction)
+- `token_level_loss`: true
+- `reference_policy_kl_penalty`: 0 (KL disabled — they use importance sampling instead)
+- Reasoning on/off: 10% of samples use `enable_thinking=False`
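For readers unfamiliar with the leave-one-out baseline, one common formulation compares each completion's reward with the mean reward of the other completions for the same prompt. This is a sketch of that idea only; NeMo-RL's exact implementation may differ, and the function name is hypothetical.

```python
def leave_one_out_advantages(group_rewards):
    """Advantage of each completion relative to the mean of the other completions."""
    n = len(group_rewards)
    assert n >= 2, "leave-one-out needs at least two completions per prompt"
    total = sum(group_rewards)
    return [r - (total - r) / (n - 1) for r in group_rewards]


loo = leave_one_out_advantages([1.0, 2.0, 3.0, 4.0])
print(loo)
```

Because a completion's own reward is excluded from its baseline, the baseline does not depend on the sample being scored, which is the variance-reduction argument usually given for this choice.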
+**Our adaptations** (single GB10, LoRA, style task not verifiable):
+- 4–8 prompts/step (vs 128 — memory limited)
+- 4 generations/prompt (vs 16)
+- max_tokens: 4000 (vs 49K — our thinking chains are shorter)
+- lr: 3e-6 with 30-step warmup
+- KL penalty: 0.05 (light anchor — we don't have a GenRM)
+- Reference: disabled-LoRA (saves memory vs separate frozen model)
+---
+## Architecture
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                    GRPO v8 Per-Step Loop                        │
+│                                                                 │
+│  1. GENERATE (vLLM, .venv-vllm)                                │
+│     vLLM loads current LoRA weights from shared path           │
+│     generate(prompts × 4, max_tokens=4000, enable_thinking=T)  │
+│     → returns: thinking_tokens, answer_tokens, token_ids        │
+│                                                                 │
+│  2. REWARD (.venv-train)                                        │
+│     reward_fn(answer) → score 0–5                              │
+│     advantages = (rewards - mean) / (std + 1e-8)              │
+│     skip step if std < 0.01 (mode collapse detection)         │
+│                                                                 │
+│  3. IMPORTANCE WEIGHTS (.venv-train)                            │
+│     vllm_logprobs = token log-probs from vLLM output          │
+│     policy_logprobs = HF model forward (current weights)       │
+│     ratio = exp(policy_logprobs - vllm_logprobs)              │
+│     (ratio ≈ 1.0 if vLLM and HF weights are in sync)         │
+│                                                                 │
+│  4. LOSS (.venv-train)                                          │
+│     clipped_ratio = clip(ratio, 1-ε, 1+ε)   [ε=0.2]         │
+│     policy_loss = -mean(min(ratio × adv, clipped_ratio × adv))│
+│     kl_loss = mean(policy_logprobs - ref_logprobs)            │
+│     loss = policy_loss + 0.05 × kl_loss                       │
+│                                                                 │
+│  5. UPDATE (.venv-train)                                        │
+│     loss.backward(); optimizer.step()                          │
+│     save LoRA weights to shared path                           │
+└─────────────────────────────────────────────────────────────────┘
+```
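The clipped objective takes the minimum over the two products `ratio × adv` and `clipped_ratio × adv`. With a negative advantage this differs from clipping the ratio first and multiplying afterwards. A per-token sketch, with separate lower and upper clip widths so an asymmetric setting can be tried; the function name and default values are illustrative.

```python
import math


def clipped_pg_loss(policy_lp, behavior_lp, advantage, clip_low=0.2, clip_high=0.2):
    """Per-token PPO-style clipped loss; returns the value to minimize."""
    ratio = math.exp(policy_lp - behavior_lp)
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    # The min is taken over the two products, so the sign of the advantage matters.
    return -min(ratio * advantage, clipped * advantage)


# Ratio of 1.5 with a negative advantage: the unclipped term is the smaller
# product, so the penalty is not capped by the clip.
neg = clipped_pg_loss(math.log(1.5), 0.0, advantage=-1.0)
# Same ratio with a positive advantage: the clipped term caps the gain.
pos = clipped_pg_loss(math.log(1.5), 0.0, advantage=1.0)
print(neg, pos)
```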
+---
+## Implementation Plan
+### Phase 1: Shared Weight Protocol
+vLLM and HF model must share the same weights. Options:
+**Option A: Re-load vLLM each step** (simple, slow)
+- Save merged weights after each gradient update
+- Re-initialize vLLM `LLM()` from new checkpoint
+- Problem: 80s load time per step = impractical
+**Option B: vLLM weight sync API** (fast, complex)
+- vLLM 0.18 has `llm.llm_engine.model_executor.driver_worker.model_runner.model.load_weights()`
+- After each training step, sync LoRA deltas to vLLM in-place
+- No re-load needed — sub-second sync
+**Option C: Separate processes with weight file** (recommended)
+- Training process saves LoRA checkpoint to `/tmp/grpo_v8_lora/`
+- Generation process polls for new checkpoint, loads it before each generation
+- Clean separation: vLLM doesn't fight with training CUDA allocator
+- Step flow: train → save checkpoint → signal → generate → train...
+We'll use **Option C** for v8. Cleaner than fighting CUDA memory between vLLM and the optimizer.
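The file-based handoff in Option C could be sketched as follows. Writing into a temporary directory and renaming it into place keeps the generation process from ever seeing a half-written checkpoint. The helper names and the `adapter.bin` filename are hypothetical; the `step_N` directory layout follows the path used in this plan.

```python
import os
import re
import tempfile


def publish_checkpoint(root, step, files):
    """Write step_N atomically: fill a temp dir, then rename it into place."""
    os.makedirs(root, exist_ok=True)
    tmp_dir = tempfile.mkdtemp(prefix=".tmp_step_", dir=root)
    for name, data in files.items():
        with open(os.path.join(tmp_dir, name), "wb") as f:
            f.write(data)
    final_dir = os.path.join(root, f"step_{step}")
    os.rename(tmp_dir, final_dir)  # atomic on the same filesystem
    return final_dir


def latest_checkpoint(root):
    """Return (step, path) of the newest complete checkpoint, or None."""
    steps = []
    for entry in os.listdir(root) if os.path.isdir(root) else []:
        match = re.fullmatch(r"step_(\d+)", entry)
        if match:
            steps.append(int(match.group(1)))
    if not steps:
        return None
    step = max(steps)
    return step, os.path.join(root, f"step_{step}")


root = tempfile.mkdtemp()  # the plan uses /tmp/grpo_v8_lora/; a temp dir keeps this runnable
publish_checkpoint(root, 0, {"adapter.bin": b"\x00"})
publish_checkpoint(root, 1, {"adapter.bin": b"\x01"})
print(latest_checkpoint(root)[0])
```

The generation process can call `latest_checkpoint` before each generation phase and reload only when the step number has advanced.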
+### Phase 2: Importance Sampling
+When vLLM and HF model are synced, `ratio ≈ 1.0` for all tokens. But there will always be a small discrepancy because:
+- vLLM uses its own Mamba-2 kernel (not HF's `NemotronH.forward`)
+- Different chunking/precision in SSM computation
+Following NVIDIA's recipe: use `use_importance_sampling_correction=True`.
+```python
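+import torch
+# vLLM returns, per generated token, a dict {token_id: Logprob}
+# (requires SamplingParams(logprobs=...) at generation time)
+completion = outputs[0].outputs[0]
+token_ids = list(completion.token_ids)
+# HF model recomputes log-probs on the same token sequence
+hf_logprobs = compute_log_probs(model, token_ids)
+# Keep only the sampled token's log-prob at each position, as a tensor
+vllm_logprobs = torch.tensor(
+    [step[tid].logprob for tid, step in zip(token_ids, completion.logprobs)],
+    device=hf_logprobs.device)
+# Importance ratio (the vLLM side is a constant, so it carries no gradient)
+ratio = torch.exp(hf_logprobs - vllm_logprobs)
+# ratio = 1.0 for perfectly synced weights, slightly off otherwise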
+```
+### Phase 3: Overlong Filtering
+From NVIDIA's recipe: *"Excludes sequences that hit max length without EOS from loss computation"*
+```python
+# Filter out truncated completions from loss
+# finish_reason == "length" means the completion hit max_tokens without emitting EOS
+hit_max = [o.outputs[0].finish_reason == "length" for o in outputs]
+valid_mask = [not h for h in hit_max]
+# Only compute loss on completions that actually finished
+```
+### Phase 4: Reasoning On/Off Mixing
+From NVIDIA's recipe: *"Strip reasoning from 10% of samples"*
+```python
+# 10% of prompts use enable_thinking=False
+use_thinking = random.random() > 0.1
+prompt = tok.apply_chat_template(
+    msgs, add_generation_prompt=True, enable_thinking=use_thinking)
+```
+This teaches the model both modes simultaneously, preventing reasoning collapse.
+---
+## Training Script: grpo_v8_train.py
+### Key differences from v7:
+| Aspect | v7 | v8 |
+|---|---|---|
+| Generation | llama.cpp server (GGUF) | vLLM (.venv-vllm, HF weights) |
+| On-policy | ❌ (different model) | ✅ (same weights) |
+| Importance sampling | ❌ | ✅ |
+| PPO clipping | ❌ (vanilla PG) | ✅ (ε=0.2) |
+| Overlong filtering | ❌ | ✅ |
+| Thinking mixing | ❌ | ✅ (10% no-think) |
+| Weight sync | N/A | checkpoint every step |
+| Reference model | disabled-LoRA | disabled-LoRA (unchanged) |
+### Process architecture:
+```python
+# Two-process design
+# Process A: Training (.venv-train)
+#   - Loads HF model with LoRA
+#   - Receives generated token IDs from process B
+#   - Computes loss, updates weights
+#   - Saves LoRA delta to /tmp/grpo_v8_lora/step_N/
+# Process B: Generation (.venv-vllm)
+#   - Loads vLLM with current LoRA weights
+#   - Waits for generation requests (ZMQ socket or file polling)
+#   - Generates completions, returns token_ids + logprobs
+#   - Polls /tmp/grpo_v8_lora/ for weight updates after each step
+```
+Simpler alternative: **single-process sequential** (no parallelism, but correct):
+```
+for each step:
+  1. vLLM generate (subprocess call, returns JSON results)
+  2. Free vLLM GPU memory
+  3. Load HF model
+  4. Compute loss + update
+  5. Save LoRA checkpoint
+  6. Repeat
+```
+Time cost: the model must be loaded and unloaded every step (~15s each). Over 500 steps that adds up to ~2.5h of pure overhead.
+**Recommended: two processes with shared memory or file-based IPC**. One process holds vLLM, other holds HF model. They communicate via files. CUDA contexts don't conflict (CUDA supports multiple contexts per GPU).
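A minimal sketch of the file-based handoff between the two processes. The `LATEST.json` marker and the `adapter.bin` file name are hypothetical; the plan only fixes the `step_N/` directory layout. The idea is that the trainer finishes writing a checkpoint before it atomically swaps the marker, so the generator never picks up a half-written step.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def publish_checkpoint(root: Path, step: int, payload: bytes) -> Path:
    """Trainer side: write step_N/ first, then atomically update the marker."""
    step_dir = root / f"step_{step}"
    step_dir.mkdir(parents=True, exist_ok=True)
    # Stand-in for the real LoRA file (hypothetical name).
    (step_dir / "adapter.bin").write_bytes(payload)
    # Write the marker to a temp file, then rename it into place.
    # os.replace is atomic when source and target are on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=root, prefix=".latest_")
    with os.fdopen(fd, "w") as fh:
        json.dump({"step": step}, fh)
    os.replace(tmp_name, root / "LATEST.json")
    return step_dir


def poll_latest(root: Path, last_seen: int) -> Optional[int]:
    """Generator side: return a newer step number, or None if nothing changed."""
    marker = root / "LATEST.json"
    if not marker.exists():
        return None
    step = json.loads(marker.read_text())["step"]
    return step if step > last_seen else None
```

The generator would call `poll_latest` before each generation round and reload the adapter only when it returns a step number.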
+---
+## Reward Function — No Changes Needed
+The v7 reward function works. The key insight from NVIDIA's recipe: they use *verifiable* rewards (math correctness, code execution). We can't do that for style tasks.
+Our heuristic (5 components, 0–5 scale) is the best we can do without a learned reward model. The baseline eval shows 4.35/5 for the base model — our reward ceiling.
+**Possible enhancement: GenRM-style reward** (future work)
+- Use the base model to score completions against the prompt
+- "Does this sound like Lex Fridman?" → scored by the model itself
+- Circular comparison: generate N, score each against the others
+- Expensive but avoids heuristic gaming
+---
+## Hyperparameters
+```yaml
+# grpo_v8_config.yaml — adapted from Nemotron 3 Nano cookbook
+generation:
+  num_prompts_per_step: 4          # vs NVIDIA's 128 (memory limited)
+  num_generations_per_prompt: 4    # vs NVIDIA's 16
+  max_tokens: 4000                 # full thinking chain budget
+  temperature: 1.0
+  top_p: 0.95
+  enable_thinking_fraction: 0.9   # 10% no-think mixing
+  overlong_filter: true           # exclude truncated from loss
+loss:
+  ratio_clip_min: 0.2             # from NVIDIA recipe
+  ratio_clip_max: 0.28            # asymmetric clipping
+  use_importance_sampling: true   # correct vLLM/HF mismatch
+  kl_coef: 0.05                   # light KL anchor
+  token_level_loss: true          # per-token normalization
+  use_leave_one_out_baseline: true # variance reduction
+optimizer:
+  type: AdamW
+  lr: 3e-6                        # from NVIDIA recipe (post-SFT RL)
+  min_lr: 3e-7
+  weight_decay: 0.0
+  clip_grad: 1.0
+  warmup_steps: 30
+lora:
+  r: 32
+  alpha: 64
+  targets: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
+  # Note: targets only attention layers (4/25 layers in this model)
+training:
+  max_steps: 500
+  val_period: 10
+  save_period: 50
+  zero_std_skip: true             # skip if all rewards identical
+```
+---
+## Validation Plan
+Every 10 steps, run 5 fixed eval prompts and log to W&B:
+```python
+EVAL_PROMPTS = [
+    ("Andrej Karpathy", "Neural networks are simple."),
+    ("Elon Musk", "I think the biggest risk is AI."),
+    ("A physicist", "Time might not be fundamental."),
+    ("A philosopher", "Free will is an illusion."),
+    ("A jazz musician", "The best songs write themselves."),
+]
+```
+Track: `eval/mean_score`, `eval/pct_questions`, `eval/mean_length`, `eval/think_end_rate`.
+**Success criteria**: `eval/mean_score > 4.0/5` consistently across steps; beating the base-model baseline requires > 4.35/5.
+---
+## Expected Timeline
+| Step | Action | ETA |
+|---|---|---|
+| 0 | Write `grpo_v8_train.py` | 1–2h |
+| 1 | Smoke test (5 steps, check losses/rewards are sane) | 30min |
+| 2 | Full run (500 steps) | ~18h at ~130s/step |
+| 3 | Eval vs base model | 30min |
+| 4 | If >4.35/5: merge LoRA → GGUF → push to HF | 1h |
+---
+## Previous GRPO Postmortems
+| Version | Status | Root Cause |
+|---|---|---|
+| v3 | gibberish | Off-policy (llama.cpp gen + HF train), no KL |
+| v4 | oscillating loss | rmsnorm mock silently broke forward pass |
+| v5 | not trained | batched gen had EOS detection bug |
+| v6 runs 1–10 | collapse | gen_max_tokens 300–800 (mid-think truncation → reward=0) |
+| v7 run10 | stuck/blocked | Off-policy (llama.cpp GGUF ≠ HF BF16), stalled at step 1 |
+| **v8** | **planned** | On-policy vLLM + importance sampling correction |
+---
+## References
+- [Nemotron 3 Nano RL Guide](https://github.com/NVIDIA-NeMo/Nemotron/blob/nano-3-training/docs/nemotron/nano3/rl.md)
+- [NeMo-RL GRPO Documentation](https://docs.nvidia.com/nemo/rl/latest/guides/grpo.html)
+- [Tech Report Section 3.2](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf)
+- [vLLM Setup Notes](./VLLM_SETUP_NOTES.md)
+- [GRPO v3 Postmortem](./GRPO_V3_POSTMORTEM.md) — the original off-policy diagnosis

docs/GRPO_V8_TRAINING_FLOW.md ADDED

	@@ -0,0 +1,184 @@

+# GRPO v8 — End-to-End Training Flow
+> **Run6 validation scores**: Step 10 → 6.82/10, Step 20 → 7.64/10
+> (vs run5 reaching 4.02/10 at step 110 — Mamba LoRA is the difference)
+---
+## High-Level Architecture
+Two models coexist in GPU memory every step:
+```
+┌─────────────────────────────────────────────────────────┐
+│  vLLM (38GB)          ──────→  fast completions         │
+│  frozen base weights  ──────→  + per-token log-probs    │
+└─────────────────────────────────────────────────────────┘
+           ↓ sync LoRA every step
+┌─────────────────────────────────────────────────────────┐
+│  HF model (8GB)       ──────→  ref logprobs (LoRA off)  │
+│  + LoRA (41M params)  ──────→  policy logprobs + grads  │
+└─────────────────────────────────────────────────────────┘
+```
+One training step = **8 prompts × 4 completions = 32 completions**.
+---
+## Phase 1: Generation (vLLM, ~80–140s)
+```
+prompt = [system: "You are Lex Fridman..."]
+       + [user: <guest's previous statement>]
+completions = vLLM.generate(prompt, n=4, max_tokens=4000, temp=0.9)
+# vLLM also returns per-token log-probs (used for IS correction):
+vllm_lp[t] = log P_vllm(token_t | token_1..t-1, prompt)
+```
+vLLM uses CUDA graphs → 150–250 tok/s vs ~30–50 tok/s from a plain HF model.
+---
+## Phase 2: Reward (CPU, ~1s)
+Each completion gets a scalar reward from `reward_v8.reward_fn_group()`.
+**Structural scoring only — no LLM judge, ungameable by keywords.**
+```
+r(completion) = base_score - penalties + diversity_bonus
+base_score:
+  + ends_with_question?    → +1.5   (Lex always asks questions)
+  + is_open_question?      → +0.5   (not yes/no binary)
+  + sufficient_length?     → +0.5   (≥8 content words)
+  + has_pivot?             → +0.5   (reframes guest's statement)
+  + lexical_diversity?     → +0.5   (unique content words)
+penalties:
+  - filler_opener?         → -2.0   ("that's fascinating", "great point"...)
+  - collapse_template?     → -3.0   ("as Lex Fridman", "the interview"...)
+  - yes_no_question?       → -1.5   (starts with "Is/Are/Was/Do...")
+  - parrots_prompt?        → -1.0   (>50% token overlap with prompt)
+diversity_bonus (group-level, across all 4 completions for same prompt):
+  + unique angles explored → up to +0.5
+```
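To make the scoring scheme concrete, here is a simplified sketch of a few of the components listed above. It is not `reward_v8`: the function name, the word tokenizer, and the short phrase lists are assumptions for illustration, and it counts all words instead of content words. The pivot, lexical-diversity, collapse-template, and group-level diversity terms are left out.

```python
import re

# Example phrases taken from the list above; the real lists are presumably longer.
FILLER_OPENERS = ("that's fascinating", "great point")
YES_NO_STARTS = ("is", "are", "was", "do")


def words(text: str) -> list[str]:
    """Lowercase word tokens (a crude stand-in for content-word extraction)."""
    return re.findall(r"[a-z']+", text.lower())


def sketch_reward(prompt: str, completion: str) -> float:
    text = completion.strip()
    toks = words(text)
    score = 0.0
    # Base score: ends with a question, and is long enough.
    if text.endswith("?"):
        score += 1.5
    if len(toks) >= 8:
        score += 0.5
    # Penalty: filler opener.
    if text.lower().startswith(FILLER_OPENERS):
        score -= 2.0
    # Penalty: the closing question is a yes/no question.
    last_sentence = re.split(r"(?<=[.!?])\s+", text)[-1]
    first = words(last_sentence)[:1]
    if text.endswith("?") and first and first[0] in YES_NO_STARTS:
        score -= 1.5
    # Penalty: more than half of the completion's tokens come from the prompt.
    prompt_vocab = set(words(prompt))
    if toks and sum(t in prompt_vocab for t in toks) / len(toks) > 0.5:
        score -= 1.0
    return score
```

Because every check is structural, a completion cannot raise its score just by inserting keywords; it has to change its shape (end on an open question, avoid echoing the prompt).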
+---
+## Phase 3: Sweep 1 — Reference Logprobs (HF, no grad)
+LoRA **disabled** → model acts as frozen reference policy:
+```
+for each completion (micro-batched, 8 at a time):
+    ref_lp[t] = log P_ref(token_t | context)   # no_grad
+```
+Memory: batch=8 × ~800 tokens × 42 layers × 3136 hidden × 2 bytes (bf16) ≈ 1.7GB activations → safe.
+This is the **KL anchor** — prevents the policy from drifting too far from the base model.
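The activation estimate can be reproduced with a few lines of arithmetic. This sketch assumes one hidden-state tensor per layer and a hidden size of 3136 (the projection width quoted for this model elsewhere in these docs); real peak memory also depends on intermediate tensors inside each block.

```python
def activation_bytes(batch: int, tokens: int, layers: int, hidden: int,
                     bytes_per_elem: int = 2) -> int:
    """One hidden-state tensor per layer, bf16 by default (2 bytes per element)."""
    return batch * tokens * layers * hidden * bytes_per_elem


estimate = activation_bytes(batch=8, tokens=800, layers=42, hidden=3136)
print(f"{estimate / 1e9:.2f} GB")
```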
+---
+## Phase 4: Sweep 2 — Policy Gradient (HF, with grad, 1-at-a-time)
+LoRA **enabled** → learner computes gradients:
+```
+policy_lp[t] = log P_θ(token_t | context)   # with grad
+```
+**Step 1: Advantage (leave-one-out baseline)**
+```
+A_i = (r_i - mean_{j≠i}(r_j)) / (std(r_1..N) + ε)
+```
+Centering within the group of 32 completions normalizes away prompt difficulty.
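A small sketch of the leave-one-out advantage from the formula above. The function name is hypothetical, and using the population standard deviation is an assumption; the formula does not say which estimator the training script uses.

```python
from statistics import pstdev


def loo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """A_i = (r_i - mean of the other rewards) / (std of all rewards + eps)."""
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two rewards for a leave-one-out baseline")
    total = sum(rewards)
    std = pstdev(rewards)  # population std (assumption)
    return [(r - (total - r) / (n - 1)) / (std + eps) for r in rewards]
```

When all rewards in a group are identical, every advantage is zero, so such a group contributes no gradient signal.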
+**Step 2: Importance Sampling (IS) ratio**
+Corrects for the fact that vLLM generated from step-N weights while we're now at step-N+ε:
+```
+ρ_t = exp(policy_lp[t] - vllm_lp[t])
+ρ_t = clamp(ρ_t, 0, 10)
+```
+- ρ ≈ 1.0 → policies agree, use gradient as-is
+- ρ >> 1 → current policy favors this token more → clip
+- ρ << 1 → policies drifted apart → downweight
+**Step 3: Asymmetric PPO clipping** (NVIDIA RL cookbook)
+```
+ρ̃_t = clamp(ρ_t, 1 - ε_low, 1 + ε_high)
+       where ε_low=0.2, ε_high=0.28  # asymmetric: more room to raise a token's probability than to lower it
+pg_loss_i = -mean_t[ min(ρ_t · A_i,  ρ̃_t · A_i) ]
+```
+**Step 4: KL penalty**
+```
+kl_loss_i = mean_t[ policy_lp[t] - ref_lp[t] ]
+```
+**Step 5: Per-completion loss and accumulation**
+```
+L_i = pg_loss_i + β · kl_loss_i       # β = 0.05
+L_total = (1/N) · Σ_i L_i             # N=32
+(L_i / N).backward()   # accumulate gradient per completion
+optimizer.step()        # AdamW 8-bit
+```
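Steps 2 to 5 can be checked numerically with plain floats. This is an illustration of the arithmetic only: there is no autograd here, and the function name and argument layout are assumptions, not the training script's API.

```python
import math


def completion_loss(policy_lp, vllm_lp, ref_lp, advantage,
                    eps_low=0.2, eps_high=0.28, beta=0.05, ratio_cap=10.0):
    """Per-completion loss L_i = pg_loss_i + beta * kl_loss_i from per-token log-probs."""
    pg_terms, kl_terms = [], []
    for p, v, r in zip(policy_lp, vllm_lp, ref_lp):
        # IS ratio; exp() is never negative, so only the upper clamp matters.
        rho = min(math.exp(p - v), ratio_cap)
        # Asymmetric PPO clip to [1 - eps_low, 1 + eps_high].
        rho_clip = min(max(rho, 1.0 - eps_low), 1.0 + eps_high)
        # Pessimistic choice between the clipped and unclipped objective.
        pg_terms.append(min(rho * advantage, rho_clip * advantage))
        # Simple log-ratio KL estimate against the reference policy.
        kl_terms.append(p - r)
    pg_loss = -sum(pg_terms) / len(pg_terms)
    kl_loss = sum(kl_terms) / len(kl_terms)
    return pg_loss + beta * kl_loss
```

With a positive advantage, a ratio above `1 + eps_high` stops increasing the objective. With a negative advantage the unclipped term is the smaller one, so the penalty is applied in full.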
+---
+## Why It's "On-Policy"
+After each optimizer step, the updated LoRA weights are synced into vLLM:
+```
+vllm.load_weights(adapter_model.safetensors)
+```
+So generation at step N+1 uses the policy from step N. The IS ratio `ρ_t` measures the drift — at run6 step 0 it showed `ratio_mean=0.9991, clipped=0.5%`, confirming near-perfect on-policy behavior.
+---
+## Comparison: lex-ft v8 vs routangseng phase8
+| | routangseng phase8 (TRL GRPOTrainer) | lex-ft v8 |
+|---|---|---|
+| **Generation model** | Same HF model (all weights) | vLLM (frozen base + LoRA synced every step) |
+| **Generation speed** | ~30–50 tok/s | **150–250 tok/s** (CUDA graphs) |
+| **Training model** | Same model (LoRA on) | Separate HF model (LoRA on) |
+| **LoRA targets** | q/k/v/o/gate/up/down (attn+MLP) | + **in_proj/out_proj (38 Mamba layers)** |
+| **Trainable params** | ~0.4% | **1.03%** (41M params) |
+| **Reference policy** | Implicit in TRL | Explicit: LoRA disabled on same model |
+| **IS correction** | None (always on-policy) | `ρ_t = exp(policy_lp − vllm_lp)` |
+| **Architecture** | Transformer (standard) | Hybrid Mamba-2 + 4 Attention layers |
+**The key difference in practice:** routangseng's LoRA only trained the 4 attention layers and MLP projections — the 38 Mamba SSM layers (which handle ~90% of the sequence processing) were frozen. lex-ft v8 adds `in_proj`/`out_proj` to reach those layers, giving the recurrent state space model a chance to actually learn the interviewer style. This is why run6 reached **7.64/10 at step 20** vs run5 reaching **4.02/10 at step 110**.
+---
+## What IS (Importance Sampling) Means
+IS bridges the gap between the "generator" (vLLM, slightly stale) and "learner" (HF+LoRA, current):
+```
+ρ_t = P_current(token_t) / P_vllm(token_t)
+```
+Without IS, you'd be treating stale completions as if they were generated by the current policy — which can cause instability when LoRA updates are large. The PPO clipping then ensures we don't take gradient steps larger than the data supports.
+---
+*Last updated: 2026-03-28 | Run6 step 27 ongoing*

docs/KAGGLE_VS_OURS_COMPARISON.md ADDED

	@@ -0,0 +1,106 @@

+# Kaggle Notebook vs Our Approach — Deep Comparison
+## Key Difference: 30B vs 4B Model
+The Kaggle competition uses **30B-A3B** (`nemotron-3-nano-30b-a3b-bf16`).
+We use **4B** (`NVIDIA-Nemotron-3-Nano-4B-BF16`).
+This matters because:
+```
+30B pattern: MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME
+4B pattern:  M-M-M-MM-M-M*-M-M*-M-M-M*-M-M-MM*-MMM-M-M-
+```
+- 30B uses `M` (mamba), `E` (moe), `*` (attention) — all recognized by both transformers 4.x custom code AND transformers 5.x native code
+- 4B uses `-` (mlp) — ONLY recognized by transformers 4.x custom code. Native transformers 5.x chokes on it.
+**This is why the Kaggle code works with transformers 5.x and ours doesn't.** It's not a code difference — it's a model architecture difference.
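The two pattern strings can be inspected directly. The sketch below copies them from the block above and counts block types; the `summarize` helper and the name mapping are illustrative, not part of either code base.

```python
from collections import Counter

BLOCK_NAMES = {"M": "mamba", "E": "moe", "*": "attention", "-": "mlp"}
PATTERN_30B = "MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME"
PATTERN_4B = "M-M-M-MM-M-M*-M-M*-M-M-M*-M-M-MM*-MMM-M-M-"


def summarize(pattern: str):
    """Return (block-type counts, 0-based indices of the attention layers)."""
    counts = Counter(BLOCK_NAMES[ch] for ch in pattern)
    attention_layers = [i for i, ch in enumerate(pattern) if ch == "*"]
    return dict(counts), attention_layers


counts_4b, attn_4b = summarize(PATTERN_4B)
counts_30b, _ = summarize(PATTERN_30B)
print(len(PATTERN_4B), counts_4b, attn_4b)
print(len(PATTERN_30B), counts_30b)
```

For the 4B string this gives 42 layers with `mlp` blocks present, while the 30B string has 52 layers and no `-` at all, which is the difference the paragraph above describes.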
+## Detailed Comparison
+| Aspect | Kaggle (dennisfong) | Our Approach |
+|--------|-------------------|--------------|
+| **Model** | 30B-A3B (52 layers) | 4B (42 layers) |
+| **transformers** | 5.x (Kaggle env) | 4.48.3 (pinned) |
+| **torch** | 2.10.0 | 2.11.0+cu130 |
+| **GPU** | Kaggle (Blackwell) | GB10 (Blackwell SM 12.1) |
+| **rmsnorm_fn** | Pure PyTorch mock | Pure PyTorch mock (same approach) |
+| **is_fast_path_available** | Forced False after load | Mock makes it False at import |
+| **causal_conv1d** | Not explicitly mocked | Mocked |
+| **Training method** | SFTTrainer (trl 0.24) | Custom GRPO loop |
+| **LoRA rank** | 32 | 64 |
+| **LoRA alpha** | 16 | 256 |
+| **Learning rate** | 2e-4 | 5e-6 |
+| **Max seq len** | 1024 | 200 (max_new_tokens) |
+| **Gradient checkpointing** | Yes (use_reentrant=True) | No |
+| **trust_remote_code** | True | True |
+## Kaggle's rmsnorm_fn — Simpler Than Ours
+```python
+# Kaggle version — does NOT handle group_size or norm_before_gate correctly
+def _pure_rmsnorm_fn(x, weight, bias=None, z=None, eps=1e-5,
+                     group_size=None, norm_before_gate=True, upcast=True):
+    dtype = x.dtype
+    if upcast: x = x.float()
+    variance = x.pow(2).mean(-1, keepdim=True)
+    x_normed = x * torch.rsqrt(variance + eps)
+    out = x_normed * weight.float()
+    if bias is not None: out = out + bias.float()
+    if z is not None: out = out * F.silu(z.float())  # always gate AFTER norm
+    return out.to(dtype)
+```
+Problems with Kaggle's version:
+1. **Ignores `group_size`** — does full-dimension RMSNorm even when group normalization is requested
+2. **Ignores `norm_before_gate` parameter** — always applies gate after norm, ignoring the flag
+3. Works for the 30B because the 30B may not use group_size, or the error is small enough for SFT
+Our version handles both correctly (from the mamba_ssm reference implementation).
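To show what the two ignored parameters change, here is a pure-Python sketch of gated RMSNorm on a single vector. It follows our understanding of the reference semantics (gate before or after the norm, one RMS per group) but is not the project's implementation: it works on lists of floats and omits bias and dtype upcasting.

```python
import math


def silu(v: float) -> float:
    return v / (1.0 + math.exp(-v))


def gated_rmsnorm(x, weight, z=None, eps=1e-5, group_size=None, norm_before_gate=True):
    """RMSNorm over one vector, with optional grouping and gating."""
    if z is not None and not norm_before_gate:
        # Gate first, then normalize the gated values.
        x = [xi * silu(zi) for xi, zi in zip(x, z)]
    size = group_size or len(x)
    if len(x) % size != 0:
        raise ValueError("vector length must be a multiple of group_size")
    out = []
    for start in range(0, len(x), size):
        group = x[start:start + size]
        # Each group gets its own RMS statistic.
        rstd = 1.0 / math.sqrt(sum(g * g for g in group) / size + eps)
        out.extend(g * rstd for g in group)
    out = [o * w for o, w in zip(out, weight)]
    if z is not None and norm_before_gate:
        # Normalize first, then gate the normalized values.
        out = [o * silu(zi) for o, zi in zip(out, z)]
    return out
```

With `x = [1, 1, 10, 10]`, full-width normalization shrinks the first two entries because the large entries dominate the RMS, while `group_size=2` normalizes each pair on its own. That is the kind of discrepancy a mock that ignores `group_size` would introduce.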
+## Kaggle's Key Trick: Post-Load Patch
+```python
+# Force slow path AFTER model is loaded
+for name, mod in sys.modules.items():
+    if "modeling_nemotron_h" in name:
+        mod.is_fast_path_available = False
+```
+This is important because with `trust_remote_code=True`, the model code imports `mamba_ssm` at module level. If the import succeeds (even with a mock), `is_fast_path_available` may be set to `True` if all the functions are non-None. The Kaggle code forces it False AFTER loading to ensure `torch_forward` is used.
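A slightly more defensive variant of the same patch, as a sketch. The function name is hypothetical; the only behavioral differences from the snippet above are that it iterates over a snapshot of `sys.modules` (so an import triggered during the loop cannot invalidate the iteration), skips `None` entries, and reports which modules it touched.

```python
import sys
import types


def force_slow_path(marker: str = "modeling_nemotron_h") -> list[str]:
    """Set is_fast_path_available = False on every loaded module whose name contains marker."""
    patched = []
    # list(...) takes a snapshot so the dict can change while we loop.
    for name, mod in list(sys.modules.items()):
        if mod is not None and marker in name:
            mod.is_fast_path_available = False
            patched.append(name)
    return patched
```

Calling it right after `from_pretrained(..., trust_remote_code=True)` and logging the returned names makes it easy to confirm the patch actually found the dynamically loaded module.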
+## Kaggle's Environment
+Key detail: Kaggle provides a **custom Python environment** with specific Blackwell GPU patches:
+- Custom ptxas-blackwell binary
+- Triton backend patches
+- Pre-installed torch 2.10.0 with cu128
+- The Kaggle environment has `mamba_ssm` and `causal_conv1d` pre-installed (they just bypass the Triton kernels)
+This is NOT a vanilla transformers 5.x setup — it's a Kaggle-specific environment with hardware-specific patches.
+## What We Should Take From Kaggle
+1. ✅ **Pure PyTorch rmsnorm_fn** — already doing this
+2. ✅ **Force `is_fast_path_available = False`** — should add explicitly
+3. 🔄 **Gradient checkpointing with `use_reentrant=True`** — would save memory
+4. 🔄 **SFTTrainer from trl** — cleaner than our custom loop for SFT tasks
+5. ❌ **Their rmsnorm ignoring group_size** — we should NOT copy this (ours is more correct)
+## Recommendation
+**Stick with transformers 4.48.3 for the 4B model.** Here's why:
+1. The 4B model's `-` pattern is only understood by the custom code in transformers 4.x
+2. Transformers 5.x native NemotronH doesn't support `mlp` as a block type
+3. The Kaggle notebook works with 5.x because the 30B model has a different pattern
+4. Our 4.48.3 setup produces correct outputs (CE loss 3.88, coherent generation)
+5. Training is already running and producing results
+**If we wanted transformers 5.x**, we'd need to:
+- Modify the 4B model's `config.json` to change the pattern format
+- OR fix native transformers to support `mlp` block type
+- Both are more work than just using 4.48.3
+The Kaggle approach is valid for the 30B but doesn't apply to our 4B model.

docs/LEXFRIDMAN_INTERVIEWER_PLAN.md ADDED

	@@ -0,0 +1,183 @@

+# Lex Fridman AI Interviewer — Project Plan
+## Goal
+Fine-tune **NVIDIA Nemotron 3 Nano 4B** (a hybrid Mamba-2 + Attention architecture) into an AI interviewer that conducts conversations in the style of Lex Fridman. The model should:
+1. **Ask thoughtful, concise follow-up questions** — not lecture, summarize, or monologue
+2. **Reference the guest's expertise and prior statements** — show it's actually listening
+3. **Stay brief** — Lex's questions are typically 30–80 words, not paragraphs
+4. **Avoid filler** — no "Great question!", no generic transitions
+5. **Match Lex's intellectual curiosity** — deep, sometimes philosophical, always genuine
+The target is a locally-deployable GGUF model served via llama.cpp that outperforms the base model on our interviewer eval benchmark.
+## Eval Leaderboard
+### v2 Eval (0-10 scale, 25 scenarios, 10 dimensions)
+| # | Model | Score | Notes |
+|---|-------|-------|-------|
+| 1 | **Nemotron 4B base** | **6.90/10** | No training — still the best |
+| 2 | Full SFT v4 (2ep, masked) | 5.54/10 | Best SFT — completion-only loss masking |
+| 3 | Full SFT v5 (3ep, packed) | 5.17/10 | 3rd epoch hurt (overfitting) |
+| 4 | Full SFT v3 (2ep, v4 data) | 3.96/10 | No loss masking → user prefix contamination |
+### v1 Eval (0-5 scale, 20 scenarios, 5 dimensions)
+| # | Model | Score |
+|---|-------|-------|
+| 1 | Nemotron 4B base Q8 | 4.35/5 |
+| 2 | GPT-5.4 | 4.30/5 |
+| 3 | Full SFT v4 | 3.85/5 |
+| 4 | Full SFT v2 | 3.20/5 |
+| 5 | Full SFT v3 | 2.50/5 |
+| 6 | LoRA SFT v1 | 2.10/5 |
+| 7 | Full SFT v1 | 2.00/5 |
+## SFT Verdict: Dead End
+**SFT causes catastrophic forgetting on this model.** It learns surface patterns (brevity +0.8, no filler +0.2) but destroys deeper capabilities:
+- guest_reference: 6.0 → 1.5 (-4.5)
+- specificity: 7.0 → 3.1 (-3.9)
+- interview_flow: 8.2 → 6.4 (-1.8)
+- depth: 6.1 → 4.7 (-1.4)
+3rd epoch made it worse (5.54 → 5.17), confirming overfitting on 6,335 samples.
+## Current Phase: GRPO (Reinforcement Learning)
+### Strategy
+- Start from **base model** (6.90/10) — preserve all strengths
+- Use **LoRA targeting ALL 42 layers** (not just 4 attention layers)
+- Reward function targets weak dimensions without destroying strong ones
+- Anti-gaming measures prevent reward hacking
+### LoRA Configuration (rank=32)
+**Key finding: LoRA can target ALL projections across all 42 layers, not just attention.**
+| Module | Layers | Shape | Dim Sum |
+|--------|--------|-------|---------|
+| in_proj (Mamba) | 21 | (17504, 3136) | 433,440 |
+| out_proj (Mamba) | 21 | (3136, 7680) | 227,136 |
+| up_proj (MLP) | 17 | (12544, 3136) | 266,560 |
+| down_proj (MLP) | 17 | (3136, 12544) | 266,560 |
+| q/k/v/o_proj (Attn) | 4 | various | 99,328 |
+| **Total dim_sum** | | | **1,293,024** |
+- **rank=32**: 41,376,768 trainable params (1.04% of model)
+- **alpha=64** (2× rank)
+- Optimizer memory: 0.33 GB (vs 31.8 GB for full fine-tune!)
+- Est. peak memory: ~58 GB (safe for 128 GB DGX Spark)
+Previous LoRA SFT failed because it only targeted q/k/v/o_proj = 4 layers. With in_proj/out_proj, LoRA now touches the Mamba layers too.
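The parameter count in the table can be reproduced from the projection shapes. This sketch uses the shapes listed in this document (the attention shapes come from the architecture notes) and assumes AdamW keeps two fp32 moments per trainable parameter, which is how the optimizer-memory figures appear to be derived.

```python
# name -> (number of layers, weight shapes of the targeted projections in one layer)
PROJECTIONS = {
    "in_proj": (21, [(17504, 3136)]),
    "out_proj": (21, [(3136, 7680)]),
    "up_proj": (17, [(12544, 3136)]),
    "down_proj": (17, [(3136, 12544)]),
    "q/k/v/o_proj": (4, [(5120, 3136), (1024, 3136), (1024, 3136), (3136, 5120)]),
}


def dim_sum(layers: int, shapes) -> int:
    """layers x sum(shape[0] + shape[1]) for one module type."""
    return layers * sum(rows + cols for rows, cols in shapes)


def lora_params(rank: int) -> int:
    """LoRA adds rank x (shape[0] + shape[1]) parameters per adapted matrix."""
    return rank * sum(dim_sum(layers, shapes) for layers, shapes in PROJECTIONS.values())


def adam_state_gb(n_params: int) -> float:
    """Two fp32 moments per parameter = 8 bytes per parameter (assumption)."""
    return n_params * 8 / 1e9


print(lora_params(32), f"{adam_state_gb(lora_params(32)):.2f} GB")
print(f"{adam_state_gb(3_973_556_832):.1f} GB for full fine-tuning")
```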
+### Reward Function v3
+4 components, validated against base model outputs (Pearson r=0.58 with eval v2):
+| Component | Weight | What it measures |
+|-----------|--------|-----------------|
+| R1: Gate | 30% | Clean turn (not meta, not user-prefix, not empty) |
+| R2: Question | 35% | Single focused question, ≤60 words, at end of response |
+| R3: Guest ref | 20% | Builds on guest content (word overlap, capped to prevent parrot gaming) |
+| R4: Penalties | 15% | Filler, generic, repetition, formulaic patterns |
+Anti-gaming measures:
+- Guest word overlap capped at 6 (prevents copy-paste)
+- Question must be in last sentence (prevents question-stuffing then rambling)
+- Parrot detection: penalizes if >50% response words are from guest
+- Long contiguous phrase detection (5+ gram copy)
+- Keyword stuffing penalty (3+ depth buzzwords in <30 words)
+Validation results:
+- Meta/CoT responses: -0.14 avg (correctly penalized)
+- Clean responses: +0.34 avg
+- Range: [-0.88, +0.68] (wide spread for GRPO)
+- High reward responses score 7.3/10 on eval vs 6.0/10 for low reward
+### GRPO Training Config
+```
+Model:          Nemotron 3 Nano 4B (base, fresh)
+Method:         GRPO with LoRA
+Reward:         v3 (validated, anti-gaming)
+LoRA rank:      32 (1.04%, 41.4M params)
+LoRA alpha:     64
+LoRA targets:   in_proj, out_proj, up_proj, down_proj, q_proj, k_proj, v_proj, o_proj
+LR:             5e-5 (10× full-FT per tinker-cookbook guidance, conservative for GRPO)
+Beta (KL):      0.04
+Batch size:     2
+Grad accum:     4 (eff batch = 8)
+Num generations: 2
+Max prompt:     384
+Max completion: 128
+Temperature:    0.9
+Prompts:        512
+Epochs:         1
+Est. peak mem:  ~58 GB
+```
+## Data
+### v4 Dataset (for SFT — completed)
+- 6,335 segments from 108 guests, 113 episodes
+- Quality filtered: min score 3, 100% ends-on-assistant, 100% last-asks-question
+- Source: `data/interview_segments_v4.jsonl` (pre-formatted), `data/interview_segments_v4_messages.jsonl` (raw messages)
+### GRPO Prompts
+- Derived from v4 messages: system + last user message per segment
+- 512 prompts sampled for calibration run
+- Full 6,335 available for main training
+## Architecture Notes
+### Nemotron 3 Nano 4B Hybrid Architecture
+- 42 layers total: 21 Mamba-2 + 17 MLP + 4 Attention (attention at layers 12, 17, 24, 32)
+- 3,973,556,832 parameters
+- Projection modules per type (weight shapes):
+  - Mamba: `in_proj` (17504, 3136), `out_proj` (3136, 7680), 21 layers each
+  - MLP: `up_proj` (12544, 3136), `down_proj` (3136, 12544), 17 layers each
+  - Attention: `q_proj` (5120, 3136), `k_proj` (1024, 3136), `v_proj` (1024, 3136), `o_proj` (3136, 5120), 4 layers each
+### DGX Spark (GB10) Environment
+- 128 GB unified memory (CPU+GPU shared LPDDR5X)
+- ~273 GB/s memory bandwidth
+- CUDA 12.1, Compute 12.1 (Blackwell)
+- Full fine-tune GRPO OOMs (~112 GB peak) — must use LoRA
+- Unsloth quirks: forces `remove_unused_columns=True` (patched), no Flash Attention 2
+## Technical Lessons
+### SFT
+1. **Completion-only loss masking is critical** for multi-turn chat — without it, model learns to predict user role tokens
+2. **Chat template must match inference format** — `<think></think>` tags in training data
+3. **Data quality > quantity** — filtering from 7,580 → 6,335 improved quality metrics
+4. **3 epochs overfits** on 6,335 samples — 2 is the max for this dataset size
+5. **Packing doesn't speed up on GB10** — compute-bound, not memory-bound
+### LoRA
+1. **Target ALL projection types** on hybrid architectures — not just attention
+2. **Previous LoRA failures were from targeting only 4/42 layers**, not from LoRA being incompatible
+3. **LoRA parameter count formula**: `rank × Σ(shape[0] + shape[1])` per tinker-cookbook
+### GRPO
+1. **Full fine-tune GRPO OOMs on DGX Spark** — policy + reference + optimizer + scoring = ~112 GB
+2. **LoRA GRPO is ~58 GB** — optimizer drops from 31.8 GB to 0.33 GB
+3. **Reward validation against eval is essential** before training
+4. **Reward needs anti-gaming** — LLMs will exploit keyword matching, parroting, formulaic patterns
+5. From routangseng: reward bounded [-1, 1], few strong signals > many weak ones, `prompt` column must be pre-formatted text
+## Timeline
+- **Mar 18**: Project started, data crawl from 113 episodes
+- **Mar 19**: First LoRA SFT, environment debugging
+- **Mar 20**: 7-model eval comparison, full SFT v1-v2, v2-v4 datasets
+- **Mar 21 AM**: SFT v3-v5 experiments — SFT declared dead end
+- **Mar 21 PM**: Eval v2 built (10 dimensions, 25 scenarios), DGX Spark optimization investigation
+- **Mar 21 EVE**: GRPO setup — reward v3 validated, OOM diagnosed, LoRA solution confirmed
+- **Mar 22**: GRPO LoRA training with rank=32, all projections
+---
+*Last updated: 2026-03-22*

docs/LLAMA_FINETUNE_INVESTIGATION.md ADDED

	@@ -0,0 +1,44 @@

+# llama-finetune Investigation for Nemotron-3-Nano-4B
+> Date: 2026-03-24
+## Two Bugs Found
+### Bug 1: Buffer Underflow (small input files)
+- **Location**: `common/common.cpp:1696`
+- **Code**: `ndata = (tokens.size() - ne_datapoint - 1) / stride`
+- **Cause**: When `tokens.size() < ne_datapoint + 1` (input shorter than context), `ndata` underflows to near-max uint64 (~18.4 exabytes allocation)
+- **GitHub Issue**: [#15139](https://github.com/ggml-org/llama.cpp/issues/15139) (Aug 2025, open, no fix)
+- **Fix**: Trivial — add `if (tokens.size() <= ne_datapoint + 1) { error("input too short"); }`
+- **Workaround**: Use input file with more tokens than context length ✅ (we verified this works)
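The underflow in Bug 1 is easy to emulate. The sketch below mimics unsigned 64-bit arithmetic in Python by masking; the function names are hypothetical and the guard mirrors the fix proposed above, not an upstream patch.

```python
UINT64_MASK = (1 << 64) - 1


def ndata_unsigned(n_tokens: int, ne_datapoint: int, stride: int) -> int:
    """Emulate the quoted expression with 64-bit unsigned wraparound."""
    return ((n_tokens - ne_datapoint - 1) & UINT64_MASK) // stride


def ndata_checked(n_tokens: int, ne_datapoint: int, stride: int) -> int:
    """Same computation, but reject inputs that are too short for the context."""
    if n_tokens <= ne_datapoint + 1:
        raise ValueError("input too short for the requested context")
    return (n_tokens - ne_datapoint - 1) // stride


print(ndata_unsigned(2, 64, 1))    # wraps around to a value just below 2**64
print(ndata_checked(200, 64, 1))   # a normal-sized input
```

A two-token input with a 64-token context wraps to a count just below 2^64, which is consistent with the absurd allocation size in the reproduction log.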
+### Bug 2: Backward Pass Assert Failure (fundamental)
+- **Location**: `ggml/src/ggml.c:6998`
+- **Assert**: `!node->view_src || node->op == GGML_OP_CPY || node->op == GGML_OP_VIEW || node->op == GGML_OP_RESHAPE || node->op == GGML_OP_PERMUTE || node->op == GGML_OP_TRANSPOSE`
+- **Cause**: The NemotronH forward graph uses view tensors with operations not in the backward pass whitelist. Likely from the Mamba-2 SSM scan operation, which uses views for state manipulation.
+- **GitHub Issue**: [#15279](https://github.com/ggml-org/llama.cpp/issues/15279) (Aug 2025, open, no fix) — same assert for Saiga Nemo 12B
+- **Fix**: Non-trivial — requires adding backward pass support for SSM operations in GGML
+- **Workaround**: None. **llama-finetune cannot train Mamba/NemotronH models.**
+## Reproduction
+```bash
+# Bug 1: small input
+echo "Hello" > /tmp/small.txt
+llama-finetune -m nemotron-f16.gguf -f /tmp/small.txt -c 64
+# → ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 18446744073709547520
+# Bug 2: proper input (bypasses Bug 1)
+# Use 150K char training file
+llama-finetune -m nemotron-f16.gguf -f /tmp/train_proper.txt -c 128 -ngl 99
+# → GGML_ASSERT(!node->view_src || ...) at ggml.c:6998
+# → Crash in ggml_build_backward_expand → ggml_opt_build → llama_context::opt_epoch_iter
+```
+## Conclusion
+**llama-finetune does not support Mamba-2 / NemotronH architecture for training.**
+The backward pass graph builder in GGML cannot handle the SSM operations used by Mamba-2 layers. This is an upstream limitation, not a configuration issue. Both bugs are open on GitHub with no PRs or fixes as of 2026-03-24.
+llama.cpp supports NemotronH for **inference only** (merged Dec 2025, [PR #18058](https://github.com/ggml-org/llama.cpp/pull/18058)). Training support would require implementing backward passes for the SSM-specific GGML operations.

docs/LORA_V1_ANALYSIS.md ADDED

	@@ -0,0 +1,109 @@

+# LoRA v1 vs SFT v5 — Deep Analysis
+*Generated: 2026-03-30*
+## Results
+| Model | Score | 0/3 | 1/3 | 2/3 | 3/3 | on_topic | uses_guest | probing |
+|-------|-------|-----|-----|-----|-----|----------|------------|---------|
+| Base | 0.653 | 8% | 28% | 24% | 40% | 68% | 48% | 80% |
+| SFT-v5 (LoRA≈16, hidden) | 0.667 | 16% | 16% | 20% | 48% | 76% | 60% | **64%** ← damaged |
+| LoRA-v1 (r=64, explicit) | 0.733 | 4% | 20% | 28% | 48% | 72% | 56% | **92%** ← best |
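The Score column appears to be the mean of the three per-dimension pass rates, and it also matches the k/3 histogram. That relationship is an observation from the numbers in the table, not a documented definition; the sketch below just checks it.

```python
# Percentages copied from the results table; "hist" is the 0/3, 1/3, 2/3, 3/3 split.
RESULTS = {
    "Base":    {"dims": (68, 48, 80), "hist": (8, 28, 24, 40)},
    "SFT-v5":  {"dims": (76, 60, 64), "hist": (16, 16, 20, 48)},
    "LoRA-v1": {"dims": (72, 56, 92), "hist": (4, 20, 28, 48)},
}


def score_from_dims(dims) -> float:
    """Mean of on_topic, uses_guest and probing pass rates, as a fraction."""
    return sum(dims) / (100 * len(dims))


def score_from_hist(hist) -> float:
    """Expected value of k/3 when hist[k] percent of prompts pass k of 3 checks."""
    return sum(k * pct for k, pct in enumerate(hist)) / 300


for name, row in RESULTS.items():
    print(name, round(score_from_dims(row["dims"]), 3), round(score_from_hist(row["hist"]), 3))
```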
+## Finding 1: SFT v5 Was Not a True Full Fine-Tune
+`full_finetuning=True` was set, but Unsloth silently fell back to LoRA for NemotronH.
+Training log showed 10.1M/2.66B (0.38%) trainable — equivalent to r≈16 LoRA.
+LoRA v1: 40.5M/4.01B (1.01%) trainable — explicit r=64. **Both were LoRA.**
+## Finding 2: Three Factors Drove the Performance Gap
+**A — Rank (r≈16 implicit → r=64 explicit)**
+4x more trainable params → richer task subspace. r=64 can capture topic-tracking,
+guest-reference, and question depth simultaneously. r≈16 can only shift broad style.
+**B — Overfitting from 3 epochs at low capacity**
+SFT v5 ran 897 steps (3 epochs). At 0.38% capacity, it memorized surface question
+patterns ("How do you...") without preserving depth. Probing collapsed 80%→64%.
+LoRA v1 at 1 epoch (299 steps) reinforced the base instinct lightly: probing 80%→92%.
+**C — LR mismatch (1e-5 vs 2e-4)**
+LR=1e-5 is appropriate for full fine-tuning. For LoRA adapters starting from random,
+it's too low → slow, shallow adaptation. LR=2e-4 is the correct scale for LoRA r=64.
+## Finding 3: The Probing Dimension Is the Key Discriminator
+- `on_topic`: mostly in the frozen 2.66B backbone, marginally trainable
+- `uses_guest`: strongly in training signal (96% of pairs have word overlap), both models learned it
+- `probing`: **the critical one** — base model is already good at it (80%); 3 epochs of SFT destroyed it; 1 epoch of LoRA improved it
+## Finding 4: LoRA v1's 3 Failures Are All `uses_guest` Regressions
+Pattern: when guest uses domain-specific jargon (graph/constraint, meta-prompt, organisms),
+LoRA generalizes the concept but loses the exact vocabulary. Training data coverage for
+abstract/technical domains may be thinner. These are edge cases — 3/25 prompts.
+## Finding 5: Training Data Is Well-Suited for LoRA
+- 4,772 pairs: 697 real Lex + 4,075 generated
+- 96% have ≥1 guest word in the question
+- Avg 2.4 word overlap → strong uses_guest signal
+- Mean question length: 16.1 words (models produce 14.8-15.2) — well matched
+- Signal is learnable in 1 epoch at r=64. More epochs = surface memorization.
+## Config Comparison
+| | SFT v5 | LoRA v1 |
+|---|---|---|
+| Trainable params | 10.1M (0.38%) | 40.5M (1.01%) |
+| Effective rank | ~16 (implicit) | 64 (explicit) |
+| LR | 1e-5 | 2e-4 |
+| Epochs | 3 | 1 |
+| Steps | 897 | 299 |
+| Dropout | 0.05 | 0 (full Unsloth fusion) |
+| Batch (eff) | 24 | 16 |
+## Next Step Recommendations
+1. **uses_guest gap (56% vs 60% for SFT-v5)**: Try LoRA v2 with more aggressive
+   vocabulary-echo examples in training data, or train on real-Lex-only (697 pairs)
+   to see if generated pairs are diluting the exact-vocab signal.
+2. **Probing is near ceiling (92%)**: The bottleneck is now uses_guest.
+   Getting uses_guest from 56% to 70%+ with probing maintained would push score ~0.80.
+3. **GRPO from LoRA v1**: Use reward_v11 with LoRA v1 as starting checkpoint.
+   GRPO can directly optimize the uses_guest×probing joint objective.
+---
+## LoRA v2 Results (2026-03-30) — Filter + Upsample Experiment
+### Config
+- Dataset: `data/sft_v6_train.jsonl` (6,933 pairs)
+  - Removed 1,324 generic-opener generated pairs (32% of generated)
+  - Upsampled real Lex 6× → 4,182 effective real examples (60% of training)
+  - Generic opener rate: 0% (was 26% in v5)
+- Same LoRA: r=64, LR=2e-4, 1 epoch, gradient_checkpointing="unsloth"
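The dataset composition follows from the v5 counts. A quick arithmetic check, using only numbers quoted in these notes (the variable names are ours):

```python
REAL_PAIRS = 697          # real Lex pairs in v5
GENERATED_PAIRS = 4_075   # generated pairs in v5
REMOVED_GENERIC = 1_324   # generic-opener generated pairs dropped
UPSAMPLE = 6              # real pairs repeated 6x

real_effective = REAL_PAIRS * UPSAMPLE
generated_kept = GENERATED_PAIRS - REMOVED_GENERIC
total = real_effective + generated_kept

print(real_effective, generated_kept, total)
print(f"real share: {real_effective / total:.1%}")
print(f"generated removed: {REMOVED_GENERIC / GENERATED_PAIRS:.1%}")
```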
+### Results
+| Model | Score | on_topic | uses_guest | probing |
+|-------|-------|----------|------------|---------|
+| Base | 0.653 | 68% | 48% | 80% |
+| **LoRA v1** (r=64, original data) | **0.733** | 72% | 56% | 92% |
+| LoRA v2 (filtered+upsampled) | 0.640 | 64% | **48%** | 80% |
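The Score column is consistent with an unweighted mean of the three pass rates. That aggregation is inferred from the numbers in the table, not stated by the eval code, so treat the function below as an assumption.

```python
rows = {
    "Base": (68, 48, 80),
    "LoRA v1": (72, 56, 92),
    "LoRA v2": (64, 48, 80),
}

def functional_score(on_topic: int, uses_guest: int, probing: int) -> float:
    """Unweighted mean of the three pass rates (given in percent), as a fraction."""
    return (on_topic + uses_guest + probing) / 300

scores = {name: round(functional_score(*rates), 3) for name, rates in rows.items()}
print(scores)
```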
+### Finding: Filtering Didn't Help uses_guest
+uses_guest is 48% in LoRA v2, **identical to the base model**: training did not move it at all.
+The filtering + upsampling hypothesis was wrong, or at least insufficient.
+**Why it failed:**
+1. The template contamination theory predicted filtering generic openers would let the model learn vocab-echo. But the model's uses_guest behavior didn't change at all.
+2. Upsampling real Lex 6× didn't help either — the 697 real pairs have mean overlap of only 1.76 (lower than generated 2.21), so more real Lex doesn't automatically mean more vocab-echo.
+3. The fundamental issue: the model already "knows" how to reference guest vocabulary (it does so 48% of the time from base). The bottleneck is not training signal — it's something deeper in how the model decides when to echo vs generalize.
+**Conclusion:** Data-side interventions (filtering, upsampling, prompt engineering) cannot push uses_guest beyond 48-56%. The mechanism is encoded in model weights at a level that SFT can only partially reach when LoRA is applied to just the 4 attention layers.
+**LoRA v1 (0.733) remains best.** The path forward is reward_v12 GRPO from the LoRA v1 checkpoint — directly suppressing template probability via negative advantage signal.

docs/LORA_V2_NATIVE_RESULTS.md ADDED

	@@ -0,0 +1,66 @@

+# LoRA v2 Native — Correct Kernel Forward Path (2026-04-02)
+## Summary
+**Score: 0.760** — Best ever on functional eval. First fine-tune to clearly improve `uses_guest` without damaging `probing`.
+## The Kernel Fix
+All previous NemotronH training on GB10 (Blackwell SM 12.1) used a broken forward path:
+- NVIDIA's custom `modeling_nemotron_h.py` (via `trust_remote_code=True`) hardcoded `is_fast_path_available = False`
+- This forced the naive `torch_forward` SSM scan which compounds numerical errors across 42 layers
+- Result: PPL ~2,126 vs vLLM's ~10-20 on the same weights
+**Fix:** Use the native NemotronH implementation built into transformers 5.3.0, with 4 changes (three code patches and one config conversion):
+1. **Config validator**: add `"mlp"` to valid block types (4B model has plain MLP layers, not MoE)
+2. **MIXER_TYPES**: map `"mlp"` → `NemotronHMLP`
+3. **block_type_to_mask**: add `"mlp": None`
+4. **Config format**: convert `hybrid_override_pattern` string → `layers_block_type` list
+The native implementation uses `cuda_kernels_forward` with `mamba_chunk_scan_combined` — production Triton kernels that work correctly on SM 12.1.
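The config conversion in step 4 can be sketched as a character-to-name mapping. This is a hedged illustration: only `-` → `mlp` is stated in these notes, the other character codes are assumptions about the pattern format, and `pattern_to_list` is a hypothetical helper, not the patched transformers function.

```python
# Assumed character codes; only "-" -> "mlp" is confirmed by these notes.
PATTERN_TO_BLOCK = {
    "M": "mamba",
    "*": "attention",
    "-": "mlp",
    "E": "moe",
}

def pattern_to_list(pattern: str) -> list[str]:
    """Expand a hybrid_override_pattern string into a layers_block_type list."""
    try:
        return [PATTERN_TO_BLOCK[ch] for ch in pattern]
    except KeyError as err:
        raise ValueError(f"unknown block code {err.args[0]!r} in pattern") from None

# Toy pattern for illustration, not the 4B model's real layout.
blocks = pattern_to_list("M-M*-")
print(blocks)
```

Raising on unknown characters keeps a malformed pattern from silently producing a model with the wrong layer order.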
+## Training Config
+| Parameter | Value |
+|-----------|-------|
+| Base model | NVIDIA-Nemotron-3-Nano-4B |
+| Forward path | Native transformers `cuda_kernels_forward` |
+| LoRA rank | 64 |
+| LoRA alpha | 128 |
+| Target modules | q/k/v/o/gate/up/down_proj |
+| Dataset | `sft_v5_train.jsonl` (4,772 pairs) |
+| Epochs | 1 |
+| LR | 2e-4 (cosine, 30-step warmup) |
+| Batch size | 2 × 8 = 16 |
+| Thinking | Disabled (`enable_thinking=False`) |
+| Runtime | 12 min 35 sec (299 steps) |
+| Avg train loss | **1.289** |
+| W&B | `lex-sft-lora-v2-native-r64` (run `qlt9jzdc`) |
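The step count and learning-rate shape implied by this table can be reproduced as follows. A sketch assuming the common linear-warmup plus half-cosine decay to zero; the exact scheduler used by the training script is not shown in these notes.

```python
import math

BASE_LR = 2e-4
WARMUP_STEPS = 30
TOTAL_STEPS = 299  # ceil(4772 pairs / effective batch 16)

def lr_at(step: int) -> float:
    """Linear warmup to BASE_LR, then cosine decay to zero at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

steps_per_epoch = math.ceil(4772 / 16)
for step in (0, 15, 30, 150, 299):
    print(step, f"{lr_at(step):.2e}")
```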
+## Eval Results (3-judge functional)
+| Model | Score | on_topic | uses_guest | probing |
+|-------|-------|----------|------------|---------|
+| Base | 0.753 | — | 58% | 80% |
+| LoRA v1 (broken path) | 0.733 | — | 56% | 92% |
+| LoRA v2 old (broken path) | 0.667 | — | 44% | 64% |
+| **LoRA v2 native** | **0.760** | **80%** | **68%** | **80%** |
+## Key Findings
+- **uses_guest: 68%** — +10pp over base, +24pp over broken LoRA v2
+- **probing: 80%** — stable (broken LoRA v2 destroyed this to 64%)
+- Train loss 1.289 vs 21.86 on broken path — the model was actually learning meaningful patterns
+- Correct forward path is necessary for any training on GB10 Blackwell
+## Files
+- Training script: `scripts/train_sft_lora_v2_native.py`
+- Launch script: `run_sft_lora_v2_native.sh`
+- Adapter: `lora/sft-lora-v2-native/`
+- Native config: `models/NVIDIA-Nemotron-3-Nano-4B/config_native.json`
+- Patched transformers files (in `.venv-train`):
+  - `transformers/models/nemotron_h/configuration_nemotron_h.py`
+  - `transformers/models/nemotron_h/modeling_nemotron_h.py`

docs/MAMBA_SSM_BUILD_NOTES.md ADDED

	@@ -0,0 +1,114 @@

+# mamba-ssm Build Notes — DGX Spark (GB10, aarch64)
+Date: 2026-03-26
+Status: ✅ Successfully installed
+---
+## Problem
+On the DGX Spark, `pip install mamba-ssm` fails with:
+```
+RuntimeError: The detected CUDA version (12.0) mismatches the version that was
+used to compile PyTorch (13.0). Please make sure to use the same CUDA versions.
+```
+PyTorch in `.venv-train` is compiled for CUDA 13.0, but `/usr/bin/nvcc` points to a CUDA 12.0 toolchain:
+```bash
+$ which nvcc
+/usr/bin/nvcc
+$ nvcc --version
+Cuda compilation tools, release 12.0, V12.0.140
+```
+The DGX Spark actually has CUDA 13.0 at `/usr/local/cuda-13.0/`. The default symlink `/usr/local/cuda` → `/etc/alternatives/cuda` resolves to the wrong version.
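The mismatch can be detected before starting a 45-minute build. A minimal sketch that parses the `nvcc --version` text shown above; the helper names are hypothetical.

```python
import re
from typing import Optional

def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract the 'major.minor' release from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None

def toolchain_matches(nvcc_output: str, torch_cuda: str) -> bool:
    """True when nvcc's release equals the CUDA version PyTorch was built with."""
    return parse_nvcc_release(nvcc_output) == torch_cuda

sample = "Cuda compilation tools, release 12.0, V12.0.140"
print(parse_nvcc_release(sample), toolchain_matches(sample, "13.0"))
```

In a real check, `torch_cuda` would come from `torch.version.cuda` and the nvcc text from running `nvcc --version` in a subprocess.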
+---
+## Fix
+Override `CUDA_HOME` and `PATH` to point at the correct CUDA version:
+```bash
+cd /home/bobber/lex-ft
+source .venv-train/bin/activate
+CUDA_HOME=/usr/local/cuda-13.0 \
+PATH=/usr/local/cuda-13.0/bin:$PATH \
+TORCH_CUDA_ARCH_LIST="12.0" \
+pip install mamba-ssm causal-conv1d --no-build-isolation
+```
+Notes:
+- `TORCH_CUDA_ARCH_LIST="12.0"`: builds SM 12.0 binaries, which also run on the GB10 (Blackwell, SM 12.1)
+- `--no-build-isolation` — uses the venv's torch for cpp_extension compatibility
+- Build takes ~45 minutes (9 CUDA kernel files for mamba-ssm, aarch64 compilation is slower)
+---
+## Installed Versions
+```
+mamba_ssm-2.3.1-cp312-cp312-linux_aarch64.whl  (351 MB)
+causal_conv1d-1.6.1
+```
+Cached at: `~/.cache/pip/wheels/28/83/54/d45107838fec575b93f5d723f56351cee19a1b13bcd4ec9f3f`
+Future reinstalls in the same venv will use the cached wheel (no recompile).
+---
+## Verification
+```python
+import causal_conv1d
+print(causal_conv1d.__version__)           # 1.6.1
+print(causal_conv1d.causal_conv1d_fn)      # <function causal_conv1d_fn at 0x...>
+import mamba_ssm
+print(mamba_ssm.__version__)               # 2.3.1
+from mamba_ssm.ops.triton.selective_state_update import selective_state_update
+print(selective_state_update)              # <function selective_state_update at 0x...>
+```
+Both `causal_conv1d_fn` and `selective_state_update` are non-None, so the fast CUDA path is active.
+After import, the Nemotron model will no longer print:
+```
+WARNING: The fast path is not available because one of
+(selective_state_update, causal_conv1d_fn, causal_conv1d_update) is None.
+Falling back to the naive implementation.
+```
+---
+## What This Fixes
+Without `selective_state_update`, the decode step falls back to Python with BF16 arithmetic. This produces wrong SSM states vs training conditions, causing P(`</think>`) ≈ 0 — the model never closes its thinking block.
+With the real CUDA kernel:
+- Decode runs in float32 (matches llama.cpp behavior)
+- SSM state matches training distribution
+- P(`</think>`) should be non-trivial
+- Enables fully on-policy GRPO without llama.cpp server
+---
+## Files Affected
+When rebuilding a fresh venv on this machine, always use the CUDA 13.0 path:
+```bash
+export CUDA_HOME=/usr/local/cuda-13.0
+export PATH=/usr/local/cuda-13.0/bin:$PATH
+```
+Add to `.venv-train/bin/activate` if you want it persistent:
+```bash
+echo 'export CUDA_HOME=/usr/local/cuda-13.0' >> .venv-train/bin/activate
+echo 'export PATH=/usr/local/cuda-13.0/bin:$PATH' >> .venv-train/bin/activate
+```

docs/NEMOTRON_GB10_DEEP_DIVE.md ADDED

	@@ -0,0 +1,321 @@

+# Nemotron-3-Nano-4B on GB10: Complete Deep Dive
+> Date: 2026-03-24
+> Purpose: Comprehensive record of all findings, blockers, and alternatives for training Nemotron-4B on NVIDIA DGX Spark (GB10, SM 12.1)
+## Executive Summary
+**Nemotron-3-Nano-4B cannot be trained on GB10.** The model uses Mamba-2 (SSM) layers that require custom CUDA/Triton kernels (`mamba_ssm`, `causal_conv1d`). These kernels don't compile on SM 12.1 (too new), and no pure-PyTorch fallback produces correct outputs. The model works perfectly for **inference** via llama.cpp.
+## Hardware Context
+- **GPU**: NVIDIA GB10 (Blackwell), SM 12.1, 128 GB unified memory
+- **SM 12.1** is the newest compute capability — released ahead of full software ecosystem support
+- **What works on GB10**: llama.cpp (hand-written CUDA, added SM 12.1 explicitly), standard PyTorch ops, vLLM 0.18.0 (with torch 2.10)
+- **What doesn't**: Any Python package with custom CUDA kernels compiled for SM ≤ 12.0
+## The Mamba-2 Problem
+NemotronH architecture = 38 Mamba-2 layers + 4 Attention layers + MLP layers (42 total).
+The Mamba-2 layers require three custom kernel packages:
+### 1. `mamba_ssm` (state-spaces/mamba)
+- **What**: Triton kernel for gated RMSNorm (`rmsnorm_fn`) and SSM scan (`mamba_chunk_scan_combined`)
+- **SM 12.1 status**: Won't compile (Triton JIT generates incompatible PTX)
+- **Why needed**: `rmsnorm_fn` is a gated RMSNorm with group normalization — used inside every Mamba-2 mixer. Not a standard operation.
+### 2. `causal_conv1d` (Dao-AILab)
+- **What**: Fused causal 1D convolution CUDA kernel
+- **SM 12.1 status**: Binary `.so` has undefined symbols (`_ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib`)
+- **Why needed**: Applied before SSM scan in every Mamba layer
+### 3. `deep_ep` (DeepSeek)
+- **What**: Expert parallelism for MoE models
+- **SM 12.1 status**: Only targets `sm_90`, aarch64 glibc headers incompatible with nvcc
+- **Why needed**: Hard dependency of NeMo RL's `[vllm]` extra (not actually needed for 4B dense model, but can't be skipped)
+## All Training Paths Attempted
+### Path 1: NeMo RL (bare metal)
+- **Result**: ❌ `deep_ep` compilation failure
+- **Detail**: Ray worker isolation creates fresh venvs, re-triggers the build
+### Path 2: NeMo RL (Docker)
+- **Result**: ❌ Triton JIT failure in vLLM worker
+- **Detail**: Container has CUDA 12.9, but Triton generates incompatible PTX for SM 12.1
+- **Note**: Container warns "WARNING: Detected NVIDIA GB10 GPU, which may not yet be supported"
+### Path 3: PyTorch + `trust_remote_code=True`
+- **Result**: ❌ `causal_conv1d` ImportError (broken `.so`)
+- **Fix attempted**: Uninstalled broken packages
+- **New result**: ❌ `mamba_ssm` hard-required at import time, `raise ImportError("mamba-ssm is required")`
+### Path 4: PyTorch + native transformers (no `trust_remote_code`)
+- **Result**: ❌ Config parser broken
+- **Detail**: `hybrid_override_pattern` uses `-` for MLP layers, but native transformers only recognizes `mamba`, `attention`, `moe` — not `mlp`
+- **Fix attempted**: Patched `_pattern_to_list` to map `-` → `mlp`
+- **New result**: ❌ `layers_block_type contains invalid types: {'mlp'}`
+- **Root cause**: transformers 5.3.0's NemotronH implementation doesn't have MLP as a block type. The model's architecture doesn't match what native transformers expects.
+### Path 5: PyTorch + mocked `mamba_ssm`
+- **Result**: ❌ Model loads but produces garbage outputs
+- **Detail**: Mocked `rmsnorm_fn` with exact reference implementation from mamba_ssm source. Also mocked `selective_state_update = None`, `causal_conv1d_fn = None`.
+- **Diagnostics**:
+  - All 263 weights load correctly
+  - All 42 layers produce non-trivial activations (no NaN, no collapse)
+  - BUT: model generates only spaces/punctuation (greedy decoding produces whitespace)
+  - Per-token log-probs: -4 to -12 (expected: -1 to -3 for working model)
+  - Perplexity: ~4,595 (expected: ~10-20)
+  - Top predictions at every position: `' '`, `'2'`, `'\n'`, `','` — not word tokens
+- **Root cause**: NVIDIA's `torch_forward` is a "naive implementation" (their comment) — it was never intended to match the Triton kernel outputs exactly. The SSM scan, chunking, and numerical precision differ enough that 38 layers of accumulated error produce garbage.
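For reference, the perplexity figures in these diagnostics relate to per-token log-probs as follows. A minimal sketch, assuming natural-log probabilities and a plain mean over tokens; the sample values are invented for illustration, not measured.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-prob")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

healthy = [-1.2, -2.5, -0.8, -3.0, -1.9]    # illustrative values only
broken = [-6.0, -9.5, -8.0, -11.0, -7.5]    # illustrative values only
print(f"healthy ~{perplexity(healthy):.1f}, broken ~{perplexity(broken):.0f}")
```

A mean log-prob near -8.4 gives a perplexity in the thousands, the same range as the ~4,595 reported for the mock model, while a perplexity of 10-20 corresponds to a mean log-prob of roughly -2.3 to -3.0.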
+### Path 6: GRPO v4 (llama.cpp generation + PyTorch mock training)
+- **Result**: ❌ Ran 24 steps, loss oscillated wildly (-6.5 to +8.5), reward flat
+- **Detail**: The training loop was architecturally correct (on-policy, LoRA, GGUF sync). But the PyTorch model's garbage log-probs meant the policy gradient signal was noise.
+- **Training time**: 24 steps × ~17 min ≈ 6.8 hours wasted
+### Path 7: `llama-finetune` (GGML native training)
+- **Result**: ❌ Two bugs
+- **Bug 1**: Buffer underflow when input shorter than context (trivial fix, worked around with larger input)
+- **Bug 2**: `GGML_ASSERT(!node->view_src || ...)` failure in `ggml_build_backward_expand`. GGML's backward pass doesn't support SSM operations.
+- **GitHub issues**: [#15139](https://github.com/ggml-org/llama.cpp/issues/15139), [#15279](https://github.com/ggml-org/llama.cpp/issues/15279) — both open since Aug 2025, no fix
+- **Root cause**: llama.cpp added Mamba-2 **inference** support (Dec 2025, PR #18058) but not backward pass. Implementing SSM gradients in GGML is weeks of work.
+## Diagnostic Evidence
+### Layer-by-layer activation analysis (mock model)
+```
+Block                     Mean        Std   Status
+embedding              -0.0001     0.0128       OK
+block_00_mamba         -0.0001     0.0137       OK
+...
+block_20_mlp           -0.0325     0.9453       OK
+...
+block_41_mlp           -0.6250    24.8750       OK
+lm_head                -2.1406     1.9297       OK
+```
+Activations are fine — variance grows normally through layers. No collapse, no NaN. The model architecture is correct, but the probability distributions are wrong.
+### Token prediction analysis (mock model)
+```
+pos 0: actual=' is'  rank=116 | top5: ' '(10.94), '2'(10.88), '3'(10.12)
+pos 2: actual=' meaning'  rank=239 | top5: ' '(12.50), '\n'(11.12), ' ('(10.88)
+pos 4: actual=' life'  rank=4632 | top5: ' '(11.69), ' of'(10.56), ','(10.19)
+```
+The model ranks common words at positions 100-5000+, while spaces and numbers are top-ranked. This is consistent with the SSM scan producing wrong state transitions.
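The `rank` and `top5` columns above can be computed from a logits vector like this. A sketch on a toy vocabulary with made-up logits; whether the original analysis used 0-based or 1-based ranks is not stated, so 0-based is an assumption.

```python
def token_rank(logits: list[float], actual_id: int) -> int:
    """0-based rank of `actual_id`: how many tokens score strictly higher."""
    target = logits[actual_id]
    return sum(1 for value in logits if value > target)

def top_k(logits: list[float], k: int) -> list[int]:
    """Token ids of the k highest logits, best first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

# Toy 6-token vocabulary with made-up logits.
logits = [10.94, 10.88, 10.12, 3.5, 7.25, 1.0]
rank = token_rank(logits, actual_id=4)
best = top_k(logits, 3)
print(rank, best)
```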
+### Memory test (dual model coexistence)
+```
+vLLM subprocess:    ~39 GB (0.3 × 130 GB)
+HF training model:  ~9 GB (BF16 + LoRA)
+Total:             ~55 GB | Free: ~76 GB
+```
+Memory is not a bottleneck. Both models fit comfortably.
+## What Works on GB10
+| Capability | Tool | Status |
+|-----------|------|--------|
+| Nemotron-4B inference | llama.cpp | ✅ Perfect, 30-60 tok/s |
+| Nemotron-4B inference | vLLM 0.18.0 (routangseng venv) | ✅ 3.5 tok/s |
+| Standard transformer training | PyTorch | ✅ No custom kernels needed |
+| Standard transformer inference | llama.cpp, vLLM | ✅ |
+| LoRA training (standard models) | PyTorch + PEFT | ✅ Tested with Qwen, Llama |
+## What Would Fix This
+### Short-term fixes (someone else needs to do the work)
+1. **PyTorch 2.11+** with SM 12.1 in supported range → `mamba_ssm` recompile
+2. **Triton update** with SM 12.1 PTX codegen → NeMo RL Docker works
+3. **llama.cpp backward pass** for Mamba-2 → `llama-finetune` works
+4. **NVIDIA's `torch_forward` fixed** to match Triton kernel → PyTorch mock works
+### What we can do now
+1. **Option 2**: SFT distillation — Nemotron as teacher (llama.cpp), standard transformer as student
+2. **Option 3**: Rent A100/H100 ($1-3/hr), train with NeMo RL directly
+3. **Ship base model** (4.35/5 eval score, already best-in-class)
+## On A100/H100
+Every blocker is SM 12.1 specific. On A100 (SM 8.0) or H100 (SM 9.0):
+- `mamba_ssm`, `causal_conv1d`: ✅ Primary build targets
+- NeMo RL: ✅ NVIDIA trains their own models on H100 clusters
+- vLLM: ✅
+- flash-attn: ✅
+- Estimated cost: $10-25 for a full training run (4-8 hours on A100)
+## Key Lessons
+1. **SM 12.1 is ahead of the software ecosystem.** The DGX Spark hardware works, but the Python ML toolchain hasn't caught up. Every custom CUDA kernel package needs to add SM 12.1 support independently.
+2. **llama.cpp works because it controls the whole stack.** One team, one codebase, added SM 12.1 to CMakeLists.txt, done. The Python ecosystem is a chain of 6+ independent projects that all need to update.
+3. **A model that loads correctly can still be completely broken.** All 263 weights loaded, all 42 layers had non-trivial activations, but the outputs were garbage. Always validate generation quality before training.
+4. **NVIDIA's `torch_forward` is not a production fallback.** It's labeled "naive implementation" for a reason. It produces directionally correct activations but wrong probability distributions.
+5. **Mocking CUDA kernels is dangerous.** Even with the exact reference implementation from the same author, the mock didn't match. Numerical precision differences compound across 38 Mamba layers.
+## A100 Colab Reference Test (2026-03-24 22:00-23:00 UTC)
+### Setup
+- Google Colab A100-SXM4-40GB
+- SSH tunnel via bore (bore.vexorium.net)
+- Installed real `mamba_ssm` 2.3.1 + `causal_conv1d` 1.6.1 (pre-built wheels saved to `lex-ft/wheels/`)
+- transformers 5.0.0, torch 2.10.0+cu128
+### Critical Finding: Model is Broken in HuggingFace on ALL GPUs
+**Even on A100 with real CUDA kernels, the model generates garbage:**
+```
+RAW TEXT:    "What is the meaning of life? 22,,. 22,  1                 "
+CHAT TEMPLATE: "????????????????????????????????????????"
+```
+Top-10 predictions at last position (A100, real CUDA kernels):
+```
+'\n'   logit=14.12
+' '    logit=11.06
+'2'    logit=10.94
+'\n\n' logit=10.88
+'1'    logit=10.69
+'3'    logit=10.56
+```
+The model predicts newlines, spaces, and numbers — not words. **This happens on A100 with `is_fast_path_available=True` and `cuda_kernels_forward` active.**
+### What This Means
+1. **The mock was NOT the problem.** We spent 30+ hours blaming the `rmsnorm_fn` mock and SM 12.1 toolchain, but the HuggingFace loading itself is broken.
+2. **llama.cpp works perfectly** with the same model weights (BF16 GGUF), proving the weights are correct.
+3. **The bug is in HuggingFace `from_pretrained` + `trust_remote_code=True`** for this specific model on transformers 5.0.0.
+### Layer-by-Layer Comparison (A100 vs GB10)
+Reference tensors captured and compared:
+- Embedding: ✅ Perfect match (0.0 diff)
+- block_00_mamba: ❌ 27% relative difference (diverges immediately)
+- All subsequent layers: ❌ Increasing divergence
+But since the A100 model ALSO produces garbage, the divergence between A100 and GB10 is between **two broken implementations**, not "correct vs broken."
+### CUDA vs torch_forward on A100
+Single Mamba layer comparison:
+```
+cuda_kernels_forward: mean=-0.000183, std=0.039062
+torch_forward:        mean=-0.000233, std=0.041504
+Max diff: 0.252930, Relative diff: 48.7%
+```
+The two paths produce significantly different outputs, but **neither produces correct model behavior.**
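One way to produce "max diff" and "relative diff" numbers like those above is shown below. The notes do not say how the relative difference was defined, so the L2-norm ratio here is an assumption, and the activations are made up.

```python
import math

def max_abs_diff(a: list[float], b: list[float]) -> float:
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def relative_diff(a: list[float], b: list[float]) -> float:
    """L2 norm of (a - b) divided by the L2 norm of the reference b."""
    num = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    den = math.sqrt(sum(y * y for y in b))
    return num / den

cuda_out = [0.04, -0.03, 0.01, -0.05]     # made-up activations
torch_out = [0.05, -0.01, 0.02, -0.04]    # made-up activations
print(max_abs_diff(torch_out, cuda_out), relative_diff(torch_out, cuda_out))
```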
+### Possible Root Causes (to investigate)
+1. **transformers version incompatibility**: The model's custom code may require a specific older transformers version
+2. **Weight mapping bug**: The HF repo weights may be mapped to wrong layers by `from_pretrained`
+3. **Config mismatch**: The `config.json` in the BF16 HF repo may differ from what the model code expects
+4. **Missing post-processing**: The model may need specific initialization that `from_pretrained` skips
+### Artifacts Saved
+- Pre-built wheels: `lex-ft/wheels/mamba_ssm-2.3.1-cp312-cp312-linux_x86_64.whl` (509 MB)
+- Pre-built wheels: `lex-ft/wheels/causal_conv1d-1.6.1-cp312-cp312-linux_x86_64.whl` (243 MB)
+- Reference tensors: `lex-ft/reference/reference_tensors.pt` (49 MB) — A100 activations for 3 test texts
+- Colab notebook: `lex-ft/notebooks/capture_reference_tensors.ipynb`
+## Updated Conclusion (2026-03-24 23:00 UTC)
+**The root cause was misidentified.** We blamed SM 12.1 and the `rmsnorm_fn` mock for 30+ hours, but the A100 test proves:
+1. The model generates garbage on A100 too (real CUDA kernels, `is_fast_path_available=True`)
+2. The `torch_forward` and `cuda_kernels_forward` do diverge (49%), but neither is correct
+3. llama.cpp generates perfect text with the same weights
+**The real bug is in HuggingFace transformers' loading/inference of this model**, not in GPU compatibility. This completely changes the path forward:
+- **If we fix the HF loading bug**: Training works on ANY GPU (including GB10 with mock)
+- **If we can't fix it**: Option 2 (SFT distillation) or Option 3 (cloud + NeMo RL) remain viable
+### Next Steps
+1. Investigate why HF `from_pretrained` produces garbage while llama.cpp works
+2. Check if NVIDIA's NeMo toolkit loads this model correctly (it should — they train with it)
+3. Check if the HF repo `nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16` has a known issue or requires specific transformers version
+4. Compare weight names/shapes between HF and GGUF to find mapping errors
+---
+## Q8 vs BF16 Mismatch Quantification (2026-03-25 03:35 UTC)
+### Test Setup
+- llama.cpp Q8 server: generates greedy next token at each position
+- PyTorch BF16 model (transformers 4.48.3 + mock `torch_forward`): computes log-probs
+- 3 test texts, 10 positions each
+### Results
+| Metric | Value |
+|--------|-------|
+| Q8-BF16 top-1 agreement | **27%** |
+| Q8-BF16 top-5 agreement | **43%** |
+| PyTorch BF16 perplexity | **250-443** (expected: 10-50 for working model) |
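The agreement metrics in this table can be computed as sketched below. The token lists are invented for illustration, and `topk_agreement` is a hypothetical helper rather than the script used for the measurement.

```python
def topk_agreement(reference_tokens: list[str],
                   candidate_topk: list[list[str]], k: int) -> float:
    """Fraction of positions where the reference token is in the candidate's top-k."""
    if len(reference_tokens) != len(candidate_topk):
        raise ValueError("one top-k list is needed per position")
    hits = sum(ref in cand[:k] for ref, cand in zip(reference_tokens, candidate_topk))
    return hits / len(reference_tokens)

# Made-up example: llama.cpp greedy tokens vs. PyTorch top-5 lists at 4 positions.
q8_greedy = [" is", " find", " your", " in"]
pt_top5 = [
    [" of", " is", " the", ",", " a"],
    [" the", " a", " be", " to", " find"],
    [" a", " the", " an", " it", " this"],
    [" in", " of", ".", ",", " and"],
]
top1 = topk_agreement(q8_greedy, pt_top5, k=1)
top5 = topk_agreement(q8_greedy, pt_top5, k=5)
print(top1, top5)
```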
+### Position-Level Examples
+```
+Text: "The meaning of life is to find purpose..."
+  pos 3: actual=' is'    PT=' of'     Q8=' is'     PT_lp=-3.91
+  pos 5: actual=' find'  PT=' the'    Q8=' find'   PT_lp=-5.07
+  pos 6: actual=' purpose' PT=' a'   Q8=' your'   PT_lp=-10.59
+```
+**llama.cpp Q8 predictions are much closer to ground truth than PyTorch BF16.** Q8 predicts "is", "find" correctly; PyTorch predicts "of", "the" — generic tokens, not context-appropriate ones.
+### Root Cause: `torch_forward` SSM Scan is Numerically Wrong
+The PyTorch perplexity of 250-443 on simple English text (should be 10-50) confirms that the `torch_forward` naive SSM implementation produces **wrong probability distributions**, not just slightly different ones.
+This is NOT a quantization mismatch (Q8 vs BF16). Even comparing BF16 PyTorch against the same BF16 weights in llama.cpp would show the same divergence — the issue is `torch_forward` vs llama.cpp's C++/CUDA Mamba-2 kernels computing different results.
+Earlier evidence supports this:
+- A100 reference test: block_00_mamba already diverges 27% from torch_forward
+- `cuda_kernels_forward` vs `torch_forward` on A100: 49% relative difference on same layer
+- NVIDIA's code labels this path "naive implementation" — it was never intended for production
+### Implications for Training
+**GRPO with Q8 generation + BF16 torch_forward training is fundamentally unstable** because:
+1. The model that generates (llama.cpp) and the model that computes gradients (PyTorch) disagree on 73% of top-1 predictions
+2. Policy ratios `π_new/π_old` are noise when the two probability distributions barely overlap
+3. This explains the oscillating loss (-6.2 to +3.9) and exploding grad norms (644-1207) in all GRPO runs
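The effect on the policy ratio is easy to see numerically. A sketch assuming a PPO-style clipped ratio with `eps = 0.2`, which is a common default and not necessarily the value used in these runs; the log-probs are made up.

```python
import math

def clipped_ratio(logp_new: float, logp_old: float, eps: float = 0.2) -> tuple[float, float]:
    """Return (raw ratio, ratio clipped to [1 - eps, 1 + eps])."""
    ratio = math.exp(logp_new - logp_old)
    return ratio, min(max(ratio, 1.0 - eps), 1.0 + eps)

# On-policy: both log-probs come from the same model, so the ratio is close to 1.
on_policy = clipped_ratio(logp_new=-1.10, logp_old=-1.15)
# Mismatched: the sampler was confident (-0.3) but the training model is not (-5.1).
mismatched = clipped_ratio(logp_new=-5.1, logp_old=-0.3)
print(on_policy, mismatched)
```

When most tokens land in the mismatched case, the ratio sits outside the clip range from the first step, so the update carries little information about the sampled behavior.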
+**SFT training is also compromised** because the forward pass computes wrong CE loss:
+- CE loss was 3.88 with transformers 4.48.3, which looked reasonable, but the measured perplexity of 250-443 is NOT reasonable
+- The model "learns" but from wrong gradients
+- This is why the smoke test SFT produced "assistant assistant assistant" — the model was memorizing surface patterns, not learning from correct probability gradients
+## Final Definitive Conclusion (2026-03-25)
+**Nemotron-3-Nano-4B cannot be correctly trained on GB10 with ANY approach** because the `torch_forward` Mamba-2 SSM scan produces numerically incorrect results. This is true regardless of:
+- transformers version (4.48.3 or 5.x)
+- GPU (GB10 or A100 — A100 with `torch_forward` has the same issue)
+- Training method (GRPO, SFT, custom loop)
+- Mock implementation (Kaggle-style, our version, any pure PyTorch)
+The ONLY correct training paths require the real `mamba_ssm` CUDA/Triton kernels. These build on A100 (SM 8.0) and H100 (SM 9.0) but are NOT available on GB10 (SM 12.1).
+### Viable Paths Forward
+1. **Cloud A100 training** ($10-25): Install transformers 4.48.3 + real `mamba_ssm` kernels on A100. Train with NeMo RL or SFTTrainer. Deploy result on GB10 via llama.cpp.
+2. **SFT distillation to standard transformer**: Use llama.cpp Nemotron for generation (works perfectly), train a Qwen/Llama student model (standard transformers, no Mamba-2) on that data. No `torch_forward` needed.
+3. **Ship base model**: Nemotron-4B base scores 4.35/5. Already best-in-class.
+---
+*Total investigation time: ~40 hours across 2026-03-23 to 2026-03-25*
+*Approaches tried: 9*
+*Lines of test code written: ~4,000*
+*Training steps run: 33 (all from wrong gradients)*
+*Key finding: `torch_forward` SSM scan is numerically wrong — not a workaround-able issue*

docs/NEMO_RL_SETUP_NOTES.md ADDED

	@@ -0,0 +1,168 @@

+# NeMo RL Setup Notes — GB10
+> Date: 2026-03-23
+## Installation Status
+### ✅ Base NeMo RL (v0.5.0rc0)
+- Cloned to `/home/bobber/nemo-rl` (with submodules)
+- Python 3.12, torch 2.9.0, uv 0.11.0
+- `uv venv` + `uv run python -c "import nemo_rl"` → works
+- 215 packages installed
+### ✅ vLLM 0.11.2 (manually installed)
+- `uv pip install vllm` → installed vLLM 0.11.2
+- NemotronH model support: `nemotron_h.py` exists ✅
+- Mamba2 support: `mamba2.py` exists ✅
+### ❌ `[vllm]` extra failed
+- `deep_ep` (DeepSeek Expert Parallelism) fails to compile on aarch64
+  - Targets `sm_90` only, `__builtin_dynamic_object_size` not found in glibc headers with nvcc
+  - This is a multi-node MoE optimization — **not needed** for single-GPU 4B training
+- Workaround: installed vLLM directly without deep_ep
+### ❌ Not yet tested
+- `flash-attn`, `mamba-ssm`, `causal-conv1d` (the `[fsdp]` extra)
+  - These had build issues in the routangseng venv too
+  - NeMo RL's DTensor training backend might need them
+  - vLLM has its own Mamba kernels that work on GB10
+## Architecture
+NeMo RL's GRPO flow:
+1. **Generation**: vLLM generates completions (subprocess, ~39 GB)
+2. **Training**: DTensor or Megatron backend trains the model (main process)
+3. **Weight sync**: After training step, weights are synced to vLLM
+4. **Repeat**: On-policy — same model generates and trains
+Key config file: `examples/configs/grpo_math_1B.yaml`
+- Uses `Qwen/Qwen2.5-1.5B` by default
+- Single GPU mode: `cluster.gpus_per_node: 1`
+- DTensor backend (PyTorch native)
+- vLLM generation with `gpu_memory_utilization: 0.6`
+## LoRA GRPO for Nemotron-3-Nano-4B — Feasibility Analysis
+### ✅ Verdict: Feasible, with caveats
+NeMo RL has **first-class LoRA GRPO support for NemotronH** architecture. NVIDIA ships a recipe for the 30B-A3B variant. The 4B model should work with adaptations.
+### How NeMo RL's LoRA GRPO Works
+1. **Model loading**: `AutoModelForCausalLM.from_pretrained()` with `trust_remote_code=True`
+2. **LoRA injection**: `apply_lora_to_linear_modules()` wraps selected `nn.Linear` layers with `LinearLoRA`
+3. **Training**: DTensor backend (PyTorch FSDP2) trains only LoRA params
+4. **Weight sync to vLLM**: LoRA weights are **merged back** into base weights before sending to vLLM
+   - `_maybe_merge_lora_weight()` computes `W + B×A × (alpha/dim)`
+   - vLLM always sees the full merged model — no separate LoRA loading needed
+5. **On-policy**: Same (merged) model generates and trains each step
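The merge in step 4 is the standard LoRA fold-in. A toy sketch with nested lists, assuming `A` has shape `dim × in` and `B` has shape `out × dim`; this illustrates the formula and is not NeMo RL's `_maybe_merge_lora_weight()` implementation.

```python
def matmul(x: list[list[float]], y: list[list[float]]) -> list[list[float]]:
    """Plain nested-list matrix product, fine for toy shapes."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def merge_lora(w, lora_a, lora_b, alpha: float, dim: int):
    """Return W + (B @ A) * (alpha / dim) without modifying W."""
    scale = alpha / dim
    delta = matmul(lora_b, lora_a)  # (out x dim) @ (dim x in) -> (out x in)
    return [[wv + scale * dv for wv, dv in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Toy shapes: out=2, in=3, rank dim=1, alpha=4 (a 4x scaling).
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]   # dim x in
lora_b = [[0.5], [0.0]]      # out x dim
merged = merge_lora(w, lora_a, lora_b, alpha=4, dim=1)
print(merged)
```

Because the merged matrix has the same shape as `W`, the inference engine needs no LoRA-specific code path.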
+### Reference: NVIDIA's 30B-A3B LoRA Recipe
+```yaml
+# grpo-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
+policy:
+  model_name: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16
+  dtensor_cfg:
+    lora_cfg:
+      enabled: true
+      dim: 128         # rank
+      alpha: 512       # scaling = alpha/dim = 4x
+      exclude_modules: ['*out_proj*']  # ← KEY: Mamba2 out_proj has no gradient with CUDA kernels
+      match_all_linear: false
+      use_triton: false
+```
+### GB10-Specific Advantages for LoRA
+On GB10, `causal_conv1d` and `mamba_ssm` CUDA kernels **won't build** (SM 12.1 incompatibility). The model falls back to `torch_forward` (pure PyTorch). This actually **helps** us:
+- `torch_forward` path: all ops are standard PyTorch → **all layers have gradients**, including `out_proj`
+- 30B recipe excludes `out_proj` because `cuda_kernels_forward` doesn't backprop through it
+- On GB10 with `torch_forward`: we can **include** `out_proj` → train more of the model
+### DTensor Parallelization for NemotronH
+NeMo RL has explicit NemotronH support in `nemo_rl/models/dtensor/parallelize.py`:
+- Custom `_parallelize_nm5_h()` function
+- Shards MLP layers: `mixer.up_proj` (Colwise), `mixer.down_proj` (Rowwise)
+- Mamba layers are NOT tensor-parallel sharded (they can't be easily split)
+- Activation checkpointing supported for both MLP and Mamba layers
+- For single GPU: no TP needed, just FSDP2
+### Proposed 4B Single-GPU LoRA Config
+```yaml
+defaults: grpo_math_1B.yaml
+policy:
+  model_name: /home/bobber/lex-ft/models/NVIDIA-Nemotron-3-Nano-4B
+  tokenizer:
+    name: /home/bobber/lex-ft/models/NVIDIA-Nemotron-3-Nano-4B
+  train_global_batch_size: 16
+  train_micro_batch_size: 1
+  logprob_batch_size: 1
+  max_total_sequence_length: 800  # based on interview segment length data
+  dtensor_cfg:
+    lora_cfg:
+      enabled: true
+      dim: 64           # smaller rank than 30B (4B model needs less capacity)
+      alpha: 256         # scaling = 4x
+      # Do NOT exclude out_proj on GB10 — torch_forward gives us gradients everywhere
+      exclude_modules: []
+      match_all_linear: true  # apply LoRA to ALL linear layers (Mamba + Attention + MLP)
+      use_triton: false        # no flash-attn on GB10
+    activation_checkpointing: true  # save memory
+    cpu_offload: false
+  sequence_packing:
+    enabled: false  # start simple
+  generation:
+    max_new_tokens: 800
+    vllm_cfg:
+      gpu_memory_utilization: 0.3  # tested safe value for GB10
+      enforce_eager: true  # skip CUDA graphs to save memory
+cluster:
+  gpus_per_node: 1
+```
+### Memory Estimate (LoRA)
+From dual memory test:
+- vLLM subprocess: ~39 GB (0.3 utilization)
+- Base model (BF16): ~8 GB
+- LoRA params (rank 64, all linear): ~0.2 GB
+- LoRA optimizer states: ~0.4 GB
+- Activations/gradients: ~5-10 GB (with activation checkpointing)
+- **Total: ~53-58 GB | Free: ~73-78 GB** ← comfortable
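The total can be re-derived from the line items. A small sketch; the 131 GB figure for usable memory is an assumption implied by "total + free" in the estimate, not a measured value.

```python
# (low, high) estimates in GB, copied from the list above.
BUDGET_GB = {
    "vllm_subprocess": (39.0, 39.0),
    "base_model_bf16": (8.0, 8.0),
    "lora_params": (0.2, 0.2),
    "lora_optimizer": (0.4, 0.4),
    "activations_gradients": (5.0, 10.0),
}
TOTAL_MEMORY_GB = 131.0  # assumption: implied by the total and free figures

low = sum(lo for lo, _ in BUDGET_GB.values())
high = sum(hi for _, hi in BUDGET_GB.values())
print(f"used: {low:.1f}-{high:.1f} GB, "
      f"free: {TOTAL_MEMORY_GB - high:.1f}-{TOTAL_MEMORY_GB - low:.1f} GB")
```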
+### Known Risks
+1. **causal_conv1d import error**: The model's `modeling_nemotron_h.py` does `from causal_conv1d import ...` at import time. It has a fallback (`causal_conv1d_fn = None`), but the import itself may fail with `ImportError` if a broken .so exists. May need to patch or mock.
+2. **transformers version**: NeMo RL pins `transformers==4.57.1` (no native NemotronH). Uses `trust_remote_code=True` which loads NVIDIA's custom code. This is the intended path for NemotronH.
+3. **vLLM 0.11.2 vs 0.18.0**: NeMo RL pins vLLM 0.11.2. Our dual memory test used 0.18.0 from the routangseng venv. Need to verify NemotronH works in 0.11.2 (it has `nemotron_h.py`, should be fine).
+4. **torch_forward speed**: Pure PyTorch Mamba is slower than CUDA kernels. Training will be slower than on an A100/H100 with full kernel support. But it will be **correct**.
+5. **Weight sync overhead**: Each GRPO step merges LoRA → syncs to vLLM → generates → trains. The merge is cheap (matrix multiply), but vLLM restart may not be needed if using colocated mode.
+## Next Steps
+1. **Create the GRPO config** (yaml above as starting point)
+2. **Write custom reward function** for interviewer quality
+3. **Test with a simple SFT run first** to validate the pipeline
+4. **Then run LoRA GRPO** with the interviewer reward
+## Key Advantages of NeMo RL over Custom GRPO
+| Feature | Custom GRPO v3 | NeMo RL |
+|---------|----------------|---------|
+| On-policy | ❌ llama.cpp ≠ HF | ✅ vLLM generates, same model trains |
+| KL reference | ❌ Missing | ✅ Built-in reference policy |
+| Architecture | ❌ LoRA only touched 4/42 layers | ✅ LoRA on ALL linear layers (torch_forward) |
+| Weight sync | ❌ Manual merge every N steps | ✅ Automatic merge+sync per step |
+| LoRA GRPO | ❌ Not supported | ✅ DTensor LoRA GRPO with merge-to-vLLM |
+| Tested on Nemotron | ❌ No | ✅ NVIDIA ships 30B-A3B recipe |
+| Mamba gradients | ❌ Only 4 attention layers | ✅ All 42 layers via torch_forward |

docs/ONNX_RETROSPECTIVE.md ADDED

	@@ -0,0 +1,404 @@

+# ONNX Model Export & WebGPU Deployment — Retrospective
+## Project
+Deploy Nemotron-3-Nano-4B (GRPO v12 fine-tuned LoRA) as a browser-based WebGPU chat app via HuggingFace Spaces using transformers.js.
+- **Space**: `bobber/lex-interviewer-chat` (static HF Space)
+- **Model**: `bobber/lex-interviewer-nemotron-4b-grpo-v12`
+- **Reference**: `onnx-community/NVIDIA-Nemotron-3-Nano-4B-BF16-ONNX` + `webml-community/Nemotron-3-Nano-WebGPU`
+- **Date**: 2026-03-31
+---
+## Timeline of Issues
+### 1. WASM 404 / asyncify.mjs (ort-web version conflict)
+- **Symptom**: 404 on `ort-wasm-simd-threaded.asyncify.mjs`
+- **Cause**: `package.json` had ort-web pinned to 1.16.3, which doesn't ship `asyncify.mjs`
+- **Fix**: Remove ort-web override; let transformers.js 4.0.0-next.8 use its bundled ort-web 1.25.0-dev
+### 2. Module.MountedFiles not available
+- **Symptom**: `Failed to load external data file "model_q4.onnx_data", error: Module.MountedFiles is not available`
+- **Cause**: Missing `transformers.js_config` in model's `config.json`. Without `use_external_data_format`, transformers.js falls back to the old Emscripten `Module.MountedFiles` API (removed in ort-web 1.25+)
+- **Fix**: Add `transformers.js_config: { use_external_data_format: { "model_q4.onnx": 2 } }` to config.json
+### 3. ShapeInferenceError on INT64 constants
+- **Symptom**: `Cannot parse data from external tensors` for INT64 constants
+- **Cause**: ONNX repacking with `size_threshold=0` moved ALL tensors external, including small constants that ORT needs inline
+- **Fix**: Repack with `size_threshold=1024` (tensors < 1KB stay inline)
+### 4. ArrayBuffer allocation failed
+- **Symptom**: `RangeError: Array buffer allocation failed`
+- **Cause**: Merging split data files into a single 2.55 GB blob exceeded browser's ~2 GB ArrayBuffer limit
+- **Fix**: Keep the original 2-file split (~2.09 GB + ~465 MB). Each file stays under the limit when it is read as ~2 GiB (2.09 GB ≈ 1.95 GiB)
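A pre-upload size check can catch this class of failure early. The sketch below assumes a per-buffer ceiling of 2 GiB (2^31 bytes); real limits vary by browser and version, so `ARRAYBUFFER_LIMIT` is an assumption and the helper names are hypothetical.

```python
from pathlib import Path

# Assumed per-ArrayBuffer ceiling (2 GiB). Browsers differ, so treat this as a conservative guess.
ARRAYBUFFER_LIMIT = 2**31


def oversized_files(sizes: dict[str, int], limit: int = ARRAYBUFFER_LIMIT) -> list[str]:
    """Return the names of files whose byte size is at or above the limit."""
    return sorted(name for name, size in sizes.items() if size >= limit)


def scan_dir(onnx_dir: str, limit: int = ARRAYBUFFER_LIMIT) -> list[str]:
    """Collect sizes of external-data files in a directory and report the oversized ones."""
    sizes = {p.name: p.stat().st_size for p in Path(onnx_dir).glob("*.onnx_data*")}
    return oversized_files(sizes, limit)


# Approximate sizes from the issue above (decimal GB / MB).
split = {"model_q4.onnx_data": 2_090_000_000, "model_q4.onnx_data_1": 465_000_000}
merged = {"model_q4.onnx_data": 2_555_000_000}
print(oversized_files(split))   # no file reaches the assumed limit
print(oversized_files(merged))  # the merged blob does
```

Running `scan_dir` on the output directory before upload would flag a merged blob without needing a browser.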
+### 5. Missing chat_template
+- **Symptom**: `Cannot use apply_chat_template() because tokenizer.chat_template is not set`
+- **Cause**: Our model repo lacked `chat_template` in `tokenizer_config.json`
+- **Fix**: Copy chat_template from reference model's tokenizer_config.json
+### 6. Numeric gibberish output
+- **Symptom**: Model generates random numbers and symbols
+- **Cause**: Wrong tokenizer — our repo had a different `tokenizer.json` (17 MB vs reference 12.6 MB). Same vocab size but different byte-pair encoding → token IDs decoded to wrong text
+- **Fix**: Use reference model's `tokenizer.json` and `tokenizer_config.json` (same base model, same vocab)
+### 7. Trailing newlines / infinite generation
+- **Symptom**: Model generates answer then infinite `\n` and `<|im_end|>` tokens
+- **Cause**: `generation_config.json` had `eos_token_id: 2` but was missing token 11 (`<|im_end|>`). Model generated end-of-turn but didn't stop
+- **Fix**: Set `eos_token_id: [2, 11]` in both generation_config.json and as runtime override in Space code
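The config side of this fix can be scripted. A minimal sketch, assuming the generation config has already been loaded into a dict; `ensure_eos_ids` is a hypothetical helper, not part of any library.

```python
import json


def ensure_eos_ids(config: dict, required: list[int]) -> dict:
    """Return a copy of a generation config whose eos_token_id is a list containing all required ids."""
    current = config.get("eos_token_id", [])
    ids = [current] if isinstance(current, int) else list(current or [])
    for token_id in required:
        if token_id not in ids:
            ids.append(token_id)
    return {**config, "eos_token_id": ids}


before = {"eos_token_id": 2}
after = ensure_eos_ids(before, [2, 11])
print(json.dumps(after))
```

The helper returns a new dict rather than mutating its input, so the original file contents can still be diffed against the patched version.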
+### 8. "No response" with Reasoning On (THE BIG ONE)
+- **Symptom**: With `enable_thinking: true`, model outputs only `<|im_end|>` immediately (1 chunk, zero content). With `enable_thinking: false`, model works fine
+- **Root cause**: Our custom `quantize_to_matmulnbits()` re-quantized ALL 94 layers. The re-quantized Mamba layers had tiny precision differences from the reference's quantization. On WebGPU (float16 compute), these differences caused the model to output immediate EOS after the `<think>` token
+- **Why CPU worked**: CPU uses float32 for dequantization, which is more tolerant of the precision differences
+- **Fix**: LoRA-only patching — keep reference Q4 weights for non-LoRA layers (Mamba, embedding, lm_head), only re-quantize the 50 layers that LoRA actually changed (attention q/k/v/o_proj + MLP up/down_proj)
+---
+## Root Cause Analysis
+### Why re-quantizing non-LoRA layers broke WebGPU
+The reference ONNX model was quantized by `onnx-community` using their official tooling. Our custom `quantize_to_matmulnbits()` function uses asymmetric uint4 quantization:
+```python
+scales = (block_max - block_min) / 15.0
+zp = round(-block_min / scales)
+q = round(w / scales + zp).clip(0, 15)
+```
+While mathematically correct, different implementations produce slightly different rounding for edge cases. The reference's quantizer may use a different rounding strategy, tie-breaking, or block boundary handling.
+On CPU (float32), these differences are negligible — the dequantized values are close enough. On WebGPU (float16 compute), the accumulated precision loss across 42 layers is enough to cause the model's internal state to diverge, particularly for rarely-exercised code paths like the `<think>` token processing.
+**Key insight**: Quantization is NOT implementation-invariant. Quantizing the same weights with a different quantizer does not reproduce `reference_q4`, and in general `requantize(dequantize(reference_q4))` ≠ `reference_q4`. The reference's quantization produces specific rounding patterns that the model's behavior depends on at float16 precision.
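To make the rounding point concrete, the sketch below quantizes one block twice with the same formula but two tie-breaking rules. It is only an illustration: the reference quantizer's actual rounding rule is not known here, and `quantize_block` is a hypothetical pure-Python stand-in for the NumPy version.

```python
import math


def quantize_block(values, rounder):
    """Asymmetric uint4 quantization of one block using the given rounding function."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15.0 or 1.0
    zero_point = min(15, max(0, rounder(-lo / scale)))
    codes = [min(15, max(0, rounder(v / scale + zero_point))) for v in values]
    return codes, scale, zero_point


def round_half_even(x):
    return round(x)  # Python's built-in rounds ties to the nearest even integer


def round_half_up(x):
    return math.floor(x + 0.5)  # ties always round upward


# min = 0 and max = 15 give scale = 1.0, so 0.5 and 2.5 land exactly on ties.
block = [0.0, 0.5, 2.5, 15.0]
codes_even, _, _ = quantize_block(block, round_half_even)
codes_up, _, _ = quantize_block(block, round_half_up)
print(codes_even, codes_up)
```

Both results are valid quantizations of the same weights, yet two of the four codes differ by one step.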
+---
+## Conversion Guide: Nemotron-3 LoRA to ONNX Q4 for WebGPU
+### Prerequisites
+- Fine-tuned merged model in safetensors format
+- Reference ONNX Q4 model from `onnx-community/NVIDIA-Nemotron-3-Nano-4B-BF16-ONNX`
+- Python with `onnx`, `safetensors`, `numpy`, `huggingface_hub`
+- Know which layers your LoRA modified (check `adapter_config.json` → `target_modules`)
+### Step 1: Identify LoRA Target Layers
+```python
+import json
+from huggingface_hub import hf_hub_download
+cfg = json.load(open(hf_hub_download('your-repo', 'adapter/adapter_config.json')))
+print(f"LoRA targets: {cfg['target_modules']}")
+# e.g., ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'up_proj', 'down_proj', 'gate_proj']
+```
+### Step 2: Download Reference Q4 Model
+```python
+from huggingface_hub import snapshot_download
+snap = snapshot_download(
+    'onnx-community/NVIDIA-Nemotron-3-Nano-4B-BF16-ONNX',
+    allow_patterns='onnx/model_q4*'
+)
+```
+### Step 3: Copy Reference Files as Base
+```python
+import shutil
+from pathlib import Path
+ONNX_Q4_BASE = Path(snap) / 'onnx'
+OUT_DIR = Path('/tmp/onnx-output/onnx')
+OUT_DIR.mkdir(parents=True, exist_ok=True)
+for f in ONNX_Q4_BASE.iterdir():
+    if 'model_q4' in f.name:
+        shutil.copy2(f, OUT_DIR / f.name)
+```
+### Step 4: Load Merged Safetensors
+```python
+import torch
+from safetensors import safe_open
+st_tensors = {}
+with safe_open('models/merged/model.safetensors', framework='pt', device='cpu') as f:
+    for k in f.keys():
+        st_tensors[k] = f.get_tensor(k).float().numpy()
+```
+### Step 5: Name Mapping (ONNX ↔ Safetensors)
+Nemotron-H has a non-standard naming convention:
+| ONNX Name | Safetensors Name |
+|-----------|-----------------|
+| `model.layers.N.input_layernorm.weight` | `backbone.layers.N.norm.weight` |
+| `model.layers.N.mamba.{component}` | `backbone.layers.N.mixer.{component}` |
+| `model.layers.N.attn.{component}` | `backbone.layers.N.mixer.{component}` |
+| `model.layers.N.mlp.{component}` | `backbone.layers.N.mixer.{component}` |
+| `model.embed_tokens.weight` | `backbone.embedding.weight` |
+| `model.norm.weight` | `backbone.norm_f.weight` |
+For Q4 tensors, the ONNX names use underscores:
+- `model_layers_0_mamba_in_proj_MatMul_weight_quant` → `backbone.layers.0.mixer.in_proj.weight`
+```python
+import re
+def map_float_name(n, st_names):
+    """Map ONNX float tensor name to safetensors name."""
+    if n.startswith('/') or any(x in n for x in [
+            'INT64','FLOAT','constants','expanded','unsqueezed',
+            'squeezed','neg_exp','f32','split_sizes']):
+        return None
+    m = n
+    m = m.replace('model.embed_tokens.weight', 'backbone.embedding.weight')
+    m = re.sub(r'^model\.norm\.weight$', 'backbone.norm_f.weight', m)
+    m = re.sub(r'^model\.lm_head', 'lm_head', m)
+    m = re.sub(r'^model\.layers\.(\d+)\.input_layernorm\.weight',
+               r'backbone.layers.\1.norm.weight', m)
+    m = re.sub(r'^model\.layers\.(\d+)\.pre_ff_layernorm\.weight',
+               r'backbone.layers.\1.norm2.weight', m)
+    m = m.replace('model.layers.', 'backbone.layers.')
+    m = m.replace('.mamba.', '.mixer.').replace('.attn.', '.mixer.').replace('.mlp.', '.mixer.')
+    m = re.sub(r'\.MatMul\.weight$', '.weight', m)
+    return m if m in st_names else None
+def map_q4_base_to_st(base_name, st_names):
+    """Map Q4 initializer base name to safetensors weight name."""
+    m = re.sub(r'_weight$', '', base_name)
+    if m == 'lm_head_MatMul':
+        return 'lm_head.weight'
+    if m == 'model_embed_tokens':
+        return 'backbone.embedding.weight'
+    match = re.match(r'model_layers_(\d+)_(mamba|attn|mlp)_(.+)_MatMul$', m)
+    if not match:
+        return None
+    layer, sub, comp = match.groups()
+    st_name = f'backbone.layers.{layer}.mixer.{comp}.weight'
+    return st_name if st_name in st_names else None
+```
+### Step 6: Quantization Function
+```python
+import numpy as np
+def quantize_to_matmulnbits(weight_f32, N, K, block_size=32):
+    """Asymmetric uint4 block quantization matching ORT MatMulNBits format."""
+    w = weight_f32.astype(np.float32)
+    assert w.shape == (N, K)
+    n_blocks = (K + block_size - 1) // block_size
+    K_pad = n_blocks * block_size
+    if K_pad > K:
+        w = np.pad(w, ((0, 0), (0, K_pad - K)))
+    w_blocks = w.reshape(N, n_blocks, block_size)
+    block_min = w_blocks.min(axis=-1)
+    block_max = w_blocks.max(axis=-1)
+    scales = (block_max - block_min) / 15.0
+    scales = np.where(scales == 0, 1.0, scales).astype(np.float32)
+    zp_float = -block_min / scales
+    zp = np.round(zp_float).clip(0, 15).astype(np.uint8)
+    q = np.round(w_blocks / scales[:, :, np.newaxis] + zp[:, :, np.newaxis])
+    q = q.clip(0, 15).astype(np.uint8)
+    # Pack two nibbles per byte (low nibble first)
+    q_pairs = q.reshape(N, n_blocks, block_size // 2, 2)
+    packed = (q_pairs[..., 0] | (q_pairs[..., 1] << 4)).astype(np.uint8)
+    # Pack zero points as nibbles
+    n_zp_pairs = (n_blocks + 1) // 2
+    zp_packed = np.zeros((N, n_zp_pairs), dtype=np.uint8)
+    for i in range(n_blocks):
+        byte_idx = i // 2
+        if i % 2 == 0:
+            zp_packed[:, byte_idx] |= zp[:, i]
+        else:
+            zp_packed[:, byte_idx] |= (zp[:, i] << 4)
+    return packed, scales, zp_packed
+```
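The "low nibble first" packing used for both the weights and the zero points is easy to get backwards. A small pure-Python sketch of the same convention, with hypothetical helper names, shows the byte layout and its inverse:

```python
def pack_nibbles(codes):
    """Pack 4-bit codes two per byte, low nibble first; pad an odd tail with 0."""
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("codes must be in 0..15")
    padded = list(codes) + [0] * (len(codes) % 2)
    return bytes(lo | (hi << 4) for lo, hi in zip(padded[0::2], padded[1::2]))


def unpack_nibbles(packed, count):
    """Inverse of pack_nibbles: recover `count` 4-bit codes."""
    codes = []
    for byte in packed:
        codes.extend((byte & 0x0F, byte >> 4))
    return codes[:count]


codes = [1, 15, 3, 0, 7]
packed = pack_nibbles(codes)
print(packed.hex())
```

A round trip through `unpack_nibbles` is a cheap way to check that a patched tensor was written in the expected order.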
+### Step 7: Patch Weights (LoRA Targets Only!)
+**CRITICAL**: Only re-quantize layers that LoRA modified. Keep reference weights for everything else.
+```python
+import onnx
+from onnx.external_data_helper import ExternalDataInfo
+LORA_TARGETS = {'q_proj', 'k_proj', 'v_proj', 'o_proj', 'up_proj', 'down_proj', 'gate_proj'}
+model = onnx.load(str(OUT_DIR / 'model_q4.onnx'), load_external_data=False)
+# Get MatMulNBits attributes
+matmul_attrs = {}
+for node in model.graph.node:
+    if node.op_type == 'MatMulNBits' and len(node.input) >= 2:
+        d = {a.name: a.i for a in node.attribute}
+        matmul_attrs[node.input[1]] = (d.get('N', 0), d.get('K', 0), d.get('block_size', 32))
+# Patch float tensors (layernorms, conv1d, biases — always safe to patch)
+for init in model.graph.initializer:
+    if any(init.name.endswith(s) for s in ['_quant', '_scales', '_zp']):
+        continue
+    st_name = map_float_name(init.name, set(st_tensors.keys()))
+    if st_name is None:
+        continue
+    ext = ExternalDataInfo(init)
+    if not ext.location:
+        continue
+    our_arr = st_tensors[st_name]
+    onnx_dtype = {1: np.float32, 10: np.float16}.get(init.data_type, np.float32)
+    our_bytes = our_arr.astype(onnx_dtype).tobytes()
+    if ext.length and len(our_bytes) != ext.length:
+        continue
+    with open(OUT_DIR / ext.location, 'r+b') as f:
+        f.seek(ext.offset or 0)
+        f.write(our_bytes)
+# Patch Q4 tensors — ONLY LoRA targets
+q4_groups = {}
+for init in model.graph.initializer:
+    if any(init.name.endswith(s) for s in ['_quant', '_scales', '_zp']):
+        base = re.sub(r'_(quant|scales|zp)$', '', init.name)
+        q4_groups.setdefault(base, {})[init.name[len(base)+1:]] = init
+for base_name, group in q4_groups.items():
+    if 'quant' not in group:
+        continue
+    # SKIP non-LoRA layers — keep reference weights!
+    if not any(target in base_name for target in LORA_TARGETS):
+        continue
+    st_name = map_q4_base_to_st(base_name, set(st_tensors.keys()))
+    if st_name is None:
+        continue
+    quant_init = group['quant']
+    N, K, bs = matmul_attrs.get(quant_init.name, (0, 0, 32))
+    if N == 0:
+        q_dims = list(quant_init.dims)
+        if len(q_dims) == 3:
+            N, n_b, half_bs = q_dims
+            bs = half_bs * 2
+            K = n_b * bs
+    weight = st_tensors[st_name]
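+    # HF Linear weights are (out_features, in_features) = (N, K); transpose only when
+    # the tensor is stored the other way. Checking (N, K) first keeps square matrices intact.
+    if weight.shape != (N, K) and weight.shape == (K, N):
+        weight = weight.T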
+    packed, scales, zp_packed = quantize_to_matmulnbits(weight, N, K, bs)
+    # Write quant, scales, zp
+    for suffix, data in [('quant', packed), ('scales', scales), ('zp', zp_packed)]:
+        if suffix not in group:
+            continue
+        ext = ExternalDataInfo(group[suffix])
+        data_bytes = data.tobytes()
+        if ext.length and len(data_bytes) == ext.length:
+            with open(OUT_DIR / ext.location, 'r+b') as f:
+                f.seek(ext.offset or 0)
+                f.write(data_bytes)
+```
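Before writing bytes in place, it can help to preview which tensors the substring filter will select. The sketch below mirrors the `LORA_TARGETS` check from Step 7; the sample names are hypothetical and only follow the `model_layers_N_<sub>_<comp>_MatMul_weight` pattern described in Step 5.

```python
LORA_TARGETS = {"q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj", "gate_proj"}


def split_by_lora_target(base_names, targets=LORA_TARGETS):
    """Partition Q4 initializer base names into (to_patch, to_keep) by substring match."""
    to_patch, to_keep = [], []
    for name in base_names:
        (to_patch if any(t in name for t in targets) else to_keep).append(name)
    return to_patch, to_keep


# Hypothetical names for illustration only.
names = [
    "model_layers_0_mamba_in_proj_MatMul_weight",
    "model_layers_5_attn_q_proj_MatMul_weight",
    "model_layers_6_mlp_up_proj_MatMul_weight",
    "lm_head_MatMul_weight",
]
to_patch, to_keep = split_by_lora_target(names)
print(len(to_patch), "patched;", len(to_keep), "kept from reference")
```

Comparing the size of `to_patch` with the expected number of LoRA-modified tensors is a quick sanity check before any file is modified.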
+### Step 8: Upload to HuggingFace
+```python
+from huggingface_hub import HfApi
+api = HfApi()
+api.upload_folder(
+    folder_path=str(OUT_DIR),
+    path_in_repo="onnx",
+    repo_id="your-repo",
+    repo_type="model",
+    commit_message="LoRA-only Q4 patch from reference base"
+)
+```
+### Step 9: Set Model Config
+Ensure these files exist in your HF repo:
+**config.json** — must include:
+```json
+{
+  "transformers.js_config": {
+    "use_external_data_format": {
+      "model_q4.onnx": 2
+    }
+  }
+}
+```
+**generation_config.json** — must include:
+```json
+{
+  "eos_token_id": [2, 11]
+}
+```
+**tokenizer.json** + **tokenizer_config.json** — use the reference model's tokenizer (same vocab, includes `chat_template`).
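Since each missing setting produced a different failure, a small preflight check over the three config files may save a debugging round. This is a sketch with a hypothetical `preflight` helper; it takes already-parsed dicts and only checks the settings named in this step.

```python
def preflight(config: dict, generation_config: dict, tokenizer_config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    ext = config.get("transformers.js_config", {}).get("use_external_data_format")
    if not ext:
        problems.append("config.json: missing transformers.js_config.use_external_data_format")
    eos = generation_config.get("eos_token_id")
    eos_ids = [eos] if isinstance(eos, int) else list(eos or [])
    if 11 not in eos_ids:
        problems.append("generation_config.json: eos_token_id lacks 11 (<|im_end|>)")
    if not tokenizer_config.get("chat_template"):
        problems.append("tokenizer_config.json: chat_template is not set")
    return problems


good = preflight(
    {"transformers.js_config": {"use_external_data_format": {"model_q4.onnx": 2}}},
    {"eos_token_id": [2, 11]},
    {"chat_template": "{{ messages }}"},
)
bad = preflight({}, {"eos_token_id": 2}, {})
print(good, len(bad))
```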
+### Step 10: Test with WebGPU Test Page
+Create a standalone HTML test page (see `dist/test-webgpu.html` in the Space repo) that:
+1. Loads the model with `pipeline('text-generation', modelId, { dtype: 'q4', device: 'webgpu' })`
+2. Tests both `enable_thinking: true` and `false`
+3. Checks for `</think>` in output
+4. Reports chunk count and content
+---
+## Key Lessons
+### 1. Never re-quantize unchanged layers
+If LoRA only touches attention/MLP projections, keep the reference's Q4 weights for Mamba, embedding, and other untouched layers. Re-quantization introduces precision differences that break WebGPU inference.
+### 2. WebGPU ≠ CPU for quantized models
+Float16 compute on WebGPU amplifies tiny quantization differences. Always test on actual WebGPU hardware, not just CPU/WASM.
+### 3. The reference model is your ground truth
+Start from the reference ONNX and make minimal changes. Compare behavior at every step.
+### 4. Build a WebGPU test harness early
+A standalone HTML test page that runs `enable_thinking: true/false` with both models saves hours of manual testing.
+### 5. Config files matter
+Missing `transformers.js_config`, wrong `eos_token_id`, wrong `tokenizer.json` — each caused distinct failures. Use the reference model's configs as a template and only change what's necessary.
+### 6. Browser cache is sticky
+`env.useBrowserCache = true` caches model files aggressively. When debugging, clear Cache Storage (not just regular cache) or use incognito mode.
+---
+## File Inventory
+| File | Source | Notes |
+|------|--------|-------|
+| `onnx/model_q4.onnx` | Reference | Graph structure (unchanged) |
+| `onnx/model_q4.onnx_data` | Mixed | Reference base + LoRA patches |
+| `onnx/model_q4.onnx_data_1` | Mixed | Reference base + LoRA patches |
+| `config.json` | Modified | Added `transformers.js_config` |
+| `generation_config.json` | Modified | `eos_token_id: [2, 11]` |
+| `tokenizer.json` | Reference | Must match ONNX vocab |
+| `tokenizer_config.json` | Reference | Includes `chat_template` |
+## Scripts
+| Script | Purpose |
+|--------|---------|
+| `scripts/patch_q4_inplace.py` | Original (BROKEN) — re-quantizes all layers |
+| `scripts/patch_q4_loraonly.py` | Fixed — only patches LoRA target layers |
+| `dist/test-webgpu.html` | WebGPU test harness for both models |

docs/OPTION2_SFT_DISTILLATION_PLAN.md ADDED

	@@ -0,0 +1,193 @@

+# Option 2: SFT Distillation Plan
+> Date: 2026-03-24
+> Status: Planning
+## Overview
+Use Nemotron-4B as a **teacher** (via llama.cpp) to generate high-quality interviewer data. Train a **student** model (standard transformer, no Mamba-2) on that data via SFT. Deploy the student on GB10.
+## Why This Works
+| Component | Tool | SM 12.1 | Status |
+|-----------|------|---------|--------|
+| Teacher generation | llama.cpp | ✅ | Proven — 4.35/5 eval score |
+| Reward filtering | Python (CPU) | ✅ | No GPU needed |
+| Student training | PyTorch + HF | ✅ | Standard transformer, no custom kernels |
+| Student inference | llama.cpp or vLLM | ✅ | Both work on GB10 |
+No mocks. No CUDA kernel workarounds. Every component is production-ready on GB10.
+## Pipeline
+```
+Phase 1: Data Generation
+    Nemotron-4B (llama.cpp, BF16 GGUF)
+    + Interview prompts from dataset (7,580 segments)
+    + System prompt: "You are Lex Fridman, an AI interviewer"
+    → Generate 5-10 completions per prompt
+    → ~50K raw completions
+Phase 2: Filtering
+    Score each completion with reward function
+    + Question quality (asks insightful questions?)
+    + Brevity (concise, not lecturing?)
+    + Relevance (follows conversation?)
+    + Style (sounds like an interviewer?)
+    → Keep completions scoring ≥ 4/5
+    → Target: 5K-10K high-quality examples
+Phase 3: Student Training
+    Pick student model (see model selection below)
+    SFT on filtered dataset
+    → Standard HF Trainer, no special kernels needed
+Phase 4: Evaluation
+    Run eval suite against student model
+    Compare with Nemotron-4B base (4.35/5 target)
+    Iterate on data quality / student model
+```
+## Model Selection for Student
+### Criteria
+- Standard transformer architecture (no Mamba, no custom CUDA)
+- ~4B parameters (fits on GB10 with room for training)
+- Good instruction-following base
+- Works with PyTorch on SM 12.1
+### Candidates
+| Model | Params | Arch | GRPO-trainable on GB10? | Notes |
+|-------|--------|------|------------------------|-------|
+| Qwen3.5-4B | 4B | Transformer | ✅ Yes (tested vLLM + PyTorch) | MoE routing sensitive to quant |
+| Qwen3.5-3B-A0.6B | 3B (0.6B active) | MoE | ✅ Yes | Very fast inference |
+| Llama-3.2-3B | 3B | Transformer | ✅ Yes | Proven architecture |
+| Gemma-3-4B | 4B | Transformer | ✅ Yes | Strong multilingual |
+| Phi-4-mini (3.8B) | 3.8B | Transformer | ✅ Yes | Good reasoning |
+**Recommendation**: Start with **Qwen3.5-4B** — same size as Nemotron, standard transformer, already tested on our eval suite (scored 3.55/5 at Q8, potential to improve with targeted SFT).
+### Why Not Just SFT Nemotron?
+We already tried SFT on Nemotron-4B directly (v1-v5). Best score: 3.20/5. The Mamba-2 architecture made LoRA ineffective (only 4 attention layers got meaningful updates). SFT distillation to a standard transformer lets LoRA/full-finetune work on ALL layers.
+## Data Generation Details
+### System Prompt
+```
+You are an expert AI interviewer in the style of Lex Fridman. You ask
+thoughtful, probing questions that explore deep ideas. Your questions are:
+- Concise (under 50 words)
+- Open-ended (encourage the guest to think deeply)
+- Build on what the guest just said
+- Occasionally surprising or from unexpected angles
+Do not lecture. Do not summarize. Just ask the next question.
+```
+### Prompt Format
+Each prompt is a conversation turn from our interview dataset:
+```
+Guest: [previous guest response from interview_segments_v2.jsonl]
+Interviewer:
+```
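The prompt format above can be produced mechanically from the segment file. A minimal sketch follows; the JSONL key `guest` is a hypothetical field name, since the schema of `interview_segments_v2.jsonl` is not shown here.

```python
import json


def build_prompt(guest_text: str) -> str:
    """Format one guest turn as the generation prompt."""
    return f"Guest: {guest_text.strip()}\nInterviewer:"


def prompts_from_jsonl(lines, field="guest"):
    """Yield prompts from JSONL lines; `field` is a hypothetical key name."""
    for line in lines:
        if line.strip():
            record = json.loads(line)
            if record.get(field):
                yield build_prompt(record[field])


sample = ['{"guest": "  Consciousness might be substrate independent. "}', "", '{"other": 1}']
print(list(prompts_from_jsonl(sample)))
```

Blank lines and records without the expected field are skipped rather than raising, which keeps a long generation run from stopping on one malformed segment.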
+### Generation Parameters
+```yaml
+temperature: 0.8      # some diversity
+top_p: 0.95
+max_tokens: 200
+n_per_prompt: 8       # 8 completions per prompt for diversity
+```
+### Reward Function (for filtering)
+```python
+def score_completion(prompt, completion):
+    score = 0
+    # Must ask a question
+    if "?" in completion: score += 1
+    # Brevity (under 50 words is excellent)
+    words = len(completion.split())
+    if words < 30: score += 1.5
+    elif words < 50: score += 1.0
+    elif words < 80: score += 0.5
+    elif words > 150: score -= 1.0
+    # No template/meta patterns
+    if not any(p in completion.lower() for p in
+        ["用户问", "user asks", "the user", "question:", "as an ai"]):
+        score += 0.5
+    # Ends with a question
+    if completion.strip().endswith("?"):
+        score += 1.0
+    # At most two question marks (not a list of questions)
+    if completion.count("?") <= 2:
+        score += 0.5
+    # Doesn't start with a filler opener; strip punctuation so "Sure," matches "sure"
+    first_word = completion.strip().split()[0].lower().strip(".,!?:;") if completion.strip() else ""
+    if first_word not in ["sure", "great", "absolutely", "definitely", "certainly"]:
+        score += 0.5
+    return min(5.0, max(0.0, score))
+```
+**Keep threshold: score ≥ 3.5** (top ~30-40% of completions)
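Applying the threshold is a one-pass filter. The sketch below takes the scorer as a parameter so it stays independent of the exact reward function; `toy_scorer` is a stand-in for demonstration only and is not the scoring function above.

```python
def filter_completions(samples, scorer, threshold=3.5):
    """Keep (prompt, completion, score) triples whose score meets the threshold, best first."""
    scored = [(p, c, scorer(p, c)) for p, c in samples]
    kept = [row for row in scored if row[2] >= threshold]
    return sorted(kept, key=lambda row: row[2], reverse=True)


def toy_scorer(prompt, completion):
    """Stand-in scorer for the demo only: rewards a trailing question mark and brevity."""
    score = 2.0 if completion.strip().endswith("?") else 0.0
    score += 2.0 if len(completion.split()) < 30 else 0.0
    return score


samples = [
    ("Guest: ...\nInterviewer:", "What surprised you most about that result?"),
    ("Guest: ...\nInterviewer:", "That is a great point and I want to summarize it."),
]
kept = filter_completions(samples, toy_scorer)
print(len(kept), "of", len(samples), "kept")
```

Logging the kept fraction per batch makes it easy to compare against the expected 30-40% retention.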
+## Training Details
+### SFT Configuration
+```yaml
+model: Qwen/Qwen3.5-4B  # or chosen student
+epochs: 3
+batch_size: 4
+learning_rate: 2e-5
+max_length: 512
+warmup_steps: 100
+weight_decay: 0.01
+lora:
+  rank: 128
+  alpha: 256
+  target: all-linear
+```
+### Estimated Resources
+- Data generation: ~50K completions × 200 tokens ÷ ~30 tok/s = ~90 hours (can parallelize)
+  - OR: reduce to 10K completions = ~18 hours
+- Filtering: minutes (CPU)
+- SFT training: ~2-4 hours on GB10 (proven from prior runs)
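The generation estimates follow from total tokens divided by throughput. A short sketch of that arithmetic, assuming sequential generation at a steady 30 tok/s and a full 200 tokens per completion (both upper-bound assumptions):

```python
def generation_hours(n_completions: int, tokens_each: int = 200, tok_per_s: float = 30.0) -> float:
    """Sequential wall-clock hours: total tokens divided by throughput."""
    return n_completions * tokens_each / tok_per_s / 3600


for n in (50_000, 10_000, 4_000):
    print(f"{n:>6} completions: {generation_hours(n):.1f} h")
```

Shorter completions or parallel llama.cpp instances would reduce these figures proportionally.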
+### Speedup: Reduce Generation Scope
+Instead of all 7,580 prompts × 8 generations = 60K:
+- **Phase 1**: 500 prompts × 8 = 4,000 completions (~7 hours)
+- Filter → ~1,500 high-quality examples
+- SFT → evaluate
+- **Phase 2**: If promising, scale to full dataset
+## Success Criteria
+- Student model eval score ≥ 4.0/5 (approaching Nemotron base's 4.35)
+- Completion quality: concise questions, no lecturing
+- Consistent interviewer persona
+## Timeline
+| Phase | Time | Output |
+|-------|------|--------|
+| Generate 4K completions | ~7 hours | raw completions |
+| Filter + format | ~30 min | SFT dataset |
+| SFT training | ~3 hours | trained student |
+| Evaluation | ~1 hour | eval scores |
+| **Total Phase 1** | **~12 hours** | **first result** |
+## Risks
+| Risk | Mitigation |
+|------|------------|
+| Student can't match teacher quality | Try multiple student models; increase data |
+| Teacher generates repetitive data | High temperature + diverse prompts |
+| Reward function too noisy | Manual review of 100 samples first |
+| SFT overfits to teacher quirks | Early stopping, validation split |

docs/RETROSPECTIVE_2026-03-31.md ADDED

	@@ -0,0 +1,168 @@

+# Lex Fridman Interviewer — Data-Driven Retrospective
+*2026-03-31 | All experiments evaluated on held-out eval (n=25 prompts)*
+---
+## Final Leaderboard (functional judge: mean of on_topic, uses_guest, probing)
+| Rank | Model | Score | Δbase | on_topic | uses_guest | probing | Words |
+|------|-------|-------|-------|----------|------------|---------|-------|
+| 🥇 | **GRPO v12** (LR=5e-6, cosine, 200 steps) | **0.760** | +0.107 | 72% | 60% | 96% | 15.2 |
+| 2 | GRPO v13 (LR=2e-5, constant, 300 steps) | 0.773 | +0.120 | 88% | 52% | 92% | 16.4 |
+| 3 | LoRA v1 (r=64, LR=2e-4, 1ep) | 0.733 | +0.080 | 72% | 56% | 92% | 14.8 |
+| 4 | GRPO v14 (LR=1e-5, constant, 300 steps) | 0.707 | +0.054 | 68% | 52% | 92% | 15.5 |
+| 5 | Base Nemotron 4B | 0.653 | — | 64% | 48% | 84% | 14.5 |
+| 6 | SFT v5 (LoRA r≈16, 3ep) | 0.667 | +0.014 | 76% | 60% | 64% | 15.2 |
+**Best balanced model: GRPO v12 (0.760)** — highest uses_guest (60%) + probing (96%).
+GRPO v13 has higher raw score (0.773) via on_topic (+16pp) but uses_guest regressed.
+---
+## Math: Score Decomposition
+The functional score is a per-prompt mean of the three binary checks:
+```
+score_i = (on_topic_i + uses_guest_i + probing_i) / 3  ∈ {0, 1/3, 2/3, 1}
+Score   = mean(score_i)                                ∈ [0, 1]
+```
+By linearity of expectation, `Score = (P(OT) + P(UG) + P(PR)) / 3` exactly, with no independence assumption (GRPO v12: (0.72 + 0.60 + 0.96) / 3 = 0.760).
+The product `P(OT) × P(UG) × P(PR)` is a different quantity: the expected all-three-pass rate if the dimensions were independent. The observed score / product ratio shows how far the score sits above that strict estimate:
+| Model | Observed | P(OT)×P(UG)×P(PR) | Ratio | Meaning |
+|-------|----------|-------------------|-------|---------|
+| Base | 0.653 | 0.258 | **2.53** | Largest gap |
+| SFT v5 | 0.667 | 0.292 | **2.28** | Large gap |
+| LoRA v1 | 0.733 | 0.371 | **1.98** | Moderate gap |
+| GRPO v12 | 0.760 | 0.415 | **1.83** | Smallest gap |
+| GRPO v13 | 0.773 | 0.421 | **1.84** | Smallest gap |
+| GRPO v14 | 0.707 | 0.325 | **2.17** | Moderate gap |
+**Insight**: The ratio depends only on the marginal rates, so on its own it says nothing
+about per-prompt co-occurrence; it falls toward 1.0 only as all three rates approach 100%.
+GRPO v12/v13 have the lowest ratios because their marginal rates are the highest. Further gains
+must come from improving the marginal rates themselves.
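The leaderboard numbers can be cross-checked directly. In this sketch (rates copied from the leaderboard table; helper names are illustrative), the arithmetic mean of the three pass rates reproduces each reported score to three decimals, while the product gives the middle column of the ratio table.

```python
# Marginal pass rates (on_topic, uses_guest, probing) and reported scores from the leaderboard.
LEADERBOARD = {
    "GRPO v12": ((0.72, 0.60, 0.96), 0.760),
    "GRPO v13": ((0.88, 0.52, 0.92), 0.773),
    "LoRA v1": ((0.72, 0.56, 0.92), 0.733),
    "GRPO v14": ((0.68, 0.52, 0.92), 0.707),
    "Base": ((0.64, 0.48, 0.84), 0.653),
    "SFT v5": ((0.76, 0.60, 0.64), 0.667),
}


def mean_of_rates(rates):
    return sum(rates) / len(rates)


def product_of_rates(rates):
    result = 1.0
    for r in rates:
        result *= r
    return result


for model, (rates, reported) in LEADERBOARD.items():
    mean, prod = mean_of_rates(rates), product_of_rates(rates)
    print(f"{model:9s} mean={mean:.3f} product={prod:.3f} ratio={mean / prod:.2f} reported={reported}")
```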
+---
+## Math: Conditional Probabilities
+**Key structural finding:**
+```
+P(UG=T | OT=T) >> P(UG=T | OT=F)  for ALL models
+```
+| Model | P(UG \| OT) | P(UG \| ¬OT) | Lift | UG baseline |
+|-------|-----------|------------|------|-------------|
+| Base | 68.75% | 11.11% | 1.43× | 48% |
+| LoRA v1 | 72.22% | 14.29% | 1.29× | 56% |
+| GRPO v12 | **77.78%** | 14.29% | 1.30× | 60% |
+| GRPO v13 | 59.09% | 0.00% | 1.14× | 52% |
+| GRPO v14 | 76.47% | 0.00% | 1.47× | 52% |
+**The OT→UG transition is almost the only path.** UG=True with OT=False occurs on at most one prompt per model (Base, LoRA v1, GRPO v12) and never for GRPO v13/v14.
+This means: improving uses_guest requires first getting on-topic, then adding specificity.
+Training that gains OT (v13) but loses UG has taught the model to be on-topic generically.
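The table entries follow from simple counts over the 25 prompts. The sketch below recomputes the Base row; the counts (11 of 16 on-topic prompts use the guest, 1 of 9 off-topic prompts does) are inferred from the reported percentages rather than taken from raw eval logs, and `conditional_stats` is a hypothetical helper.

```python
def conditional_stats(ug_and_ot: int, ot: int, ug_and_not_ot: int, n: int = 25) -> dict:
    """Compute P(UG|OT), P(UG|not OT), the UG base rate, and lift = P(UG|OT) / P(UG)."""
    not_ot = n - ot
    p_ug_given_ot = ug_and_ot / ot
    p_ug_given_not_ot = ug_and_not_ot / not_ot if not_ot else 0.0
    p_ug = (ug_and_ot + ug_and_not_ot) / n
    return {
        "p_ug_given_ot": p_ug_given_ot,
        "p_ug_given_not_ot": p_ug_given_not_ot,
        "p_ug": p_ug,
        "lift": p_ug_given_ot / p_ug,
    }


# Counts inferred from the Base row: 11/16 = 68.75%, 1/9 = 11.11%, n = 25.
base = conditional_stats(ug_and_ot=11, ot=16, ug_and_not_ot=1)
print({k: round(v, 4) for k, v in base.items()})
```

Note that lift here is measured against the overall uses_guest rate, not against P(UG | ¬OT).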
+---
+## GRPO Training Signal Analysis
+### Formula: GRPO gradient signal strength
+```
+signal_proxy = LR × mean(reward_std)
+GRPO advantage for completion i:
+  A_i = (r_i - mean(r_j)) / std(r_j)
+Policy gradient update:
+  ΔW ∝ LR × Σ A_i × ∇log P(completion_i | prompt)
+```
+| Run | LR | mean(reward_std) | signal_proxy | reward Δ | eval Δ | efficiency |
+|-----|-----|-----------------|--------------|----------|--------|------------|
+| GRPO v12 | 5e-6 | 0.158 | **7.9e-7** | +0.113 | **+0.027** | 0.24 |
+| GRPO v13 | 2e-5 | 0.188 | **3.8e-6** | +0.088 | +0.013 | 0.15 |
+| GRPO v14 | 1e-5 | 0.173 | **1.7e-6** | +0.092 | −0.053 | **−0.58** |
+**Signal-to-eval efficiency (eval Δ / reward Δ)** reveals how well reward improvements
+translate to actual quality gains:
+- v12: 0.24, best efficiency despite lowest signal strength
+- v13: 0.15, ~4.8× stronger signal (4× the LR), but less efficient (LR overshot uses_guest)
+- v14: −0.58, reward improved but eval degraded (reward_v13 misaligned)
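Both derived columns in the table above are straightforward to recompute. A minimal sketch using the table's own inputs (the dictionary layout and function names are illustrative):

```python
# (LR, mean reward_std, reward delta, eval delta) copied from the table above.
RUNS = {
    "GRPO v12": (5e-6, 0.158, 0.113, 0.027),
    "GRPO v13": (2e-5, 0.188, 0.088, 0.013),
    "GRPO v14": (1e-5, 0.173, 0.092, -0.053),
}


def signal_proxy(lr: float, reward_std: float) -> float:
    """LR times mean within-group reward std, a rough proxy for update size."""
    return lr * reward_std


def efficiency(eval_delta: float, reward_delta: float) -> float:
    """How much eval gain each unit of training-reward gain bought."""
    return eval_delta / reward_delta


for run, (lr, std, d_reward, d_eval) in RUNS.items():
    print(f"{run}: signal={signal_proxy(lr, std):.2e} efficiency={efficiency(d_eval, d_reward):+.2f}")
```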
+### v14 paradox: entropy ↑ but eval ↓
+GRPO v14 showed the largest entropy increase (+0.72 nats, +46%) and KL decrease (−0.69).
+This combination means: model explored more diverse phrasings AND moved back toward the
+reference distribution. Yet eval dropped from 0.760 → 0.707.
+**Diagnosis**: reward_v13 (geometric mean OT×UG×PR) was harder to optimize than reward_v12.
+The geometric mean penalizes ANY weak dimension equally — if a rollout has OT=high but
+UG=medium and PR=high, the geometric mean pulls the reward down more than v12's
+asymmetric formula would. The model's entropy increased (exploring) but the exploration
+wasn't converting to better joint OT∧UG outcomes.
+---
+## Per-Prompt Analysis: Hard Ceiling
+Across all 6 models on 25 prompts:
+- **18/25 prompts (72%)** have been answered with OT∧UG=True by at least one model
+- **7/25 prompts (28%)** are structurally hard — no model has ever achieved OT∧UG=True
+Hard prompts share a pattern: vague/content-light guest statements where there's
+insufficient specific content to reference ("weasel and furry coats", "no wars when
+I was president", "that's right exactly right").
+**Theoretical ceiling**: If we achieve OT∧UG=True on all 18 reachable prompts with
+probing=92%, the all-three-pass rate is at most `18/25 × 0.92 + 7/25 × 0 ≈ 0.662`. Yet GRPO v12
+achieves 0.760, above this. The discrepancy arises because 0.662 counts only prompts that pass
+all three checks, while the actual score also gives partial credit, e.g. probing-only wins (OT=F, UG=F, PR=T) earn 1/3.
+**Corrected ceiling** (with probing-only partial credit):
+```
+max_score ≈ P(all 3) × 1.0 + P(exactly 2) × (2/3) + P(exactly 1) × (1/3)
+```
+GRPO v12 is near-optimal for its current marginal rates.
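The partial-credit formula is just a regrouping of the per-prompt mean. The sketch below checks that equivalence on a hypothetical 25-prompt outcome; the counts are invented for illustration and do not correspond to any model in the leaderboard.

```python
from collections import Counter


def functional_score(results):
    """Mean over prompts of the fraction of checks passed; results are (ot, ug, pr) booleans."""
    return sum(sum(r) / 3 for r in results) / len(results)


def score_from_pass_counts(results):
    """Same score via P(all 3)*1 + P(exactly 2)*(2/3) + P(exactly 1)*(1/3)."""
    counts = Counter(sum(r) for r in results)
    n = len(results)
    return counts[3] / n + counts[2] / n * (2 / 3) + counts[1] / n * (1 / 3)


# Hypothetical outcome: 12 pass everything, 6 are on-topic and probing only,
# 4 are probing only, 3 fail all checks.
results = (
    [(True, True, True)] * 12
    + [(True, False, True)] * 6
    + [(False, False, True)] * 4
    + [(False, False, False)] * 3
)
print(round(functional_score(results), 3), round(score_from_pass_counts(results), 3))
```

In this example only 12/25 = 0.48 of prompts pass all three checks, yet the score is higher because partially passing prompts still contribute.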
+---
+## Why uses_guest Plateaus at 52-60%
+The OT∧UG joint rate has been stuck at 13-14/25 across all GRPO runs:
+| Model | OT∧UG joint | Gained (vs base) | Lost (vs base) |
+|-------|------------|-----------------|----------------|
+| LoRA v1 | 13/25 | +5 prompts | −3 prompts |
+| GRPO v12 | **14/25** | +5 prompts | −2 prompts |
+| GRPO v13 | 13/25 | +4 prompts | −2 prompts |
+| GRPO v14 | 13/25 | +5 prompts | −3 prompts |
+All runs gain 4-5 prompts and lose 2-3 relative to base. The **net is always +2 to +3 vs base**, i.e. at most +1 over LoRA v1.
+This is a hard architectural constraint: the 4 attention layers (LoRA targets, 1% of params)
+can shift the marginal distribution by ~1 prompt worth of improvement per run.
+The 38 frozen Mamba layers hold the deeper "interview style" patterns. To break past
+14/25 OT∧UG, we would need either:
+1. Full fine-tuning (all 38 Mamba layers + 4 attention)
+2. More LoRA capacity (r=256 or above)
+3. A fundamentally different base model
+---
+## Conclusion
+**Best model for deployment: GRPO v12 (score=0.760)**
+The progression LoRA v1 → GRPO v12 → (plateau) tells a clear story:
+- LoRA v1 taught the style (SFT signal)
+- GRPO v12 refined the balance (RL signal, low LR, let reward accumulate)
+- Further GRPO iterations hit the 1% LoRA architectural ceiling
+The reward function is not the limiting factor. The training signal is not the limiting
+factor. The **LoRA rank and frozen Mamba architecture** are.
+To improve beyond 0.760 would require r=256+ LoRA or full fine-tuning — a different experiment.

docs/REWARD_V10_DESIGN.md ADDED

	@@ -0,0 +1,175 @@

+# Reward v10 Design Doc — Log-Ratio + NLI Hybrid
+**Status:** Design phase — experiments pending
+**Goal:** Verifiable, ungameable reward for Lex Fridman interviewer style transfer
+**Decision threshold:** Signal strength test must pass before v9 launch
+---
+## Architecture
+```
+reward(q, guest) = hard_gates(q)
+                    × (1.0 + r_depth + r_lex + r_brevity + r_specificity)
+Where:
+  r_depth       ← NLI entailment score (0–1.5)
+  r_lex         ← SFT log-ratio (0–1.5, clipped)
+  r_brevity     ← word count score (0–0.5)
+  r_specificity ← entity overlap with guest (0–0.5)
+  Total range   ← 0.0 (gated) or 1.0–5.0
+```
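The composition above can be sketched as a single function. This is an illustration of the formula only: the component values are passed in, the clipping ranges are the ones listed in the architecture block, and `reward_v10` is a hypothetical name.

```python
def clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def reward_v10(gates_pass: bool, r_depth: float, r_lex: float,
               r_brevity: float, r_specificity: float) -> float:
    """Gate times (1 + clipped components); 0.0 when any hard gate fires."""
    if not gates_pass:
        return 0.0
    return (
        1.0
        + clip(r_depth, 0.0, 1.5)
        + clip(r_lex, 0.0, 1.5)
        + clip(r_brevity, 0.0, 0.5)
        + clip(r_specificity, 0.0, 0.5)
    )


print(reward_v10(False, 1.0, 1.0, 0.5, 0.5))  # gated
print(reward_v10(True, 0.0, 0.0, 0.0, 0.0))   # floor for an ungated question
print(reward_v10(True, 9.9, 9.9, 9.9, 9.9))   # every component clipped to its cap
```

The multiplicative gate means no combination of component scores can rescue a question that trips a hard gate.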
+---
+## Component 1: Hard Gates (binary, pre-filter)
+Fast regex checks. Instant 0.0 if ANY fires.
+| Gate | Pattern | Rationale |
+|------|---------|-----------|
+| Stage directions | `^\*\(`, `^\[Lex` | Never appear in real interview questions |
+| Meta-commentary | `as lex fridman`, `as an interviewer` | Identity collapse |
+| Filler openers | `that's fascinating`, `great point` | Lex never starts with filler |
+| Ultra-generic | `what are your thoughts on that` | Not specific to guest |
+| Not a question | doesn't end with `?` | Lex always asks |
+| Too many `?` | count > 2 | Signals multiple weak questions |
+**Assumption:** These patterns are unambiguous and cannot be worked around while still producing good questions.
+**Test:** Run gates on 200 random Lex real questions → expect <1% false positive rate.
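The gate table above can be sketched as a single predicate. This is a minimal illustration, not the project's implementation: the pattern lists are copied from the table, anchoring the filler openers at the start of the question is an assumption drawn from the "never starts with filler" rationale, and the name `passes_hard_gates` is borrowed from the v11 reward sketch.

```python
import re

# Patterns copied from the gate table; the real list may be longer.
START_PATTERNS = [r"^\*\(", r"^\[Lex"]            # stage directions
PHRASE_PATTERNS = [
    r"as lex fridman", r"as an interviewer",      # meta-commentary
    r"what are your thoughts on that",            # ultra-generic
]
FILLER_OPENERS = [r"^that's fascinating", r"^great point"]  # assumed start-anchored


def passes_hard_gates(q: str) -> bool:
    """Return False if any gate fires (reward becomes 0.0), True otherwise."""
    text = q.strip()
    lower = text.lower()
    if any(re.search(p, text) for p in START_PATTERNS):
        return False
    if any(re.search(p, lower) for p in PHRASE_PATTERNS + FILLER_OPENERS):
        return False
    if not text.endswith("?"):        # not a question
        return False
    if text.count("?") > 2:           # too many question marks
        return False
    return True
```

Running this predicate over the 200 real Lex questions and counting `False` results would give the false positive rate that the test above asks for.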
+---
+## Component 2: NLI Depth (r_depth, 0–1.5)
+Model: `cross-encoder/nli-deberta-v3-small` (frozen, 86MB)
+```
+premise  = guest_answer
+hypothesis = candidate_question
+entailment   → r_depth = max(0, 0.5 - conf × 0.8)   # shallow restate
+neutral      → r_depth = 0.5 + conf × 0.5            # goes beyond
+contradiction → r_depth = 0.8                         # strong contrast
+```
+**Assumption:** A question that is NOT entailed by the guest statement requires genuinely new thinking.
+**Weakness:** Neutral class is broad — a random unrelated question also scores neutral.
+**Mitigation:** Specificity score (below) penalizes unrelated questions.
+**Test:** Score 50 real Lex questions vs 50 generic questions. Expected: real Lex mean > generic mean by >0.3.
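The label-to-score mapping above can be written as a small pure function. This sketch assumes the NLI classifier has already produced a label string and a confidence in [0, 1]; the function name and argument names are hypothetical, and the model call itself is omitted. Note that with these formulas the largest reachable value is 1.0 (neutral with confidence 1), below the 1.5 cap in the heading.

```python
def depth_from_nli(label: str, conf: float) -> float:
    """Map an NLI label and its confidence to r_depth, following the rules above."""
    if label == "entailment":
        # The question mostly restates the guest: score shrinks as confidence grows.
        return max(0.0, 0.5 - conf * 0.8)
    if label == "neutral":
        # The question goes beyond the guest statement.
        return 0.5 + conf * 0.5
    if label == "contradiction":
        # Strong contrast gets a flat score.
        return 0.8
    raise ValueError(f"unknown NLI label: {label!r}")
```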
+---
+## Component 3: SFT Log-Ratio (r_lex, 0–1.5)
+**What it is:**
+Train a small SFT model on (guest_context → Lex_question) pairs for 1 epoch.
+Then: `r_lex = clip(sft_lp(q|ctx) - base_lp(q|ctx), 0, 1.5)`
+**Why log-ratio vs raw SFT logprob:**
+Raw logprob favors short/common sequences regardless of quality.
+Log-ratio normalizes by the base model → measures "Lex-specific" signal.
+**Assumptions:**
+1. SFT model learns a distribution shift toward Lex style
+2. `sft_lp - base_lp > 0` reliably for real Lex questions
+3. The signal is large enough to be useful (>0.1 per sequence)
+**Tests to run:**
+- A. False positive rate: does log-ratio correctly score bad questions low?
+- B. Signal strength: mean(sft_lp - base_lp) for real Lex vs generic questions
+- C. Mode coverage: does the SFT model assign high probability to diverse good questions (not just the modal one)?
+**Risk:** If SFT only shifts by 0.01 nats/token after 1 epoch, signal is below noise → useless.
+**Mitigation:** Test first; if weak, train 3 epochs or use a Qwen 0.5B for the SFT model.
+---
+## Component 4: Specificity (r_specificity, 0–0.5)
+Word overlap between question and guest statement using content words (>5 chars, non-stopwords).
+```
+specific_overlap = |{w ∈ question : len(w)>5} ∩ guest_words| / |{w ∈ question : len(w)>5}|
+r_specificity = min(specific_overlap × 2.0, 0.5)
+```
+**Assumption:** A question that references the guest's specific terminology is more contextually grounded.
+**Weakness:** Penalizes questions that pivot to a NEW topic not in guest's statement (sometimes Lex's best moves).
+**Mitigation:** Keep weight low (0.5 max) so pivot questions still score well on NLI.
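A possible implementation of the specificity score is sketched below. It assumes the overlap is normalized by the number of content words in the question, which is one reading of the formula rather than a confirmed detail. The stopword set is a tiny hypothetical placeholder, and `content_words` is an illustrative helper name.

```python
import re

# Hypothetical minimal stopword list; only words longer than 5 chars matter here.
STOPWORDS = {"because", "through", "between", "should", "before", "during", "without"}


def content_words(text: str) -> set:
    """Lowercased words longer than 5 characters that are not stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if len(w) > 5 and w not in STOPWORDS}


def specificity_score(question: str, guest: str) -> float:
    """Share of the question's content words that also appear in the guest text."""
    q_words = content_words(question)
    if not q_words:
        return 0.0                      # no content words, nothing to anchor
    overlap = len(q_words & content_words(guest)) / len(q_words)
    return min(overlap * 2.0, 0.5)      # capped at 0.5 as in the design
```

Because of the cap, any question whose content words overlap the guest statement by 25% or more already receives the full 0.5, which keeps the weight low for pivot questions.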
+---
+## Component 5: Brevity (r_brevity, 0–0.5)
+```
+10–40 words → +0.5
+8–9 or 41–60 words → +0.25
+>100 words  → −0.3
+```
+**Assumption:** Lex's best questions are concise. Verbosity ≠ quality.
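The brevity bands can be sketched as a step function. Two things here are assumptions: word counts come from whitespace splitting, and lengths not covered by the bands (under 8 words, or 61 to 100 words) score 0.0.

```python
def brevity_score(question: str) -> float:
    """Step-function brevity score following the bands above."""
    n = len(question.split())   # whitespace word count (assumed tokenization)
    if 10 <= n <= 40:
        return 0.5              # ideal length
    if 8 <= n <= 60:
        return 0.25             # acceptable: 8-9 or 41-60 words
    if n > 100:
        return -0.3             # verbosity penalty
    return 0.0                  # uncovered lengths (assumed neutral)
```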
+---
+## Experiment Plan (small scale)
+### Exp 1: Hard Gate False Positive Rate
+- Input: 200 real Lex questions from transcripts
+- Expected: <2 false positives (>99% pass rate)
+- Failure: >5% false positive → gates need loosening
+### Exp 2: NLI Discrimination
+- Input: 50 real Lex questions + 50 ChatGPT generic follow-ups, same guest statements
+- Metric: Mann-Whitney U test on NLI scores
+- Expected: p < 0.01, Lex mean 0.3+ higher
+- Failure: no significant separation → NLI not useful for this task
+### Exp 3: SFT Model Signal Strength
+- Train: 1-epoch SFT on 10K (context, question) pairs from Lex transcripts
+- Input: 50 real Lex questions + 50 generic, per-token log-ratio
+- Metric: mean(log-ratio for Lex) vs mean(log-ratio for generic)
+- Expected: Lex mean > generic mean by >0.05 nats/token
+- Failure: <0.02 difference → signal too weak, train longer or use different base
+### Exp 4: Gaming Resistance
+- After Exp 3: sample 32 completions from base model for 5 prompts
+- Run through reward_v10 (all components)
+- Check: does reward correctly rank the 32 completions?
+- Manually label top-5 and bottom-5 → does reward agree with human judgment?
+### Exp 5: Mini GRPO (50 steps)
+- 10 prompts, 4 generations each, 50 steps
+- Reward: v10 (full combination)
+- Metric: reward trend + val_score at step 50
+- Expected: rewards climbing without obvious collapse patterns
+- Failure: collapse at step 20 → reward still gameable
+---
+## Tradeoff Summary
+| | Heuristic v8 | NLI v9 | Log-ratio | Combined v10 |
+|--|--|--|--|--|
+| **Ground truth** | No | Partially | Yes (real data) | Yes |
+| **Speed** | 0.0s | 0.5s | 0.1s | 0.6s |
+| **Gameable** | Yes (step 50) | Partially | Mode-seeking | Harder |
+| **Interpretable** | Yes | Partially | Yes | Yes |
+| **Circular** | No | No | No (frozen) | No |
+| **Effort to build** | Done | Done | 1h SFT + test | 2h |
+---
+## Go/No-Go Decision
+Proceed with v10 if Exp 2 + Exp 3 both pass.
+If Exp 3 fails (weak signal): drop log-ratio, proceed with NLI-only v9.
+If Exp 2 fails: reconsider NLI model or switch to sentence embedding distance.
+---
+*Author: vexorium | Date: 2026-03-29*

docs/REWARD_V11_DESIGN.md ADDED

	@@ -0,0 +1,187 @@

+# Reward v11 Design Doc — Simulated Response Information Gain
+**Status:** Design phase
+**Motivation:** v10 log-ratio is a stochastic parrot — rewards questions that sound like Lex
+rather than questions that *work* like Lex. A good interview question makes the guest
+say something they wouldn't have said otherwise. That's measurable.
+**Decision to pivot:** v10 val_score=6.70 at step 10 beat base (6.46), but the reward
+signal measures style match, not function. A model optimizing v10 converges on Lex's
+modal questions, not his best ones.
+---
+## Core Idea
+```
+reward(q | guest_context) = does q make the guest elaborate beyond what they already said?
+```
+Concretely:
+1. Policy generates question Q given guest statement G
+2. Frozen guest simulator generates response R to (G, Q)
+3. Measure: how much NEW, RELEVANT content does R contain vs G?
+---
+## Information Gain Formula
+```python
+novelty   = 1 - cosine_sim(embed(R), embed(G))   # R goes beyond G
+relevance = cosine_sim(embed(R), embed(Q))         # R actually addresses Q
+info_gain = novelty * relevance
+```
+**Why the product?**
+- Random/unrelated question: novelty HIGH (R goes somewhere new) but relevance LOW → near 0
+- Restatement question ("So you mean X?"): novelty LOW (R just confirms G) → near 0
+- Deep specific question: novelty HIGH AND relevance HIGH → high reward
+The product forces BOTH conditions simultaneously. Neither alone is sufficient.
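The three cases above can be checked with toy 2-D vectors in place of real sentence embeddings. This is a sketch only: the vectors are made up, and clipping both terms to [0, 1] is an added assumption, since raw cosine similarity can be negative.

```python
import numpy as np


def cosine_sim(a, b) -> float:
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def info_gain(e_G, e_Q, e_R) -> float:
    """novelty * relevance, with both terms clipped to [0, 1] (assumption)."""
    novelty = float(np.clip(1.0 - cosine_sim(e_R, e_G), 0.0, 1.0))
    relevance = float(np.clip(cosine_sim(e_R, e_Q), 0.0, 1.0))
    return novelty * relevance


e_G = [1.0, 0.0]                                                 # guest statement
restate = info_gain(e_G, e_Q=[1.0, 0.0], e_R=[1.0, 0.0])         # R repeats G
unrelated = info_gain(e_G, e_Q=[1.0, 0.0], e_R=[0.0, 1.0])       # R ignores Q
deep = info_gain(e_G, e_Q=[0.6, 0.8], e_R=[0.0, 1.0])            # new and on-question
```

In this toy setup only the third case gets a non-zero score, which is the behavior the product is meant to enforce.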
+---
+## Guest Simulator
+**Model:** `Qwen/Qwen2.5-0.5B-Instruct` (already loaded, frozen)
+**Why a different model than the policy:**
+- Policy is Nemotron 4B (Mamba hybrid)
+- Guest sim is Qwen 0.5B — different architecture, different weights
+- No gradient connection between them
+- Frozen throughout training
+**Why it doesn't need to be good:**
+The guest sim only needs to respond DIFFERENTLY to different questions.
+GRPO compares questions within a group (leave-one-out advantage).
+If sim responds more elaborately to Q1 than Q2, Q1 scores higher — that's the signal.
+Absolute quality of simulated responses doesn't matter, relative ordering does.
+**Prompt format:**
+```
+System: You are an expert being interviewed. Give a substantive, specific answer
+        to the follow-up question based on what you just said.
+User:   [guest_statement]
+        Follow-up question: [question]
+Assistant: [response]
+```
+**Generation params:** max_new_tokens=150, temperature=0.7, greedy fallback
+---
+## Embedding Model
+`sentence-transformers/all-MiniLM-L6-v2` (384-dim, already used in project)
+- Fast: ~0.1ms per embedding on GPU
+- Reasonable semantic similarity
+---
+## Full Reward
+```python
+def reward_v11(q, guest, sim_response):
+    # Hard gates (unchanged from v10)
+    if not passes_hard_gates(q):
+        return 0.0
+    # Embeddings
+    e_G = embed(guest)        # original guest statement
+    e_Q = embed(q)            # question
+    e_R = embed(sim_response) # simulated response
+    # Core: information gain
+    novelty   = 1 - cosine_sim(e_R, e_G)      # ~0–1 (up to 2 if cosine < 0)
+    relevance = cosine_sim(e_R, e_Q)            # ~0–1 (negative if cosine < 0)
+    info_gain = novelty * relevance             # ~0–1 in practice
+    # Secondary: brevity, specificity (small weight)
+    brevity     = brevity_score(q)             # 0–0.5
+    specificity = specificity_score(q, guest)  # 0–0.3
+    # Final
+    reward = 1.0 + info_gain * 2.5 + brevity + specificity  # 1.0–4.3
+    return clip(reward, 0.0, 5.0)
+```
+---
+## Speed Analysis
+| Component | Time (32 completions) | Notes |
+|-----------|----------------------|-------|
+| Hard gates | ~0.01s | Regex |
+| Guest sim generation | ~8–15s | Qwen 0.5B, 150 tokens, batched |
+| Embeddings | ~0.5s | MiniLM, batched |
+| **Total overhead** | **~15s** | vs current 0s |
+| Step time impact | ~+10% | ~15s added to a ~150s step |
+Acceptable.
+---
+## Assumptions
+1. **Guest sim responds differently to different questions**: expected by design (a causal LM conditions on the question), to be confirmed in Exp A
+2. **Embedding cosine distance captures "new information"** — approximate but consistent
+3. **Product(novelty, relevance) is ungameable**: a question must elicit a response that
+   both goes beyond G AND addresses Q; neither condition alone suffices
+4. **Qwen 0.5B is stable as frozen reference** — no drift risk (frozen)
+---
+## Failure Modes
+| Risk | Mitigation |
+|------|-----------|
+| Guest sim always gives same length response | Check variance in response length across questions |
+| Embedding space doesn't capture deep vs shallow | Validate on known good/bad pairs (Exp 1) |
+| Sim response too short to embed meaningfully | Min length gate: sim_response >= 20 words |
+| Policy learns to ask confusing questions (high novelty, low relevance) | Relevance term in product prevents this |
+---
+## Experiment Plan
+### Exp A: Guest Sim Variance Check
+- 5 guest statements × 5 questions each (mix good/bad)
+- Check: does sim_response vary meaningfully across questions?
+- Expected: std(len(responses)) > 20 tokens within same guest
+### Exp B: Info Gain Discrimination
+- 30 real Lex questions vs 30 generic questions (same guests)
+- Run through full reward_v11
+- Expected: real Lex mean > generic mean (p < 0.05)
+- If fails: embedding model is wrong choice
+### Exp C: Mini GRPO 50 steps
+- Same setup as v10 mini run
+- Watch for collapse and val_score trajectory
+- Success: val_score > 6.46 (base) at step 30 without collapse
+---
+## Go/No-Go
+- If Exp A fails: guest sim model needs to be larger or prompted differently
+- If Exp B fails: pivot embedding model or add NLI component
+- If both pass: launch full 500-step run
+---
+## What This Optimizes vs v10
+| | v10 (log-ratio) | v11 (info gain) |
+|--|--|--|
+| Optimizes for | Looking like Lex | Working like Lex |
+| Stochastic parrot? | Yes | No |
+| Grounded in? | Lex's distribution | Guest's elaboration |
+| Gameable by? | Mode-seeking | Nothing cheap |
+| Cost | 0.1s/step | +15s/step |
+---
+*Author: vexorium | Date: 2026-03-29*

docs/REWARD_V13_DESIGN.md ADDED

	@@ -0,0 +1,106 @@

+# Reward v13 Design — Guest-Grounded, Anti-Meta, Anti-Generic
+## Why v13 exists
+GRPO v21 and v22 taught two different lessons:
+- **v21** had the best thinking-enabled eval score (**0.867**) but still suffered from clipped / meta-spill failures.
+- **v22** greatly reduced clipping, but final eval dropped to **0.813**.
+This suggests the bottleneck is no longer primarily generation length. The next bottleneck is the **reward geometry**.
+## Failure modes observed
+### v21 eval outputs
+- explicit meta-spill: **12%**
+- generic/opening-style questions: **8%**
+### v22 eval outputs
+- explicit meta-spill: **4%**
+- generic/opening-style questions: **16%**
+- occasional off-domain drift remained
+Interpretation:
+- v22 solved more of the truncation problem
+- but shifted toward a cleaner, shorter, more generic local optimum
+## Design goals
+1. Keep the strong parts of v12:
+   - `uses_guest`
+   - `probing`
+2. Penalize obvious meta/task-restatement failures
+3. Penalize weak generic-opener templates when not lexically anchored to the guest
+4. Add a **soft overthinking penalty**
+5. Do **not** reward long hidden thinking directly
+## Core formula
+```python
+base = uses_guest**0.67 * probing**0.33
+reward = min(base + lexical_bonus, 1.0) * soft_penalty
+```
+### Base semantic reward
+- `uses_guest`: primary bottleneck
+- `probing`: secondary guardrail
+### Lexical bonus
+Small additive bonus for vocabulary echo from the guest statement.
+### Soft penalty
+Multiplicative penalty for:
+- explicit meta spill (`the user is asking...`, `I need to...`)
+- generic openers with weak lexical anchoring
+- obvious drift patterns
+- excessively long hidden thinking
+## Why not reward long thinking?
+Because v22 already showed that reducing clipping and allowing longer scratchpads did **not** automatically improve the final interviewer policy. Long thinking helps only if it improves the visible question. Therefore:
+- allow sufficient budget (`MAX_NEW_TOKENS=2560`, `MAX_SEQ=4096`)
+- but do not give extra reward for consuming it
+- instead, mildly discourage very long hidden reasoning once it goes past a healthy band
+## Soft overthinking penalty schedule
+Current intended schedule (based on hidden think token count):
+- <= 900 tokens: no penalty
+- > 900: ×0.95
+- > 1400: ×0.85
+- > 2000: ×0.70
+This is intentionally soft. The goal is not to suppress reasoning, only to discourage pathological rambling.
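The schedule above maps directly to a step function over the hidden think-token count. The function name is illustrative; the thresholds and multipliers are the ones listed.

```python
def overthinking_penalty(think_tokens: int) -> float:
    """Multiplicative penalty based on the hidden think-token count."""
    if think_tokens > 2000:
        return 0.70
    if think_tokens > 1400:
        return 0.85
    if think_tokens > 900:
        return 0.95
    return 1.0      # within the healthy band: no penalty
```

The returned factor would be one of the terms multiplied into `soft_penalty` in the core formula.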
+## Proposed run policy
+For the next experiment, prefer a **clean reward ablation**:
+- same generation budget as **GRPO v21**
+  - `MAX_NEW_TOKENS=1600`
+  - `MAX_SEQ=3072`
+- start from **GRPO v21**, because v21 is still the best policy checkpoint
+This isolates the reward change better than changing both reward and token budget at once.
+## Challenge to the current plan
+The main temptation is to assume that the reward is being "hacked by long thinking" and therefore to either:
+- reward long thinking more, or
+- punish it aggressively
+Both are likely wrong.
+The better interpretation is:
+- long thinking changed the sampling dynamics
+- the reward still accepted a too-generic local optimum
+- therefore the next fix should be **better answer-side reward shaping**, not more hidden-thought optimization
+## Recommendation
+Best next experiment:
+- `reward_v13`
+- same generation length as **v21**
+- start from `grpo-v21`
+- keep clip penalty
+- compare directly against v21 under thinking-enabled eval

docs/RL_VS_FILTERING_ANALYSIS_2026-03-30.md ADDED

	@@ -0,0 +1,107 @@

+# RL vs Data Filtering — First Principles Analysis
+*2026-03-30*
+## The Question
+Can RL (GRPO) address the template contamination problem better than filtering the training dataset?
+## Root Cause (Recap)
+Template contamination: 30% of generated training data uses generic openers
+("How do you think/reconcile/balance...") that require no vocabulary echo.
+Real Lex uses these only 2% of the time. The model learned a prior:
+`P(first_token='How' | interview_context)` is very high.
+## Why RL Is Genuinely Better at This
+### Where the template is stored
+The prior `P("How do you"|context)` lives in the embedding + early Mamba-2 SSM layers.
+These carry the autoregressive state that says "I'm in an interview, generate a question."
+### SFT (filtering) mechanism
+- Cross-entropy on (guest, question) pairs
+- Updates 4 attention LoRA weights only (1.01% of params)
+- The Mamba SSM state that GENERATES "How" is **unchanged**
+- Filtering reduces how many times templates appear in training targets
+- The prior shrinks slightly but persists — you can't subtract probability with cross-entropy
+### GRPO mechanism
+```
+Loss = -A_i × log P(question_i | guest)
+A_i = (r_i - mean(r)) / std(r)
+```
+- Echo questions → A_i > 0 → P(echo tokens | guest) **increases**
+- Template questions → A_i < 0 → P("How do you think" | guest) **decreases**
+- This gradient flows through **all 42 layers** via backprop, including Mamba SSM
+- RL can SUBTRACT probability mass from templates — SFT cannot
+**Critical constraint**: GRPO weight updates still only apply to the 40.5M LoRA params.
+The gradient *signal* reaches all layers, but only attention weights *change*.
+So RL suppresses templates via attention modulation of Mamba output — indirect but effective.
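The advantage formula in the GRPO block can be sketched as below. The `eps` term is an added assumption to avoid division by zero when every rollout in a group gets the same reward; the function name is illustrative.

```python
import numpy as np


def group_advantages(rewards, eps: float = 1e-6) -> np.ndarray:
    """A_i = (r_i - mean(r)) / std(r), computed within one rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# Echo questions (reward 1) vs template questions (reward 0) in one group.
adv = group_advantages([1.0, 0.0, 1.0, 0.0])
```

Rollouts above the group mean get a positive advantage and are pushed up; those below get a negative advantage and are pushed down. A group with identical rewards yields all-zero advantages and therefore no gradient.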
+### The generation variance evidence
+The model already generates non-template openers ~8% of the time:
+- "because you mentioned..."
+- "you mentioned X..."
+- "you say the..."
+- "you mentioned 'regulatory'..."
+These are the seeds RL needs to amplify. GRPO will:
+1. Reward these when they appear (positive advantage)
+2. Penalize "How do you think" when competing (negative advantage)
+3. Shift the distribution from ~8% echo openers → ~30%+
+## Why Both Together Is Optimal
+| Approach | What it fixes | What it can't fix |
+|----------|--------------|-------------------|
+| Filtering | Removes positive template training signal | Can't subtract existing prior |
+| RL (GRPO) | Actively penalizes templates, amplifies echoes | Can't directly rewrite Mamba SSM weights |
+| **Both** | Data-level + optimization-level suppression | — |
+**Ordering**: Filter first → LoRA v2 SFT → then GRPO from v2 checkpoint.
+- Filtered SFT reduces template prior (weakened starting point for RL)
+- RL has more variance in rollout groups when ~15-20% echo vs ~8% from LoRA v1
+- GRPO has stronger signal to amplify echoes and suppress templates
+## Optimal Reward Function: reward_v12
+```python
+reward = geometric_mean(logit_gap_uses_guest, logit_gap_probing) * hard_gate
+logit_gap = log P(YES) - log P(NO)  # continuous, from Qwen3.5-4B judge
+```
+**Why geometric mean of logits (not binary product)**:
+- Continuous → dense gradient at every step (binary gives 0 when all True)
+- Product structure → can't game one dimension alone
+- Geometric mean → equal weight, penalizes weakest link
+- `uses_guest × probing` specifically: fixes bottleneck (56%) while protecting strength (92%)
+**Why not all 3 judges in reward**:
+- on_topic at 72% is not the bottleneck; adding it makes reward sparser
+- Start with 2-judge reward, add on_topic if it degrades during training
+**Why not reward_v11 (info-gain)**:
+- Experimentally anti-correlated with uses_guest (-0.098) and probing (-0.244)
+- Off-topic question scored higher than specific on-topic ones (Exp 2)
+- Historical: GRPO v11 dropped uses_guest -8pp (Exp 3)
+## Variance Analysis
+- Current LoRA v1: uses_guest=56%, probing=92%
+- Expected group variance (n=8): 0.20 (sufficient for GRPO)
+- uses_guest at 56% creates good rollout variance — some echo, some template
+- At 72%+ uses_guest, variance would shrink and RL signal weakens (natural convergence)
+## Recommended Execution Plan
+```
+Step 1: LoRA v2 SFT (filtered + upsampled, ~30 min)
+  - Remove 1,247 generic-opener generated pairs
+  - Upsample real Lex 6×
+  - Expected: uses_guest 56% → ~65%
+Step 2: GRPO from LoRA v2 checkpoint (reward_v12, ~3h)
+  - reward = geom_mean(logit_uses_guest, logit_probing) × gate
+  - Starting point: 0.73+ with ~65% uses_guest
+  - Expected: uses_guest 65% → 75%+, maintain probing ≥ 90%
+  - Full score target: 0.80+
+```

docs/SSM_SCAN_FIX_PLAN.md ADDED

	@@ -0,0 +1,202 @@

+# Plan: Implement Correct Mamba-2 SSD Scan for GB10
+> Date: 2026-03-25
+> Challenge: Replace NVIDIA's broken `torch_forward` with a correct pure-PyTorch SSD scan
+## Resources Found
+### 1. `vasqu/mamba2-torch` — Pure PyTorch Mamba-2 with correct SSD scan ⭐
+- **URL**: https://github.com/vasqu/mamba2-torch
+- **Key file**: `tests/ssd_minimal.py` — clean, correct `ssd_minimal_discrete()` function
+- Supports: Triton kernels, Triton-only, or **pure PyTorch** (toggle via `config.use_triton_kernels = False`)
+- Claims output logits match reference implementation
+- Uses `einsum` operations matching the paper's math
+### 2. `tommyip/mamba2-minimal` — Minimal Mamba-2 in PyTorch
+- **URL**: https://github.com/tommyip/mamba2-minimal
+- "The model's output logits follow the same distribution as the reference implementation but are not numerically equivalent"
+- Device agnostic (CPU, MPS, CUDA)
+### 3. `alxndrTL/mamba.py` — Simple Mamba implementation
+- **URL**: https://github.com/alxndrTL/mamba.py
+- Has mamba2.py (beta) — "numerically equivalent" to reference
+- Implements scan as sequential loop
+### 4. PyTorch issue #146129 — `torch.compile` on Mamba2 produces NaNs
+- Confirms numerical sensitivity of the SSD algorithm
+- Even `torch.compile` with Inductor breaks it
+## Root Cause Analysis
+NVIDIA's `torch_forward` in `modeling_nemotron_h.py` has a "naive ssd implementation" that differs from the reference `ssd_minimal_discrete` in several ways:
+1. **Manual reshape/permute/loop** instead of `einsum` — error-prone dimension handling
+2. **`segment_sum` implementation** may differ from the stable `segsum` in the reference
+3. **Inter-chunk state propagation** uses different decay formula
+4. **BF16 intermediate precision** — some operations stay in BF16 when they should be FP32
+The reference `ssd_minimal_discrete` uses:
+- `torch.einsum` for all contractions (clear, correct)
+- `segsum` with masked fill to `-inf` (numerically stable)
+- All operations in FP32 via `assert X.dtype == A.dtype` (enforced)
+## Step-by-Step Plan
+### Step 1: Port `ssd_minimal_discrete` to GB10
+- Copy the reference implementation from `vasqu/mamba2-torch`
+- Ensure it works standalone with correct shapes
+- **Deliverable**: `ssd_scan_correct.py` with working function
+### Step 2: Validate against A100 reference tensors
+- Extract Mamba layer inputs (x, dt, A, B, C) from the first Mamba layer
+- Run both NVIDIA's `torch_forward` scan and our correct scan
+- Compare outputs with A100 reference
+- **Deliverable**: Test showing correct scan matches A100 within tolerance
+### Step 3: Monkey-patch NemotronH to use correct scan
+- Replace the SSM scan portion of `torch_forward` with the correct implementation
+- Keep the rest (projection, conv1d, gated norm, out_proj) unchanged
+- Validate full model output matches A100 reference
+- **Deliverable**: Patch function that can be applied at model load time
+### Step 4: Validate perplexity and generation quality
+- Run perplexity test on standard text (should be 10-50, not 250-700)
+- Run Q8 vs BF16 mismatch test (should be >80% top-1 agreement)
+- Generate text and compare with llama.cpp output
+- **Deliverable**: Perplexity < 50, top-1 agreement > 70%
+### Step 5: Integrate into training pipeline
+- Update `grpo_v4_train.py` or `sft_train.py` to use the correct scan
+- Run SFT training and verify loss is reasonable
+- Run GRPO training and verify stable loss
+- **Deliverable**: Working training pipeline on GB10
+### Step 6: Full training run + evaluation
+- SFT on interview data (2000 samples, 3 epochs)
+- Evaluate against base model (target: improve on 4.35/5)
+- **Deliverable**: Trained LoRA adapter
+## Critical Finding: llama.cpp Uses Sequential Recurrence, NOT Chunked SSD
+**PR #18058** shows llama.cpp implements Mamba-2 as a **simple sequential recurrence** in `ggml_compute_forward_ssm_scan_f32`:
+```c
+// Per token, per head:
+dt_sp = softplus(dt);               // computed once, reused below
+dA = exp(dt_sp * A);                // scalar decay
+// Per state dimension:
+s_new = s_old * dA + B * x * dt_sp; // state update
+y = sum(s_new * C);                 // output
+```
+This is:
+- **Numerically exact** — no cumulative sums, no segment sums, no exp of large values
+- **Simple** — ~30 lines of C, trivially portable to PyTorch
+- **The reference that works** — llama.cpp produces correct output
+NVIDIA's `torch_forward` uses the **SSD chunked algorithm** which is mathematically equivalent but numerically different. The chunked form computes `exp(cumsum(A))` which produces extreme values (A reaches -5569) causing underflow.
+**We should implement the sequential recurrence form, not the SSD chunked form.**
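A minimal NumPy sketch of the sequential recurrence is shown below, under several assumptions: the tensor layout (`x` as time × heads × head_dim, `B` and `C` already expanded from groups to heads) is chosen for illustration and does not match NemotronH's shapes, the softplus of `dt` is reused in both the decay and the input term, and the function name follows the later `ssm_scan_sequential()` only by convention. It is not the project's `ssm_scan_correct.py`.

```python
import numpy as np


def softplus(x):
    return np.logaddexp(0.0, x)   # numerically stable log(1 + exp(x))


def ssm_scan_sequential(x, dt, A, B, C, state=None):
    """Sequential Mamba-2 style scan in FP32.

    x: (T, H, P)  dt: (T, H)  A: (H,)  B, C: (T, H, N)
    Returns y: (T, H, P) and the final state: (H, P, N).
    """
    T, H, P = x.shape
    N = B.shape[-1]
    s = np.zeros((H, P, N), np.float32) if state is None else state.astype(np.float32)
    y = np.empty((T, H, P), np.float32)
    for t in range(T):
        dt_sp = softplus(dt[t].astype(np.float32))        # (H,)
        dA = np.exp(dt_sp * A)                            # (H,) per-head decay
        x_dt = x[t] * dt_sp[:, None]                      # (H, P)
        # State update: decay the old state, add the new outer product.
        s = s * dA[:, None, None] + x_dt[:, :, None] * B[t][:, None, :]
        y[t] = np.einsum("hpn,hn->hp", s, C[t])           # contract state dim
    return y, s
```

Each step only multiplies by a single `exp` of a bounded value, so no cumulative sum over the sequence is ever exponentiated. The returned state can be passed back in as `state` to continue a scan, which is the hook a generation cache would need.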
+## Key Insight
+The correct scan is **even simpler than `ssd_minimal_discrete`** — just a sequential loop. The hard part isn't the algorithm — it's correctly integrating it into NVIDIA's model code, which has:
+- Different tensor shapes and naming conventions
+- Group-to-head expansion logic
+- Conv1d and gated norm wrapping
+- Cache handling for generation
+## Phase A Results: BREAKTHROUGH ✅ (2026-03-25 04:20 UTC)
+### Correct Scan Implementation
+- Ported llama.cpp's `ggml_compute_forward_ssm_scan_f32` sequential recurrence to PyTorch
+- File: `ssm_scan_correct.py` — `ssm_scan_sequential()` function (~40 lines)
+- Monkey-patches `torch_forward` on all 21 Mamba layers
+### Validation Results
+| Metric | Old torch_forward | Correct Scan | Δ |
+|--------|------------------|-------------|---|
+| Perplexity | 250-700 | **9.02** | ~28-78x lower |
+| Q8 vs BF16 top-1 | 27% | **79%** | +52 pts |
+| Q8 vs BF16 top-5 | 43% | **90%** | +47 pts |
+| Training CE loss | 3.88 | **2.73** | -30% |
+| Loss decrease (1 step) | ~3.88→~3.88 | **2.73→2.71** | ✅ Converging |
+| Grad norm | 64 | **14** | ~5x lower |
+| Top predictions | `' '`, `'2'`, `','` | `' curiosity'`, `' empathy'`, `' the'` | Coherent! |
+### Key Findings
+1. The sequential recurrence (`s = s * dA + B * x * dt`) from llama.cpp produces **numerically correct** SSM scan outputs on GB10
+2. The chunked SSD in `torch_forward` uses `exp(cumsum(A))` which produces extreme values (A reaches -5569) causing catastrophic underflow in BF16 — this is what caused all previous training failures
+3. The corrected scan produces coherent word predictions matching llama.cpp's Q8 output
+4. Training with the corrected scan shows proper loss decrease and stable gradients
+## Updated Plan (2026-03-25 04:20 UTC)
+### Phase A: Sequential Recurrence (correctness first)
+1. Port llama.cpp's `ggml_compute_forward_ssm_scan_f32` to PyTorch (~20 lines)
+2. Monkey-patch into NemotronH's Mamba layers
+3. Validate against A100 reference tensors (target: >80% top-1 agreement)
+4. Validate perplexity (target: <50 on common English)
+5. If correct → proceed to Phase B
+### Phase B Results: TRAINING VALIDATED ✅ (2026-03-25 04:30 UTC)
+| Metric | Value | Status |
+|--------|-------|--------|
+| CE Loss | 2.73 | ✅ Normal (was 3.88-9.99 with old scan) |
+| Grad norm | 14.2 | ✅ Stable (was 64-1200) |
+| LoRA layers with gradients | 92 | ✅ All layers |
+| Loss after 1 optimizer step | 2.73 → 2.71 | ✅ Decreasing |
+### Phase C: SFT Training Plan
+**Architecture**: Pure PyTorch SFT (no llama.cpp needed for training)
+- Model: Nemotron-3-Nano-4B with corrected SSM scan
+- Training: HF Trainer + LoRA
+- Venv: `.venv-train` (transformers 4.48.3)
+- Monkey-patch all 21 Mamba layers at load time
+**Data**: `interview_segments_v2.jsonl` (7,580 segments)
+- Each segment: system prompt + guest answer + interviewer question
+- Format: chat template with `<|im_start|>` tags
+- Train/val split: 95% / 5%
+**Hyperparameters** (based on Kaggle notebook + our validated config):
+```yaml
+lora_rank: 64
+lora_alpha: 256
+target_modules: all-linear
+learning_rate: 2e-4
+lr_scheduler: cosine
+warmup_ratio: 0.1
+epochs: 3
+batch_size: 1
+gradient_accumulation: 4
+max_seq_length: 512
+bf16: true
+max_grad_norm: 1.0
+```
+**Estimated time**: ~3 min/step × 1500 steps = ~75 hours (sequential scan is slow)
+**Speed optimization**: If 75 hours is too long:
+- Reduce to 1000 samples × 1 epoch = ~250 steps = ~12 hours
+- Or implement `ssd_minimal_discrete` chunked version for parallelism
+- Or reduce max_seq_length to 256 (~halves time)
+**Evaluation**: After training, compare with base model (4.35/5) using the existing eval suite
+### Resources
+- llama.cpp sequential scan: `ggml/src/ggml-cpu/ops.cpp:9284` (ggml_compute_forward_ssm_scan_f32)
+- vasqu/mamba2-torch chunked scan: `tests/ssd_minimal.py` (ssd_minimal_discrete)
+- vLLM Mamba-2 on SM 12.x: confirmed working (vllm issue #34452)
+- A100 reference tensors: `reference/reference_tensors.pt`
+## Risk Assessment
+| Risk | Mitigation |
+|------|------------|
+| Correct scan still doesn't match A100 | We have reference tensors — iterate until match |
+| Performance too slow | Already accepted 3 min/step; correct > fast |
+| Integration breaks other layers | Only patch Mamba layers, leave attention/MLP untouched |
+| Memory issues | Same model, same memory — just different math |

docs/SYNTHETIC_DATA_ANALYSIS_2026-03-30.md ADDED

	@@ -0,0 +1,153 @@

+# Synthetic Data Strategy — Full Analysis
+*2026-03-30 | 5 experiments run*
+## The Question
+Can we generate more synthetic data to push `uses_guest` from 56% to 70%+ for LoRA v2?
+## Answer: No — but here's what will actually work.
+---
+## Experiments Run
+### Exp 1: reward_v11 correlation with uses_guest (n=25)
+- `uses_guest=True` questions score **-0.098 lower** on reward_v11
+- `probing=True` questions score **-0.244 lower** on reward_v11
+- **GRPO with reward_v11 is anti-correlated with both target dimensions**
+### Exp 2: Controlled reward_v11 test (same guest, 6 question types)
+- Off-topic question scored 2.613 vs specific on-topic at 2.446
+- reward_v11 doesn't penalize off-topic behavior
+- **Confirmed: info-gain (novelty × relevance) rewards genericity**
+### Exp 3: Historical — GRPO v11 result
+- 148 steps with reward_v11: uses_guest −8pp, probing −4pp
+- Exactly what the anti-correlation predicts
+### Exp 4: reward_v11 on LoRA v1 vs Base
+- LoRA v1 (better model, 0.733) scores **lower** on reward_v11 than base (0.653)
+- The reward can't even rank the models correctly
+### Exp 5: Echo-targeted prompt vs standard generation (n=60, judged) ← NEW
+- Hypothesis: "MUST reference specific words" prompt → more uses_guest in synthetic data
+- **Result: FAILED**
+|  | Standard | Echo | Delta |
+|--|---------|------|-------|
+| score | 0.611 | 0.500 | −0.111 |
+| on_topic | 63% | 50% | −13pp |
+| uses_guest | 43% | **35%** | **−8pp** |
+| probing | 76% | 64% | −12pp |
+| vocab overlap | 1.93 | 2.57 | +0.64 |
+Overlap goes up but quality goes down — the echo constraint causes trivial short questions
+("What is verification?", "What weight?") that have vocab overlap but fail all 3 judges.
+---
+## Root Cause: uses_guest Gap Is NOT a Data Volume Problem
+Training set already has 96% uses_guest=True labels (4,772 pairs).
+Model achieves only 56% on held-out eval. Gap = generalization issue, not coverage.
+Two explanations:
+1. **Label noise**: judge sees full guest paragraph, model trained on truncated (512 tok) input.
+   Overlap words present in full text but not in what model sees at training time.
+2. **Distribution shift**: model learned vocab-echo for in-distribution topics,
+   generalizes to question *style* but not vocabulary echo on new domains.
+**Either way: more data won't fix it.** Scaling model: +5,000 pairs → only +12pp.
+## Data Available (if needed)
+| Pool | Size | Notes |
+|------|------|-------|
+| score=1.0 (current training) | 4,772 | Already used |
+| score=0.667, overlap≥2 | 3,979 | Unused reservoir |
+| Transcript segments total | ~28,728 | 114 episodes |
+| Unused unique guest statements | 936 | Could generate new completions |
+## Paths That Will Actually Work
+### PATH 1: Loss-weighted SFT (RECOMMENDED FIRST) ← no new data
+Weight training loss by vocab overlap: `loss × (1 + 0.5 × overlap_count)`
+Forces model to pay more attention to high-overlap examples.
+No new data, no new generation, ~30 min.
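The weighting rule in PATH 1 could look like the sketch below, applied to per-example losses. Normalizing by the sum of weights is an added assumption (it keeps the loss scale comparable to the unweighted mean), and the function name is hypothetical.

```python
import numpy as np


def weighted_sft_loss(per_example_loss, overlap_count) -> float:
    """Weight each example's loss by (1 + 0.5 * overlap_count), then average."""
    loss = np.asarray(per_example_loss, dtype=float)
    weights = 1.0 + 0.5 * np.asarray(overlap_count, dtype=float)
    # Normalize by total weight so the overall loss scale stays comparable.
    return float((weights * loss).sum() / weights.sum())


# Two examples: zero overlap (weight 1.0) and overlap of 2 (weight 2.0).
batch_loss = weighted_sft_loss([2.0, 1.0], [0, 2])
```

With these weights, an example echoing two guest words counts twice as much as one echoing none.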
+### PATH 2: Second-stage SFT on high-overlap subset ← no new data
+Filter to overlap≥3 pairs (1,579 total: 1,390 gen + 189 real).
+Fine-tune LoRA v1 for 1 more epoch on only these pairs.
+~10 min training.
+### PATH 3: New transcript sources ← genuinely new data
+Tim Ferriss, Dwarkesh Patel, Sean Carroll podcasts — same format, diverse domains.
+~2h to crawl + judge, could yield 5-10k high-quality pairs.
+Only needed if Paths 1+2 plateau below 65%.
+### PATH 4: reward_v12 (judge-as-reward GRPO) ← most powerful
+Use the 3 binary judges directly as GRPO reward.
+Directly optimizes the metric — no proxy mismatch.
+Starting from LoRA v1 (0.733 baseline).
+~3h to implement + test.
+## Decision Tree
+```
+Try PATH 2 (10 min) → eval
+  if uses_guest ≥ 65% → try PATH 1 too, then PATH 4
+  if uses_guest < 60% → PATH 2 didn't work, go to PATH 1 with higher weights
+If plateau at ~65% after 1+2:
+  Build PATH 4 (reward_v12) — this is the ceiling-breaker
+```
+## Do NOT Do
+- ✗ Echo-targeted generation (Exp 5: makes everything worse)
+- ✗ More of the same synthetic data (Exp 1-4: wrong signal, won't generalize)
+- ✗ GRPO with reward_v11 (anti-correlated with target dimensions)
+---
+## Deep Dive: Why New Podcast Sources Don't Help (Experiments 6-8, 2026-03-30)
+### Experiment 6: OOV Rate vs uses_guest Correlation
+**Question**: Do guests with vocabulary not seen in training cause uses_guest failures?
+- High-OOV guests (≥30% novel words, n=9): uses_guest = **56%**
+- Low-OOV guests (<30% novel words, n=16): uses_guest = **56%**
+- **Delta: -1%. Zero correlation.**
+Domain diversity does not predict success or failure. The model succeeds and fails equally on familiar and novel topics.
+### Experiment 7: Failure Mode Classification
+11 uses_guest=False cases broken down:
+- **Off-topic (6/11, 55%)**: on_topic=False; the model asked about a different subject entirely
+- **Generic-probing (5/11, 45%)**: on_topic=True, probing=True — right direction, wrong vocab
+Both failure types share the same pattern: generic openers ("How do you think/see/envision")
+that don't require referencing the guest's specific words.
+### Experiment 8: Template Contamination Discovery
+The smoking gun:
+| Source | Generic opener rate | Examples |
+|--------|-------------------|---------|
+| Real Lex (697 pairs) | **2%** | "Can you speak to...", "Do you think that's...", "What is..." |
+| Generated (4,075 pairs) | **30%** | 253× "How do you think", 198× "Why do you think", 159× "How do you reconcile" |
+The generated data (85% of training) teaches 5 near-universal templates.
+These openers work for ANY guest without referencing specific vocabulary.
+The model learned to use them universally → uses_guest fails on novel prompts.
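+The opener rate in the table could be measured roughly as follows. This is a sketch: only the three openers named in the table are listed, since the remaining templates are not spelled out here, and `is_generic_opener` / `generic_opener_rate` are hypothetical helpers.
```python
# Openers taken from the table above; the full template list is an assumption.
GENERIC_OPENERS = (
    "how do you think",
    "why do you think",
    "how do you reconcile",
)


def is_generic_opener(question, openers=GENERIC_OPENERS):
    """True if the question starts with one of the generic templates."""
    return question.strip().lower().startswith(openers)


def generic_opener_rate(questions):
    """Fraction of questions that begin with a generic opener."""
    if not questions:
        return 0.0
    return sum(is_generic_opener(q) for q in questions) / len(questions)


if __name__ == "__main__":
    sample = [
        "How do you think about risk?",
        "What is verification?",
        "Why do you think that happened?",
        "Can you speak to the weight update?",
    ]
    print(generic_opener_rate(sample))
```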
+### Why Dwarkesh/Carroll Podcasts Won't Fix This
+1. The failure is template contamination, not domain coverage — OOV test proves it
+2. New podcasts would be generated the same way → same base-model template bias
+3. Mixed interviewer styles could dilute the real Lex signal
+4. **Exception**: REAL Dwarkesh questions (not generated) show vocab-echo naturally — but only if you use his actual words from transcripts, not model-generated imitations
+### Revised Solution: Filter + Upsample (no new data needed)
+- Remove 1,247 generic-opener generated pairs (26% of training set)
+- Upsample 697 real Lex pairs 6× → 4,182 effective examples
+- Real Lex weight: 15% → 54% of training signal
+- Expected: uses_guest 56% → 65-70%
+- Cost: zero new data, ~30 min training
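+A quick check of the mix arithmetic, as a sketch. `training_mix` is a hypothetical helper and the counts come from the bullets above. Under these counts the real-Lex share works out to roughly 60% rather than 54%, so the quoted figure may rest on a different accounting.
```python
def training_mix(real, generated, removed_generated, real_upsample):
    """Effective example counts after filtering generated pairs and
    upsampling the real pairs."""
    kept_generated = generated - removed_generated
    effective_real = real * real_upsample
    total = effective_real + kept_generated
    return {
        "kept_generated": kept_generated,
        "effective_real": effective_real,
        "total": total,
        "real_share": effective_real / total,
    }


if __name__ == "__main__":
    before = 697 / (697 + 4075)  # real share before any change
    after = training_mix(real=697, generated=4075,
                         removed_generated=1247, real_upsample=6)
    print(f"before: {before:.1%}")
    print(f"after:  {after['real_share']:.1%} of {after['total']} examples")
```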

docs/TECHNICAL_CHALLENGES.md ADDED Viewed

	@@ -0,0 +1,181 @@

+> **Status:** 📋 PRE-PROJECT — Technical risks identified before starting. To be updated as challenges are resolved or new ones emerge.
+# Technical Challenges — Lex Fridman AI Interviewer
+## P0: Must Resolve Before Full Pipeline
+### 1. Speaker Separation in Transcripts
+Existing HF datasets (`Drewd/lex_fridman_podcast_transcripts`, `Whispering-GPT/...`) are **flat text** — no speaker labels. We need to know which text is Lex and which is the guest.
+**Options:**
+| Approach | Accuracy | Speed | Cost |
+|----------|----------|-------|------|
+| Whisper + pyannote diarization | ~95% | Slow (real-time per episode) | Free (local) |
+| Heuristic splitting (short=Lex, long=guest) | ~80% | Fast | Free |
+| LLM-based splitting (Claude/GPT-4) | ~98% | Medium | ~$0.50/episode |
+**Mitigation:** Test diarization on 3 episodes first. If too slow, try LLM splitting on a batch.
+### 2. Mamba-2 + LoRA Compatibility
+Nemotron 3 Nano 4B has:
+- **Mamba-2 layers** (majority) — state-space model, NOT attention
+- **4 attention layers** only
+LoRA traditionally targets `q_proj, k_proj, v_proj, o_proj` — but Mamba-2 layers don't have these.
+**Key questions:**
+- Which Mamba-2 parameters does LoRA target? (`in_proj`, `out_proj`, `dt_proj`?)
+- Is the LoRA effective on Mamba layers, or does it only fine-tune the 4 attention layers?
+- If LoRA only hits 4/N layers, we might need full fine-tune instead
+**Mitigation:** Run a 5-minute SFT test with Unsloth. Check which layers get LoRA adapters. Inspect `model.print_trainable_parameters()`.
+---
+## P1: Must Validate Early
+### 3. Inverted Role Training
+All instruction-tuned models are trained: user asks → assistant answers. We're doing the opposite: assistant asks → user answers.
+**Risks:**
+- Model slips into "answering mode" and lectures instead of questioning
+- Model generates both the question AND the answer
+- Model refuses to ask questions and tries to be helpful instead
+**Mitigation:** Train on 100 examples, evaluate whether the model asks or answers. If SFT can't overcome the instruction-tuning prior, GRPO with a question-format reward may be needed.
+### 4. Model Freshness (Released 2 Days Ago)
+- Unsloth claims day-zero support but edge cases/bugs are likely
+- vLLM requires `>=0.15.1` for Nemotron 3
+- HF `transformers` support may need latest main branch
+- Custom reasoning parser needed for inference (`nano_v3_reasoning_parser.py`)
+- Community examples are near-zero — we're early adopters
+**Mitigation:** Verify Unsloth SFT runs end-to-end on 10 fake examples before committing to the full data pipeline.
+---
+## P2: Address During Development
+### 5. Evaluation is Harder Than Routangseng
+For routangseng we checked: does it start with a judgment? Does it have analogies? Simple heuristics worked.
+For an interviewer, "good question" is much more subjective:
+- Is "What is consciousness?" good for a neuroscientist? Yes. For a plumber? No.
+- Context-dependence makes heuristic eval harder
+- May need to rely more on LLM judge or human eval
+**Possible heuristics:**
+- Ends with `?` (asks a question, not a statement)
+- References the guest's last answer (listening)
+- Under 50 words (short questions = Lex style)
+- Doesn't repeat a previous question
+- Doesn't contain generic filler ("That's interesting, tell me more")
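+The heuristics above could be sketched as a single check function. Everything here is illustrative: `question_checks`, the filler phrase list, and the "shares a word of five or more letters with the guest's answer" proxy for listening are assumptions, not a settled eval.
```python
import re

# Illustrative filler phrases; a real list would need tuning.
FILLER_PHRASES = ("that's interesting", "tell me more")


def _content_words(text, min_len=5):
    """Lowercased words of at least min_len letters (crude stopword filter)."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if len(w) >= min_len}


def question_checks(question, guest_answer="", previous_questions=(), max_words=50):
    """Return one boolean per heuristic; True means the check passed."""
    text = question.strip()
    lowered = text.lower()
    seen = {p.strip().lower() for p in previous_questions}
    return {
        "ends_with_question_mark": text.endswith("?"),
        "references_guest": bool(_content_words(text) & _content_words(guest_answer)),
        "under_word_limit": len(text.split()) < max_words,
        "not_repeated": lowered not in seen,
        "no_generic_filler": not any(p in lowered for p in FILLER_PHRASES),
    }


if __name__ == "__main__":
    print(question_checks(
        "What does consciousness mean to you?",
        guest_answer="I study consciousness in the brain.",
    ))
```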
+### 6. Question Diversity Collapse
+Risk: the model learns 5-10 "Lex templates" and rotates through them:
+- "What does X mean to you?"
+- "Take me back to when you first..."
+- "What gives you hope?"
+This is the equivalent of routangseng's "用户问..." ("The user asks...") problem: a format the model falls into. It is harder to detect heuristically.
+**Mitigation:** Monitor during eval. If detected, add GRPO with a diversity reward (penalize n-gram overlap with previous questions in the conversation).
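+One way the n-gram overlap penalty might look, as a rough sketch. `ngrams` and `diversity_penalty` are hypothetical helpers, whitespace tokenization and trigram size are assumptions, and how the penalty is folded into the reward is left open.
```python
def ngrams(text, n=3):
    """Set of word n-grams from a whitespace-tokenized, lowercased string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def diversity_penalty(candidate, previous_questions, n=3):
    """Fraction of the candidate's n-grams already used earlier in the
    conversation: 0.0 means fully novel, 1.0 means fully repeated."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    seen = set().union(*(ngrams(q, n) for q in previous_questions))
    return len(cand & seen) / len(cand)


if __name__ == "__main__":
    history = ["What does hope mean to you?"]
    print(diversity_penalty("What does love mean to you?", history))
```
+A reward could then subtract some multiple of this value, so templated questions that reuse earlier phrasing score lower.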
+### 7. Multi-Turn Context Management
+Lex interviews are 2-3 hours = ~50K-100K tokens. Training constraints:
+- `max_seq_length=4096` only captures ~5 minutes of conversation
+- Longer sequences → more VRAM, slower training
+- But shorter segments lose the conversational arc (topic bridging, callbacks)
+**Tradeoff:** Train on short segments (3-8 turns, ~2K tokens), rely on 1M inference context for long-arc skills. The model may not learn long-arc techniques like topic bridging from short training segments — but the base model's 1M context + Lex's patterns in the data may be enough.
+### 8. Thinking Mode Interaction
+Nemotron 3 uses `<think>` token ID 12 and `</think>` token ID 13 (integer IDs, not text tokens like Qwen3.5).
+**Open questions:**
+- Should the model think before asking a question?
+- If yes, what should the thinking content look like? ("The guest just mentioned X, I should probe deeper on...")
+- If no, how do we suppress it without breaking the model?
+- Does the training data need `<think>...</think>` blocks?
+**Mitigation:** Test early — generate with thinking ON and OFF, compare question quality.
+---
+## Risk Mitigation: Day 1 Plan
+**Don't start the full data pipeline.** Instead, de-risk the two biggest unknowns:
+| # | Test | Time | What it tells us |
+|---|------|------|-----------------|
+| 1 | Download Nemotron 3 Nano 4B | 10 min | Verify it loads |
+| 2 | Unsloth SFT with 10 fake interview examples | 30 min | Does fine-tuning work end-to-end? |
+| 3 | Inspect LoRA targets on Mamba-2 layers | 5 min | How many layers actually get trained? |
+| 4 | Test speaker separation on 3 real Lex transcripts | 1 hour | Which diarization approach works? |
+| 5 | Generate 5 questions with fine-tuned model | 10 min | Does the model ask or answer? |
+If all 5 pass → proceed to full pipeline.
+If test 2 or 3 fails → evaluate fallback to full fine-tune or a different base model.
+If test 4 fails → invest in LLM-based splitting.
+---
+## Challenge Resolution Log
+| Date | Challenge | Status | Resolution |
+|------|-----------|--------|------------|
+| 2026-03-19 | Speaker separation | ✅ Solved | lexfridman.com has official transcripts with `<span class="ts-name">` speaker labels. No diarization needed. |
+| 2026-03-19 | Mamba-2 + LoRA compatibility | ✅ Solved | Unsloth handles Mamba-2 LoRA automatically. 10.1M trainable params (0.38%). |
+| 2026-03-19 | Triton ptxas on Blackwell | ✅ Solved | Symlink system ptxas (CUDA 13.2) over Triton's bundled ptxas (CUDA 12.8): `ln -sf /usr/local/cuda/bin/ptxas /path/to/triton/backends/nvidia/bin/ptxas` |
+| 2026-03-19 | Python inference garbage | ✅ Workaround | Unsloth/transformers generation produces garbage on GB10. llama.cpp GGUF works perfectly (889 tok/s prompt, 53 tok/s gen). Use llama.cpp for all inference. |
+| 2026-03-19 | Model merge breaks Mamba | ✅ Confirmed | `merge_and_unload()` on Nemotron hybrid → garbage output. Keep LoRA separate, export to GGUF via Unsloth. |
+| 2026-03-19 | OOM during training | ✅ Solved | Caused by variable-length sequences in batch. Fix: measure actual token distribution, set max_seq_length to P100 + 10% buffer. |
+| 2026-03-19 | SFT worse than base | ⚠️ Open | SFT model (2.10/5) scored worse than base (4.35/5). Root cause under investigation — likely training data format or special token handling. |
+| 2026-03-19 | max_tokens for thinking | ✅ Solved | Nemotron/Gemini spend tokens on `<think>` reasoning. Must set max_tokens ≥ 800 to leave room for actual answer. |
+| 2026-03-19 | Inverted role training | ⚠️ Open | Not yet validated whether the model learned to ask (not answer). Need better eval after fixing inference path. |
+| 2026-03-23 | Off-policy GRPO fails | ❌ Failed | llama.cpp (base) generates completions, LoRA gets gradient updates — models are decoupled. LoRA diverges to gibberish after ~50 steps. See `docs/GRPO_V3_POSTMORTEM.md` for 6 identified gaps. |
+| 2026-03-23 | GGUF converter: NemotronH MoE | ✅ Workaround | `NemotronHConfig` has MoE defaults (`num_experts_per_tok=2`). Converter detects these and uses wrong architecture. Fix: set `num_experts_per_tok=0` in config.json + patch converter check to `> 0`. |
+| 2026-03-23 | Model merge → garbage | ✅ Confirmed | `merge_and_unload()` + GGUF export works mechanically, but GRPO v3 LoRA weights produce gibberish. This is a training failure, not a merge/export bug. |
+## Key Lessons Added
+### Eval-First Principle
+**Always run eval on the base model through the production inference pipeline BEFORE training.**
+We could have caught the Triton/Mamba inference issue in 5 minutes instead of spending hours training a model that couldn't generate.
+### Data-Driven Hyperparameters
+Don't pick round numbers. Measure the data distribution and pick values that fit.
+- `max_seq_length`: P100 of actual token lengths + 10% buffer, aligned to 64
+- `max_tokens` for inference: must account for thinking budget
+- `batch_size`: profile VRAM at target sequence length, leave 30% headroom
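+The `max_seq_length` rule above (longest observed sequence plus 10%, rounded up to a multiple of 64) can be written out as a short sketch. `pick_max_seq_length` is a hypothetical helper, and the token lengths are assumed to come from tokenizing the actual training set.
```python
import math


def pick_max_seq_length(token_lengths, buffer_pct=10, align=64):
    """P100 (max) of observed token lengths, plus a percentage buffer,
    rounded up to the next multiple of `align`."""
    longest = max(token_lengths)
    padded = math.ceil(longest * (100 + buffer_pct) / 100)
    return math.ceil(padded / align) * align


if __name__ == "__main__":
    # Example lengths only; measure the real distribution before training.
    print(pick_max_seq_length([100, 900, 1500]))
```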
+### Infrastructure Reliability Hierarchy (for Blackwell/DGX Spark)
+1. **llama.cpp** — most reliable, NVIDIA-supported, works out of the box
+2. **Unsloth training** — works for SFT/GRPO with ptxas fix
+3. **Python inference** — broken for Mamba hybrid on this hardware, avoid
+### Off-Policy RL is Fundamentally Broken for Hybrid Architectures
+Do NOT attempt off-policy GRPO where the generator and learner are different models. Specific gaps identified in GRPO v3:
+1. **Off-policy generation** — generator (llama.cpp base) ≠ learner (HF + LoRA)
+2. **No KL reference** — `-β * mean(log_probs)` is NOT a KL divergence
+3. **Token truncation** — 512 max_length in forward pass vs 800 max_tokens in generation
+4. **Mamba architecture** — LoRA can't modify SSM recurrence dynamics (A, D, conv1d)
+5. **Thinking in training, not in reward** — gradient covers `<think>` tokens, reward ignores them
+6. **No credit assignment** — uniform token weighting, no per-token reward signal
+See `docs/GRPO_V3_POSTMORTEM.md` for the full analysis.
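+To illustrate gap 2, here is a small sketch contrasting the `-β * mean(log_probs)` term with a KL estimate that actually uses reference log-probs. The per-token estimator `exp(d) - d - 1` with `d = ref - policy` is a commonly used non-negative KL estimator. It is an assumption here and not necessarily what any version of the training script used.
```python
import math


def log_prob_penalty(policy_logps, beta):
    """The v3-style term: -beta * mean(log_probs). No reference model involved."""
    return -beta * sum(policy_logps) / len(policy_logps)


def kl_estimate(policy_logps, ref_logps):
    """Mean per-token KL estimate exp(d) - d - 1, with d = ref - policy.
    It is zero when the policy and the reference agree on every token."""
    terms = []
    for lp, ref in zip(policy_logps, ref_logps):
        d = ref - lp
        terms.append(math.exp(d) - d - 1.0)
    return sum(terms) / len(terms)


if __name__ == "__main__":
    logps = [-1.0, -2.0]
    # Policy identical to reference: a true KL term must vanish.
    print(kl_estimate(logps, logps))
    # The v3-style term is still non-zero in that situation.
    print(log_prob_penalty(logps, beta=0.02))
```
+When the policy has not moved from the reference, the estimate is zero, while the log-prob term still pushes on the model.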
+*To be updated as new challenges are resolved.*
+---
+*Created: 2026-03-18*

docs/TECHNICAL_REVIEW_2026-03-23.md ADDED Viewed

	@@ -0,0 +1,205 @@

+# Technical Review — Assumptions That Led to Off-Policy GRPO
+> **Date:** 2026-03-23
+> **Purpose:** Retrospective on two false assumptions that forced us into the broken off-policy architecture, and what alternatives actually existed.
+---
+## The Decision Chain
+```
+GPU can't do autoregressive generation (PyTorch 2.9 + SM 12.1)
+    ↓
+"Only llama.cpp can generate" ← Assumption 1
+    ↓
+"Can't get gradients from llama.cpp" ← Assumption 2
+    ↓
+Off-policy GRPO: llama.cpp generates, HF model trains on those completions
+    ↓
+6 critical gaps → LoRA diverges to gibberish
+```
+Both assumptions were partially true but had alternatives we didn't explore.
+---
+## Assumption 1: "Only Unsloth/TRL exist for training"
+**What we believed:** HF transformers + Unsloth is the only viable training framework for Nemotron-3-Nano-4B. Since autoregressive generation produces garbage on GB10 (PyTorch 2.9 doesn't fully support SM 12.1), we can't do on-policy RL.
+**What we missed:**
+### NVIDIA NeMo RL
+NVIDIA's own open-source RL training library, purpose-built for Nemotron models.
+- **Supports:** GRPO, DAPO, SFT, DPO, RM, on-policy distillation
+- **On-policy GRPO:** Uses vLLM for generation within the training loop — same model generates and trains (fixes Gap 1)
+- **Proper KL divergence:** Built-in reference policy (fixes Gap 2)
+- **Architecture-aware:** Designed for Nemotron's hybrid Mamba-Transformer-MoE architecture
+- **Single node support:** Can run on 1 GPU (our setup)
+- **URL:** https://docs.nvidia.com/nemo/rl/latest/
+- **GitHub:** https://github.com/NVIDIA-NeMo/RL
+**Key question (untested):** Does NeMo RL's vLLM path handle Mamba-2 autoregressive generation correctly on SM 12.1? If the CUDA kernel issue is PyTorch-specific and vLLM has its own kernels, NeMo RL could work end-to-end on our GB10.
+**Risk:** NeMo RL is designed for multi-GPU clusters. Single-GPU support exists but may have rough edges. Docker-based workflow could hit the same causal-conv1d/mamba-ssm build issues we saw before.
+### NVIDIA's Own Training Pipeline
+The Nemotron-3 paper and blog describe their full training pipeline including RL. NVIDIA trained these models with NeMo RL internally. The training recipe, datasets, and configurations are published.
+**What this means:** The "right" way to RL-train Nemotron is NeMo RL, not a custom script on top of HF transformers.
+---
+## Assumption 2: "Can't use llama.cpp for training"
+**What we believed:** llama.cpp is inference-only. No gradient computation, no backprop. So we need a separate HF model for training.
+**What we missed:**
+### llama.cpp `llama-finetune` (SFT LoRA training)
+llama.cpp has had a built-in LoRA fine-tuning binary since 2023 (PR #2632).
+- **Binary exists:** `/home/bobber/llama.cpp/build/bin/llama-finetune` ✅
+- **Supports:** LoRA SFT with AdamW/SGD, learning rate scheduling, validation split, checkpointing
+- **Works on GGUF:** Trains directly on quantized GGUF models — no HF conversion needed
+- **On-policy by design:** The same model that generates during training is the one being updated (if we used it for on-policy generation)
+**Current status (tested):**
+```
+llama-finetune -m nemotron-Q4_K_M.gguf -f train.txt -c 64
+→ ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 18446744073709547520
+```
+**Crashes** with a near-max uint64 buffer allocation. The finetune code doesn't properly handle NemotronH's Mamba-2 architecture — it computes buffer sizes based on standard transformer assumptions that break for SSM layers. This is a llama.cpp bug, not a fundamental limitation.
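+The requested size is exactly `2**64 - 4096`, which is what a signed value of `-4096` looks like when reinterpreted as an unsigned 64-bit integer. The sketch below only checks that arithmetic. Reading it as a negative size that wrapped around is an interpretation and has not been confirmed against the llama.cpp source.
```python
REQUESTED_BYTES = 18446744073709547520  # value from the error message
UINT64_MODULUS = 2 ** 64


def as_signed_64(value):
    """Reinterpret an unsigned 64-bit value as a signed 64-bit integer."""
    return value - UINT64_MODULUS if value >= 2 ** 63 else value


if __name__ == "__main__":
    print(UINT64_MODULUS - REQUESTED_BYTES)  # distance below 2**64
    print(as_signed_64(REQUESTED_BYTES))     # signed interpretation
```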
+**Viability assessment:**
+- ✅ Binary exists, CLI is full-featured
+- ✅ Would solve off-policy gap (SFT uses same model for forward+backward)
+- ❌ Crashes on NemotronH (buffer size computation bug for Mamba architecture)
+- ❌ SFT only — no GRPO/RL support
+- ⚠️ Even if fixed, SFT was insufficient in previous attempts (scored 2.0-3.2/5)
+### llama.cpp LoRA at Inference Time
+llama.cpp supports loading LoRA adapters at runtime via `--lora` flag. This means:
+- Train a LoRA with any tool (HF, NeMo, etc.)
+- Convert to GGUF LoRA format
+- Load at inference time on top of the base GGUF
+**This could enable on-policy GRPO:** Generate with `base + current LoRA` via llama.cpp server, compute log-probs with HF, update LoRA. But converting LoRA to GGUF every step is expensive.
+---
+## What We Should Have Done
+### Before writing custom GRPO code:
+1. **Check NVIDIA's own training tools first.** NVIDIA published NeMo RL specifically for RL training of Nemotron models. We built a custom off-policy GRPO script on HF transformers instead of using the purpose-built tool.
+2. **Test llama-finetune.** Even if it crashes on NemotronH now, knowing that would have informed the architecture decision. We could have filed a bug or patched the buffer computation.
+3. **Test vLLM generation on GB10.** NeMo RL uses vLLM for generation. If vLLM's Mamba kernels work on SM 12.1, the entire on-policy GRPO pipeline works. We never tested this.
+4. **Consider SFT distillation first.** The base model was already excellent (4.35/5). Instead of RL, we could have generated a large corpus of high-quality completions with the base model and done SFT distillation. This is what "Option B: SFT on curated completions" proposes, and it was always available.
+### The root cause of the root cause:
+We jumped to "build custom GRPO" because we knew the components (llama.cpp for generation, HF for training) and it felt like a clever workaround for the GPU limitation. But we didn't step back and ask: **"How did NVIDIA train this model?"** — the answer (NeMo RL) was in their documentation all along.
+---
+## Updated Alternative Paths
+| Path | Fixes | Risk | Effort |
+|------|-------|------|--------|
+| **NeMo RL on-policy GRPO** | All 6 gaps | vLLM may not work on SM 12.1; Docker build issues | Medium-High |
+| **SFT on curated completions** | Gaps 1-3, 5-6 | SFT previously scored ≤3.2/5; may not beat base | Low |
+| **Fix llama-finetune for NemotronH** | Gap 1 (SFT only) | SFT limitation; patch effort unknown | Medium |
+| **On-policy GRPO with periodic merge** | Gap 1 | Merge→GGUF→reload cycle per N steps; slow | High |
+| **Ship base model** | N/A | Already 4.35/5; may be good enough | None |
+| **vLLM test on GB10** | Determines if NeMo RL viable | May crash like PyTorch did | Low (just a test) |
+---
+## vLLM Test Results (2026-03-23)
+**✅ vLLM 0.18.0 works on GB10 with Nemotron-3-Nano-4B!**
+### Setup
+- vLLM 0.18.0, torch 2.10.0+cu130, transformers 5.3.0
+- `gpu_memory_utilization=0.3` (safe with ComfyUI using ~20 GB)
+- Model loaded in 7.43 GiB, CUDA graphs captured successfully
+- FlashAttention v2 backend selected automatically
+### Compatibility Notes
+- vLLM 0.18.0 requires `transformers<5`, but Nemotron model requires `transformers>=5` (`TokenizersBackend` class). Installing transformers 5.3.0 over vLLM's constraint **works despite the pip warning**.
+- Same SM 12.1 PyTorch warning appears but doesn't block execution.
+### Generation Quality
+The output is **coherent** — proper English, structured thinking, model understands the prompt. This confirms vLLM's Mamba-2 kernels work correctly on SM 12.1, unlike raw PyTorch autoregressive generation.
+### Performance
+- **3.59 tok/s output** — significantly slower than llama.cpp (~60 tok/s Q8, ~100+ tok/s Q4)
+- Startup: ~205 seconds (model loading + torch.compile + CUDA graph capture)
+- This is BF16 (8 GB) vs llama.cpp's Q4 (2.9 GB), so memory bandwidth disadvantage explains much of the speed difference
+### Implications for NeMo RL
+vLLM generation works on GB10 → **NeMo RL on-policy GRPO is likely viable on this hardware.** The slow generation speed (3.59 tok/s) means training would be much slower than llama.cpp-based generation, but it would be **correct** (on-policy, same model generates and trains).
+### Remaining Question — ANSWERED ✅
+**Can vLLM generation + PyTorch training coexist on the same GPU?**
+**Yes — comfortably.** Tested 2026-03-23 with dual model loading:
+#### Test Setup
+- vLLM 0.18.0 in subprocess (`gpu_memory_utilization=0.3`, `max_model_len=1024`)
+- HF model (transformers 5.3.0 native NemotronH) in main process, full BF16
+- AdamW optimizer with full parameter training
+- Forward + backward + optimizer step, then vLLM generation while training model loaded
+#### Memory Results
+| Component | Allocated | Reserved |
+|-----------|-----------|----------|
+| vLLM subprocess (model + KV cache + CUDA graphs) | ~39 GB | (separate process) |
+| HF training model (BF16) | 5.27 GB | 5.32 GB |
+| Peak during backward | 10.62 GB | 20.15 GB |
+| After optimizer.step (states allocated) | 21.16 GB | 27.06 GB |
+| After zero_grad (steady state) | 15.89 GB | 27.06 GB |
+**Total estimated GPU usage: ~55 GB | Free: ~76 GB**
+With LoRA (realistic for NeMo RL): **~39 GB total | ~92 GB free**
+#### Key Observations
+1. **vLLM runs in a separate subprocess** — its memory doesn't appear in the main process's `torch.cuda.memory_allocated()`. Both share the GPU via CUDA's unified memory management.
+2. **vLLM generation works fine while training model holds 27 GB** — no interference, no OOM.
+3. **Generation speed during concurrent load: 4.56 tok/s** (slightly faster than solo, likely CUDA graph warmup effect).
+#### Caveats
+- **Native transformers NemotronH has a config mismatch**: NVIDIA's `hybrid_override_pattern` uses dashes (`M-M-M-MM-...`) which the native `_pattern_to_list` doesn't handle. After patching dash handling, the pattern produced 25 layers instead of 42, causing MISSING/UNEXPECTED weight warnings and wrong loss (13.16 vs expected ~3-4). For real training, either:
+  - Fix the config conversion properly for native transformers, or
+  - Use `trust_remote_code=True` with a pure PyTorch fallback for `causal_conv1d` (no CUDA kernel needed for training, only the forward/backward math)
+- **GB10's nvidia-smi returns [N/A]** for memory stats, so cross-process GPU memory can't be directly measured — estimates based on vLLM's 0.3 utilization setting.
+#### Verdict
+Memory is **not a bottleneck** for on-policy GRPO on GB10. The open questions are now:
+1. Can NeMo RL be installed on GB10? (Docker vs native)
+2. Does NeMo RL's training loop work with the native transformers NemotronH, or does it require the custom code?
+3. What's the end-to-end training throughput given vLLM's ~3.5 tok/s generation speed?
+---
+## Lessons
+1. **Check the vendor's tools before building custom.** NVIDIA published NeMo RL for exactly this use case. We reinvented a broken wheel.
+2. **"Can't do X" should trigger "how does the vendor do X?"** — not "let me build a workaround for X."
+3. **Test alternatives before committing to workarounds.** We spent 10+ hours on off-policy GRPO instead of 15 minutes testing vLLM or llama-finetune.
+4. **The cleverest workaround is often the wrongest approach.** Off-policy GRPO felt smart — two models collaborating! But it violated fundamental RL principles.
+---
+*Created: 2026-03-23*

docs/TRAINING_PLAN_V5.md ADDED Viewed

	@@ -0,0 +1,143 @@

+# Training Plan V5 — Beat Base Model (7.12/10)
+**Goal:** Nemotron 4B fine-tuned > 7.12/10
+**Updated:** 2026-03-26 UTC
+**Status:** GRPO v5 ready to run — all infrastructure proven
+---
+## Current Baseline
+Every fine-tuning attempt so far has scored below base:
+| Approach | Score | Why it failed |
+|---|---|---|
+| LoRA SFT (v1, v2) | 2.00–2.10/5 | LoRA targets only 4 attention layers; 38 Mamba layers untouched |
+| Full SFT (v1) | 2.00/5 | Format mismatch — think-tags buried output |
+| Full SFT (v2, v3) | 3.20/5 | Contaminated data (`user\n` prefixes, low question rate) |
+| SFT v4 Triton (LoRA) | 5.36/10 | Surface pattern-matching, no reasoning preserved |
+| GRPO v3 (off-policy) | gibberish | Generator ≠ learner — 6 critical gaps |
+| GRPO v4 (PyTorch mock) | oscillating loss | rmsnorm mock silently broken |
+**Base model: 7.12/10** — uses `<think>` to construct precise, contextual questions.
+---
+## GRPO v5 — The Current Approach
+All 6 gaps from v3 postmortem fixed:
+### Architecture
+```
+Per step:
+  1. generate_cached_batch(model, n=4)   ← 1× prefill, 4 parallel decodes
+  2. reward_fn() for each completion      ← heuristic: question?, references guest?, brevity?, etc.
+  3. GRPO normalize advantages within group
+  4. policy log-probs: model(comp)        ← with LoRA enabled
+  5. ref log-probs: model(comp)           ← with LoRA disabled (free, no second model)
+  6. loss = -adv * log_prob + kl_coef * KL
+  7. backward() through Triton SSM kernel ← all 42 layers trained
+```
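+Steps 3 and 6 of the loop above can be sketched in plain Python. This is a simplification at sequence level with hypothetical helper names (`group_advantages`, `grpo_loss`); the real loop would operate on per-token log-probs as tensors, and the `eps` term is an assumption to avoid division by zero when all rewards in a group are equal.
```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of completions:
    (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def grpo_loss(advantages, log_probs, kl_values, kl_coef=0.02):
    """Mean over completions of: -adv * log_prob + kl_coef * KL."""
    terms = [
        -adv * lp + kl_coef * kl
        for adv, lp, kl in zip(advantages, log_probs, kl_values)
    ]
    return sum(terms) / len(terms)


if __name__ == "__main__":
    rewards = [0.2, 0.8, 0.5, 0.5]          # one group of n=4 completions
    adv = group_advantages(rewards)
    loss = grpo_loss(adv, [-12.0, -9.0, -10.0, -11.0], [0.1, 0.2, 0.1, 0.1])
    print(adv, loss)
```
+If every completion in a group receives the same reward, all advantages are zero and only the KL term contributes to the loss.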
+### Key design choices
+- **On-policy**: `generate_cached()` uses the same patched Mamba model (Triton kernel) that computes gradients. Zero off-policy gap.
+- **KL without second model**: Disabling PEFT adapter layers = base model forward. No extra memory.
+- **Batched generation**: `generate_cached_batch(n=4)` — one prefill shared across 4 completions. ~4× faster generation than serial calls.
+- **Vectorized SSM state save**: Final state computed as a single tensor op (reverse cumsum), no Python loop across sequence length.
+- **Triton backward**: Full gradient through all 21 Mamba layers + 4 Attention layers.
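+The reverse-cumsum idea behind the vectorized state save can be shown on a scalar recurrence. This is a toy sketch, not the actual kernel: it assumes a recurrence `h_t = exp(a_t) * h_{t-1} + b_t` with `h` starting at zero, whereas the real Mamba-2 state is a per-head tensor. Both function names are hypothetical.
```python
import math
from itertools import accumulate


def final_state_loop(log_decays, inputs):
    """Reference: step the recurrence one position at a time."""
    h = 0.0
    for a, b in zip(log_decays, inputs):
        h = math.exp(a) * h + b
    return h


def final_state_reverse_cumsum(log_decays, inputs):
    """Same final state without stepping through time:
    h_T = sum_t exp(sum_{s>t} a_s) * b_t."""
    # Inclusive reverse cumulative sum: rev[t] = a_t + a_{t+1} + ... + a_T
    rev = list(accumulate(reversed(log_decays)))[::-1]
    # Subtracting a_t leaves the decay applied after position t.
    return sum(math.exp(r - a) * b for r, a, b in zip(rev, log_decays, inputs))


if __name__ == "__main__":
    a = [-0.1, -0.2, -0.3]
    b = [1.0, 2.0, 3.0]
    print(final_state_loop(a, b), final_state_reverse_cumsum(a, b))
```
+On tensors the same computation would be a flipped cumulative sum followed by an exponential and a weighted sum, which removes the Python loop over sequence length.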
+### Config
+```python
+lr              = 5e-5
+kl_coef         = 0.02
+total_steps     = 500
+num_generations = 4       # per prompt
+gen_max_tokens  = 300
+lora_rank       = 64
+lora_alpha      = 256
+```
+### Expected performance
+- **Generation**: ~15-20s (batched prefill + parallel decode, ~16 tok/s cached)
+- **Triton backward**: ~9.5s/step
+- **Total step**: ~25-40s (vs 464s in earlier uncached version, vs 95s in serial cached version)
+- **500 steps**: ~4-6 hours
+---
+## Decision Tree Post-Run
+```
+GRPO v5 result
+    ├── Score > 7.12/10 → ✅ Beat base model! Run full 500 steps, publish.
+    ├── Score 6.5–7.12 → Good progress. Add thinking-chain data (see Phase 2).
+    ├── Score 5.5–6.5  → Similar to SFT. GRPO signal is weak. Try Phase 2 data.
+    └── Score < 5.5    → Something broken. Debug reward/KL balance.
+```
+---
+## Phase 2: Thinking-Chain Data (if GRPO v5 insufficient)
+The base model's edge is `<think>` reasoning. Teach the fine-tuned model to reason:
+1. Run base model on all 7,580 training scenarios via `generate_cached`
+2. Collect `<think>...</think>[question]` responses
+3. Filter: keep responses scoring ≥ 7/10 on 10-score eval
+4. SFT on these thinking-chain examples — distillation from base model's own best outputs
+This is fundamentally different from previous SFTs (which used human transcripts). The training data would be base model completions with explicit reasoning chains — teaching the fine-tuned model *how* to think, not just what format to produce.
+**Estimated corpus**: ~3,000-5,000 examples from 7,580 scenarios
+**Training time**: 3,000 steps × 9.5s = ~8 hours
+---
+## Phase 3: Full Dataset SFT Baseline (optional)
+Previous SFTs used only 1,000 samples. 7,580 are available.
+```python
+SAMPLES    = 7580
+EPOCHS     = 2
+LR         = 1e-4
+BATCH      = 1
+GRAD_ACCUM = 4
+```
+- **Steps**: 3,790 (2 epochs)
+- **Time**: ~10 hours at 9.5s/step
+- **Expected**: 6.0–6.5/10 (extrapolating from full-sft-v4/v5 pattern)
+Run this as a parallel experiment to understand data volume impact independently of GRPO.
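+The step and time estimates for Phases 2 and 3 follow from simple arithmetic, sketched below with hypothetical helpers. It assumes the 9.5s figure applies per counted step, as the estimates above do.
```python
def sft_steps(samples, epochs, batch, grad_accum):
    """Optimizer steps for a run: samples * epochs / (batch * grad_accum)."""
    return samples * epochs // (batch * grad_accum)


def hours(steps, seconds_per_step):
    """Wall-clock estimate in hours."""
    return steps * seconds_per_step / 3600


if __name__ == "__main__":
    phase3_steps = sft_steps(samples=7580, epochs=2, batch=1, grad_accum=4)
    print(phase3_steps, round(hours(phase3_steps, 9.5), 1))  # Phase 3
    print(round(hours(3000, 9.5), 1))                        # Phase 2
```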
+---
+## Infrastructure Status
+| Component | Status | Notes |
+|---|---|---|
+| Triton fwd kernel | ✅ Working | 10.6× forward speedup |
+| Triton bwd kernel | ✅ Working | 5.7× end-to-end speedup, 9.5s/step |
+| SSM state save (vectorized) | ✅ Working | No Python loop, pure tensor op |
+| Cached generation | ✅ Working | ~16 tok/s, O(1) per token |
+| Batched generation (n=4) | ✅ Working | 1× prefill for n completions |
+| GRPO v5 training loop | ✅ Implemented | `grpo_v5_train.py` |
+| 10-score eval | ✅ Working | `scripts/eval_v2.py` |
+| GGUF conversion + eval | ✅ Working | `merge_and_eval.py` |
+---
+## Timeline
+| Step | Time | Outcome |
+|---|---|---|
+| GRPO v5 smoke test (5 steps) | ~5 min | Verify step time, reward signal, loss stability |
+| GRPO v5 eval checkpoint (50 steps) | ~30 min | Merge → GGUF → 10-score eval |
+| GRPO v5 full run (500 steps) | ~4-6 hours | Main result |
+| If needed: thinking-chain data gen | ~3 hours | 7,580 base model completions |
+| If needed: thinking-chain SFT | ~8 hours | Phase 2 fine-tune |
+---
+*Updated: 2026-03-26 | Previous: `docs/CURRENT_STATE_2026-03-23.md`*

docs/TRITON_PIPELINE_FIX.md ADDED Viewed

	@@ -0,0 +1,137 @@

+# Triton Pipeline Fix — NemotronH HF Generation
+Date: 2026-03-27
+Status: ✅ Fixed and verified
+---
+## Problem
+HF model generation with `generate_cached` produced degenerate output:
+- Max logit divergence of 10+ between cached and uncached decode
+- Tokens like `│││││` and `«»«««` repeated endlessly
+- `</think>` never closed naturally
+- Outputs completely incoherent after a few tokens
+Previously attributed to numerical issues (BF16 SSM precision), but those were fixed.
+The real bug was elsewhere.
+---
+## Diagnosis
+Systematic elimination:
+| Component | Test | Result |
+|---|---|---|
+| Prefill formula | vs `mamba_chunk_scan_combined` | ✅ 0.04% error (floating point) |
+| Decode formula | vs `selective_state_update`, 100 steps | ✅ 0.000% error |
+| Layer-by-layer hidden states | cached vs uncached | ❌ Layer 12 diverges 10.89 logits |
+Layer 12 is the **first attention layer**. Layers 0-11 (all Mamba) were fine.
+---
+## Root Cause
+`NemotronHBlock.forward` for attention layers called:
+```python
+# BROKEN (model source code bug):
+hidden_states = self.mixer(
+    hidden_states,
+    cache_position=cache_position
+    # past_key_value NOT PASSED
+)
+```
+`NemotronHAttention.forward` requires `past_key_value` for KV caching. Without it,
+every decode token ran attention over **only itself** — single-token context with no
+history from prefill. This produced garbage at layers 12, 17, 24, 32 (the 4 attention
+layers) and cascaded through all 42 layers.
+---
+## Fix
+Added `_patch_attention_block()` to `tests/validate_correct_scan.py`.
+Called automatically from `patch_mamba_layers()`.
+```python
+def _patch_attention_block(layer):
+    def patched_forward(self, hidden_states, cache_params=None,
+                        cache_position=None, attention_mask=None):
+        ...
+        hidden_states = self.mixer(
+            hidden_states,
+            past_key_value=cache_params,   # ← THE FIX
+            cache_position=cache_position,
+            use_cache=(cache_params is not None),
+        )
+        hidden_states = hidden_states[0]
+        ...
+    layer.forward = patched_forward.__get__(layer, layer.__class__)
+```
+`HybridMambaAttentionDynamicCache` already inherits `DynamicCache.update()` which
+correctly maintains `key_cache[layer_idx]` / `value_cache[layer_idx]`.
+---
+## Result
+| Metric | Before Fix | After Fix |
+|---|---|---|
+| Max logit diff (cached vs uncached) | 10.89 | 0.23 |
+| First diverging layer | 12 | None (all within BF16 noise) |
+| Generation quality | Degenerate (repeating tokens) | Clean, coherent Lex-style questions |
+| `</think>` closing | Never (corrupted states) | Naturally at 200-3000 tokens |
+---
+## Complete Working Pipeline
+```python
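+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from tests.validate_correct_scan import patch_mamba_layers
+from ssm_generate import generate_cached
+MODEL_PATH = 'models/NVIDIA-Nemotron-3-Nano-4B'
+tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    MODEL_PATH,
+    torch_dtype=torch.bfloat16,
+    trust_remote_code=True,
+    device_map='cuda'
+)
+# Apply all fixes in one call:
+# 1. Triton SSM scan for prefill (training + fast inference)
+# 2. fp32 decode step (matches llama.cpp/vLLM precision)
+# 3. Attention KV cache fix (passes past_key_value to NemotronHAttention)
+patch_mamba_layers(model, use_triton=True)
+model.eval()
+# Build input_ids from a chat prompt (example message; replace with your own)
+msgs = [{'role': 'user', 'content': 'Neural networks are simple.'}]
+prompt = tokenizer.apply_chat_template(
+    msgs, tokenize=False, add_generation_prompt=True)
+input_ids = tokenizer(
+    prompt, return_tensors='pt', add_special_tokens=False).input_ids.to(model.device)
+# Generate (now works correctly)
+out = generate_cached(model, tokenizer, input_ids, max_new_tokens=1500)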
+---
+## Why vLLM Was the Workaround
+vLLM's `nemotron_h.py` backend reimplements the attention forward correctly
+with its own KV cache management. It never hit this bug because it replaced
+the entire `NemotronHBlock.forward` with its own implementation.
+With this fix, the HF model + Triton pipeline produces equivalent output to vLLM
+for standard generation tasks.
+---
+## Verified Against vLLM
+Both HF (patched) and vLLM now produce:
+- Clean Lex-style interviewer questions
+- `</think>` closes in 200-3000 tokens depending on prompt
+- Consistent quality across diverse guests and topics
+The HF model is now suitable for on-policy GRPO generation without vLLM dependency.

docs/TRITON_SSM_SCAN_PLAN.md ADDED

	@@ -0,0 +1,114 @@

+# Triton SSM Scan Kernel for SM 12.x (GB10 Blackwell)
+**Status:** ✅ SUPERSEDED — Custom Triton kernel no longer needed.
+**Date superseded:** 2026-03-30
+**Original completion:** 2026-03-25/26
+---
+## ⚠️ SUPERSEDED — Read This First
+The custom Triton SSM scan kernel was built to work around the inability to use compiled `mamba_ssm` extensions on GB10 (SM 12.1). **That blocker no longer exists.**
+As of 2026-03-30, we successfully compiled `mamba_ssm 2.3.1` from source against `torch 2.10.0+cu130` in `.venv-train`. The compiled Triton kernels (`mamba_chunk_scan_combined`, `selective_state_update`) now run natively on GB10.
+### Validation (2026-03-30)
+```
+Seq=512, A=-0.1 (moderate):  nan=False, inf=False, max=155  ✅
+Seq=512, A=-10  (extreme):   nan=False, inf=False, max=200  ✅
+Forward + backward: 263 params with gradients, grad_norm=0.0212 ✅
+```
+No BF16 underflow issues observed. The compiled kernel uses chunked scan with internal float32 accumulation for the state update, preventing the catastrophic underflow that affected the naive PyTorch implementation.
+### What Was Wrong Originally
+The original GB10 problem (2026-03-24):
+- Pre-compiled `causal_conv1d` and `mamba_ssm` `.so` files targeted SM ≤ 12.0
+- Our aarch64 binaries had wrong architecture or ABI mismatches
+- Workaround: custom sequential scan + custom Triton kernel
+### What Fixed It
+```bash
+# In .venv-train (Python 3.12, torch 2.10+cu130):
+pip install git+https://github.com/state-spaces/mamba.git \
+  --no-build-isolation --force-reinstall --no-deps
+```
+Compiling from source builds the extension against the installed torch, and the Triton kernels are JIT-compiled for SM 12.1 at runtime. Three things mattered:
+1. Having the right torch (CUDA build, matching ABI)
+2. Compiling mamba_ssm from source (not using pre-built wheels)
+3. Setting `LD_LIBRARY_PATH` so `libc10.so` is found by the dynamic linker
+---
+## Original Plan (Archived for Reference)
+The original plan was to build a custom Triton kernel because:
+- `mamba_ssm` compiled extensions failed on GB10
+- The pure-PyTorch sequential scan was correct but slow (~50-55s/step)
+- Triton JIT compiles at runtime → supports any SM including 12.x
+**This plan was executed and produced a working kernel (5.7x speedup)** stored in `ssm_scan_triton.py` and `tests/validate_correct_scan.py`. The `patch_mamba_layers()` function in `validate_correct_scan.py` uses this custom kernel.
+### Current Recommendation
+**Do NOT use `patch_mamba_layers()` for new training runs.** Use the compiled `mamba_ssm` directly via `.venv-train`. The custom patch was a workaround that is now unnecessary and adds complexity.
+The `patch_mamba_layers()` code remains available for:
+- Debugging/comparison purposes
+- Environments where compiling mamba_ssm from source is not possible
+---
+## What We Learned
+### 1. Pre-compiled wheels are architecture/ABI specific
+PyPI wheels for `mamba_ssm` and `causal_conv1d` are compiled against specific SM targets and torch ABI versions. When torch version changes or GPU architecture is new, they break silently.
+**Rule:** Always compile `mamba_ssm` from source in new environments (`--no-build-isolation`).
+### 2. LD_LIBRARY_PATH is the key unlock
+`selective_scan_cuda.cpython-312-aarch64-linux-gnu.so` links against `libc10.so` which lives inside `torch/lib/`. Without LD_LIBRARY_PATH pointing there, the import fails with "undefined symbol" — even though the .so exists.
+```bash
+TORCH_LIB=$(python3 -c "import torch; from pathlib import Path; print(Path(torch.__file__).parent/'lib')")
+export LD_LIBRARY_PATH="$TORCH_LIB:${LD_LIBRARY_PATH:-}"
+```
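A quick way to confirm the export took effect is to check it from Python. This is a minimal sketch using only the standard library; `torch_lib_dir` and `on_library_path` are hypothetical helper names, and the check assumes torch bundles its shared libraries in a `lib` directory next to its `__init__.py`, as the shell snippet above does.

```python
import importlib.util
import os
from pathlib import Path

def torch_lib_dir():
    """Return torch's bundled lib directory, or None if torch is not installed."""
    spec = importlib.util.find_spec("torch")  # locates the package without importing it
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent / "lib"

def on_library_path(directory, ld_library_path):
    """True if directory is one of the entries of an LD_LIBRARY_PATH string."""
    entries = [entry for entry in ld_library_path.split(os.pathsep) if entry]
    return str(directory) in entries

lib_dir = torch_lib_dir()
if lib_dir is None:
    print("torch is not installed in this interpreter")
else:
    found = on_library_path(lib_dir, os.environ.get("LD_LIBRARY_PATH", ""))
    print(f"{lib_dir} on LD_LIBRARY_PATH: {found}")
```

This only reports the state of the environment. The variable still needs to be exported in the shell before Python starts, since changing it from inside a running process generally does not affect how the dynamic linker resolves libraries.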
+### 3. The BF16 underflow problem was real but already solved
+Our original `ssm_scan_correct.py` fix was correct: `exp(cumsum(A))` in BF16 over long sequences → catastrophic underflow. The compiled `mamba_chunk_scan_combined` kernel handles this via chunked computation with fp32 accumulation at chunk boundaries. Verified on GB10 with extreme A=-10, seq_len=512: no NaN/Inf.
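The underflow itself is easy to reproduce without a GPU. The sketch below uses ordinary Python floats (float64), which underflow later than BF16 or FP32 but fail the same way at A=-10 and seq_len=512. It contrasts a factorized decay, exp(c_i) / exp(c_j), with the difference form exp(c_i - c_j). It is an illustration of the numerical issue only and makes no claim about how the compiled kernel is implemented; the function names are hypothetical.

```python
import math
from itertools import accumulate

def decay_factorized(cumsum, i, j):
    """exp(c_i) / exp(c_j): breaks once both exponentials underflow to 0."""
    numerator, denominator = math.exp(cumsum[i]), math.exp(cumsum[j])
    return numerator / denominator if denominator != 0.0 else float("nan")

def decay_difference(cumsum, i, j):
    """exp(c_i - c_j): the exponent stays small for nearby positions."""
    return math.exp(cumsum[i] - cumsum[j])

seq_len, a = 512, -10.0
cumsum = list(accumulate([a] * seq_len))  # -10, -20, ..., -5120

i, j = 511, 510  # decay across a single step; the true value is exp(-10)
print("factorized:", decay_factorized(cumsum, i, j))
print("difference:", decay_difference(cumsum, i, j))
```

The factorized form returns NaN because both exponentials have already collapsed to zero, while the difference form recovers exp(-10). Keeping exponents local to a short span is one reason chunked formulations are less exposed to this problem.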
+### 4. Custom Triton kernel still valuable for pure-PyTorch environments
+If you can't compile mamba_ssm from source (e.g., no nvcc, CI environment), the custom `patch_mamba_layers()` approach works. But for production training, the compiled kernel is faster and more battle-tested.
+---
+## Performance Comparison (Final)
+| Approach | Speed | Correctness | Complexity |
+|----------|-------|-------------|------------|
+| PyTorch sequential scan (original workaround) | ~55s/step | ✅ | Low |
+| Custom Triton kernel (`patch_mamba_layers`) | ~9.5s/step | ✅ | High |
+| **Compiled mamba_ssm (current)** | **~8.5s/step** | **✅** | **Low** |
+Compiled mamba_ssm is the clear winner: slightly faster than custom Triton, same correctness, zero custom code to maintain.
+---
+## Files (Status)
+| File | Status | Notes |
+|------|--------|-------|
+| `ssm_scan_correct.py` | Archive | Original sequential scan — reference only |
+| `ssm_scan_triton.py` | Archive | Custom Triton kernel — workaround, no longer needed |
+| `tests/validate_correct_scan.py` | Keep | Contains `patch_mamba_layers()` for legacy use |
+| `ssm_scan_backward.py` | Archive | Custom backward — superseded |
+| `.venv-train` | Active | Compiled mamba_ssm — USE THIS |
+---
+*Original plan created: 2026-03-25*
+*Superseded: 2026-03-30 — compiled mamba_ssm works natively on GB10*

docs/VENV_SETUP.md ADDED

	@@ -0,0 +1,118 @@

+# Venv Setup — Lex Fridman Interviewer Project
+Updated: 2026-03-30
+## The Three Venvs
+### `.venv-train` — SFT Training (PRIMARY)
+```bash
+# Activate with required LD_LIBRARY_PATH:
+TORCH_LIB=/home/bobber/lex-ft/.venv-train/lib/python3.12/site-packages/torch/lib
+export LD_LIBRARY_PATH="$TORCH_LIB:${LD_LIBRARY_PATH:-}"
+source /home/bobber/lex-ft/.venv-train/bin/activate
+```
+| Package | Version | Notes |
+|---------|---------|-------|
+| Python | 3.12.3 | |
+| torch | 2.10.0+cu130 | CUDA enabled, installed with `--index-url https://download.pytorch.org/whl/cu130` |
+| unsloth | 2026.3.17 | 2x faster training, memory optimizations |
+| mamba_ssm | 2.3.1 | **Compiled from source** against torch 2.10 — real Triton kernels |
+| transformers | 4.48.3 | |
+| datasets | 4.8.4 | |
+| accelerate | 1.13.0 | |
+| wandb | 0.25.1 | |
+| trl | 0.15.2 | |
+**Why LD_LIBRARY_PATH:** mamba_ssm's compiled `.so` links against `libc10.so` which is inside `torch/lib/`. The system linker doesn't find it without this path set.
+**Why NOT to use the routangseng venv for training:** `routangseng/.venv` runs Python 3.13 with a broken mamba_ssm build (importing `selective_scan_cuda` fails with an undefined-symbol error from an ABI mismatch).
+### `.venv-vllm` — Inference & Evaluation
+```bash
+source /home/bobber/lex-ft/.venv-vllm/bin/activate
+```
+| Package | Version | Notes |
+|---------|---------|-------|
+| Python | 3.12 | |
+| torch | 2.9+cu130 | |
+| vllm | 0.18.0 | Fast batch inference |
+| transformers | 5.3.0 | Has NemotronH built-in |
+| peft | 0.18.1 | For LoRA inference |
+| sentence_transformers | 5.3.0 | For embedding similarity |
+| datasets | | |
+| wandb | | |
+**Use for:** vLLM inference, `eval_functional_judge.py`, `judge_vllm.py`, `augment_with_base_model.py`
+**Note:** transformers 5.3.0 was manually upgraded from 4.57.6 (`pip install transformers==5.3.0`). vllm warns about incompatibility but works.
+### `routangseng/.venv` — Legacy / Qwen
+```bash
+source /home/bobber/routangseng-ft/.venv/bin/activate
+```
+Used for: Qwen3.5-4B inference (judge), old SFT scripts. Not recommended for Nemotron SFT.
+---
+## Common Issues & Fixes
+### ImportError: selective_scan_cuda undefined symbol
+**Cause:** mamba_ssm `.so` compiled against different torch ABI
+**Fix:** Recompile from source: `pip install git+https://github.com/state-spaces/mamba.git --no-build-isolation --force-reinstall --no-deps`
+### PermissionError: /home/bobber/.cache/huggingface/modules/...
+**Cause:** Root-owned HF modules cache
+**Fix:** Set `HF_MODULES_CACHE=/home/bobber/lex-ft/.cache/hf_modules` before any HF imports
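In a script, the ordering requirement looks roughly like this. The sketch assumes the cache path from the fix above; `setdefault` is used so that a value already exported in the shell is not overwritten.

```python
import os

# Path taken from the fix above; adjust it for a different checkout.
os.environ.setdefault("HF_MODULES_CACHE", "/home/bobber/lex-ft/.cache/hf_modules")

# Import transformers only after the variable is set, for example:
# from transformers import AutoModelForCausalLM
print(os.environ["HF_MODULES_CACHE"])
```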
+### torch.cuda.is_available() = False despite CUDA torch installed
+**Cause:** Missing `LD_LIBRARY_PATH` for `libcuda.so`/`libc10.so`
+**Fix:** Set `LD_LIBRARY_PATH` to torch's lib directory before running
+### AttributeError: 'list' object has no attribute 'keys' (NemotronH + transformers)
+**Cause:** `_get_tied_weight_keys` bug in transformers for NemotronH
+**Fix:** Monkey-patch before model load (see `train_sft_v5.py` top of file)
+### 'qwen3_5' KeyError in transformers 4.57.6
+**Cause:** transformers 4.57.6 doesn't know Qwen3.5 architecture
+**Fix:** Use routangseng/.venv (transformers 5.3.0) or .venv-vllm (upgraded to 5.3.0)
+---
+## Rebuild .venv-train From Scratch
+```bash
+# 1. Create venv (later steps use paths relative to the project root)
+cd /home/bobber/lex-ft
+python3.12 -m venv .venv-train
+# 2. Install CUDA torch
+.venv-train/bin/pip install torch==2.10.0+cu130 \
+  --index-url https://download.pytorch.org/whl/cu130
+# 3. Set LD_LIBRARY_PATH for subsequent installs
+export LD_LIBRARY_PATH="/home/bobber/lex-ft/.venv-train/lib/python3.12/site-packages/torch/lib:${LD_LIBRARY_PATH:-}"
+# 4. Install training stack
+.venv-train/bin/pip install unsloth transformers datasets accelerate trl wandb
+# 5. Compile mamba_ssm from source (needed for NemotronH Mamba-2 layers)
+.venv-train/bin/pip install git+https://github.com/state-spaces/mamba.git \
+  --no-build-isolation --force-reinstall --no-deps
+# 6. Verify
+.venv-train/bin/python3 -c "
+import torch; print('cuda:', torch.cuda.is_available())
+import mamba_ssm; from mamba_ssm.ops.triton.ssd_combined import mamba_chunk_scan_combined
+print('ssd_combined:', 'REAL' if mamba_chunk_scan_combined else 'None')
+from unsloth import FastLanguageModel; print('unsloth: OK')
+"
+```
+---
+*Created: 2026-03-30 04:35 UTC*

docs/VLLM_SETUP_NOTES.md ADDED

	@@ -0,0 +1,146 @@

+# vLLM Setup Notes — DGX Spark (GB10, aarch64)
+Date: 2026-03-27
+Status: ✅ Working — vLLM 0.18.0 with NemotronH
+---
+## Previous Failure (v7 era)
+We tried vLLM earlier and it failed. The failure was caused by the same CUDA path issue as mamba-ssm:
+- `pip install vllm` pulled `torch 2.10.0+cpu` (CPU-only wheel) from default PyPI
+- No `libtorch_cuda.so` → `ImportError` on first import
+- We gave up and used llama.cpp server instead (off-policy GRPO)
+## What Works Now
+vLLM 0.18.0 loads `NemotronH` natively via its own `nemotron_h.py` backend.
+**No `mamba-ssm` required** — vLLM has its own Mamba-2 implementation (`mamba2.py`).
+Confirmed working:
+- `</think>` closes naturally (P(</think>) is non-trivial with vLLM's kernel)
+- Batch generation works (multiple prompts in one call)
+- Output quality is good: clean Lex-style questions, correct structure
+Example output:
+```
+✅ [Andrej Karpathy] think_end=2384 ntok=668
+   "That's a great observation — if I gave you a single cat image right now,
+   how would a neural network actually recognize it?"
+✅ [Elon Musk] think_end=3527 ntok=791
+   "When comparing AI risks to climate change, what specific mechanisms do
+   you see as making AI a greater existential threat?"
+✅ [A quantum physicist] think_end=1630 ntok=392
+   "If time emerges from entanglement, does that imply that the flow of
+   time is an illusion, or is there a deeper emergent structure?"
+```
+---
+## Installation
+### Separate venv (required — don't pollute .venv-train)
+```bash
+cd /home/bobber/lex-ft
+python3 -m venv .venv-vllm
+source .venv-vllm/bin/activate
+# Step 1: Install CUDA torch FIRST (must use cu130 index, not default PyPI)
+pip install torch==2.10.0+cu130 \
+    --index-url https://download.pytorch.org/whl/cu130
+# Step 2: Install vLLM (uses the torch already installed)
+CUDA_HOME=/usr/local/cuda-13.0 \
+PATH=/usr/local/cuda-13.0/bin:$PATH \
+pip install vllm
+```
+**Do NOT** run `pip install vllm` without installing CUDA torch first — PyPI will pull the CPU-only torch wheel.
+### Verify
+```python
+import torch
+print(torch.__version__)   # 2.10.0+cu130
+print(torch.cuda.is_available())  # True
+import vllm
+print(vllm.__version__)    # 0.18.0
+```
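The same check can be scripted so a CPU-only wheel is caught early. This is a sketch that inspects only the version string; `local_build_tag` and `is_cuda_build` are hypothetical helpers, and the `+cu130` / `+cpu` suffixes are the ones quoted in these notes.

```python
def local_build_tag(version):
    """Return the part after '+' in a version string, e.g. 'cu130' or 'cpu'."""
    _, _, tag = version.partition("+")
    return tag

def is_cuda_build(version):
    """True when the local build tag marks a CUDA wheel (tag starts with 'cu')."""
    return local_build_tag(version).startswith("cu")

for version in ("2.10.0+cu130", "2.10.0+cpu"):
    print(version, "->", "CUDA build" if is_cuda_build(version) else "not a CUDA build")
```

Pass `torch.__version__` in a real environment. A version string without any `+` suffix is reported as non-CUDA here, so `torch.version.cuda` and `torch.cuda.is_available()` remain the more reliable checks.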
+---
+## Loading NemotronH
+```python
+from vllm import LLM, SamplingParams
+llm = LLM(
+    model='models/NVIDIA-Nemotron-3-Nano-4B',
+    trust_remote_code=True,
+    max_model_len=4096,
+    gpu_memory_utilization=0.55,  # leaves ~57GB for training venv
+    dtype='bfloat16',
+)
+```
+Cold start: ~80s (loads 4B safetensors). Warm: instant.
+GPU memory: with `gpu_memory_utilization=0.55`, vLLM uses ~70GB, leaving ~57GB for the HF training model.
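The split follows from simple arithmetic. The sketch below assumes 128 GB of total memory, which is what the ~70GB and ~57GB figures add up to; that total and the helper name are assumptions, not values read from the device.

```python
TOTAL_MEMORY_GB = 128  # assumption: implied by ~70GB (vLLM) + ~57GB (remaining)

def split_budget(total_gb, gpu_memory_utilization):
    """Return (GB reserved by vLLM, GB left for other processes)."""
    vllm_gb = total_gb * gpu_memory_utilization
    return vllm_gb, total_gb - vllm_gb

vllm_gb, remaining_gb = split_budget(TOTAL_MEMORY_GB, 0.55)
print(f"vLLM: {vllm_gb:.1f} GB, remaining: {remaining_gb:.1f} GB")
```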
+---
+## Generation
+```python
+from transformers import AutoTokenizer
+import re
+tok = AutoTokenizer.from_pretrained(
+    'models/NVIDIA-Nemotron-3-Nano-4B', trust_remote_code=True)
+# enable_thinking=True → prompt ends with <think>\n
+# Model generates thinking then closes </think> naturally
+msgs = [
+    {'role': 'system', 'content': 'You are a Lex Fridman interviewer.\n\nGuest: Andrej Karpathy'},
+    {'role': 'user',   'content': 'Neural networks are simple.'}
+]
+prompt = tok.apply_chat_template(
+    msgs, tokenize=False, add_generation_prompt=True, enable_thinking=True)
+params = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=1500)
+outputs = llm.generate([prompt], params)
+raw = outputs[0].outputs[0].text
+te = raw.find('</think>')
+# Keep only the text after </think>, with special tokens such as <|...|> stripped
+answer = re.sub(r'<\|[^>]+\|>', '', raw[te + len('</think>'):]).strip() if te >= 0 else ''
+```
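The post-processing step can be pulled into a small function that is testable without loading a model. This sketch mirrors the parsing above; `extract_answer` is a hypothetical name, and `<|im_end|>` in the sample string is only a placeholder for whatever special tokens the tokenizer emits.

```python
import re

THINK_END = "</think>"
SPECIAL_TOKEN = re.compile(r"<\|[^>]+\|>")

def extract_answer(raw):
    """Return the text after the first </think>, or '' if thinking never closed."""
    end = raw.find(THINK_END)
    if end < 0:
        return ""
    return SPECIAL_TOKEN.sub("", raw[end + len(THINK_END):]).strip()

print(extract_answer("reasoning...</think>\nWhat surprised you most?<|im_end|>"))
print(repr(extract_answer("reasoning that never closes")))
```

Returning an empty string for unclosed thinking keeps truncated generations easy to filter out downstream.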
+For `enable_thinking=False` (no-think mode):
+```python
+prompt = tok.apply_chat_template(
+    msgs, tokenize=False, add_generation_prompt=True, enable_thinking=False)
+# Prompt ends with <think></think> — model answers directly, no thinking phase
+```
+---
+## Key Notes
+- vLLM uses its own Mamba-2 kernel (not mamba-ssm) — no mamba-ssm needed in .venv-vllm
+- The `nemotron_h.py` model backend handles the hybrid Mamba-2 + attention architecture
+- Warning `Add 3 padding layers, may waste at most 14.29% KV cache memory` is expected — harmless
+- `gpu_memory_utilization` must be set explicitly; the default of 0.9 will OOM once the training process also loads its model
+- For GRPO: use vLLM in one process, HF model in another (or sequential with GPU transfer)
+---
+## venv Summary
+| venv | Purpose | torch | key packages |
+|---|---|---|---|
+| `.venv-train` | HF training (LoRA + optimizer) | 2.11.0+cu130 | mamba-ssm 2.3.1, causal-conv1d 1.6.1, bitsandbytes |
+| `.venv-vllm` | vLLM generation | 2.10.0+cu130 | vllm 0.18.0, flashinfer 0.6.6 |