---
language:
- en
license: other
tags:
- nemotron
- onnx
- webgpu
- interview
- lex-fridman
- grpo
- transformers.js
pipeline_tag: text-generation
---

# lex-interviewer-nemotron-4b-grpo-v21

**Nemotron-3-Nano-4B** fine-tuned with GRPO to conduct Lex Fridman–style interviews. Deployed as a **WebGPU Q4 ONNX model** for in-browser inference via [transformers.js](https://huggingface.co/docs/transformers.js).

---

## Checkpoint: GRPO v21

This is the best-performing checkpoint from a series of GRPO experiments on the Lex Fridman interviewer task.

| Metric | Value |
|---|---|
| Thinking-enabled functional eval | **0.867 ± 0.231** |
| on_topic | 84% |
| uses_guest | 80% |
| probing | 96% |

Significantly outperforms the base Nemotron-3-Nano-4B model (0.760) and all prior fine-tuned checkpoints.

---

## What this model does

Given a guest's statement, the model asks one focused, incisive follow-up question that:

- uses the guest's specific vocabulary
- probes the reasoning or implication behind what they said
- ends with exactly one question mark

It uses **Nemotron's extended thinking** (`enable_thinking: true`) to reason before generating the question.

---

## Why GRPO v21 succeeded

Measured across v21, v22, v23, v24 experiments:

```
GRPO_success = P(at least 1 zero per group) ≈ 0.25–0.35
             × hard binary reward gate (clear zeros vs. 0.7+ goods)
             × starting below the reward optimum
```

GRPO learns from **contrast**, not from correctness. v21 hit the Goldilocks zone:

- ~32% of training steps had at least one clipped/failed completion → high intra-group std
- reward_v12's hard gate (fail = exactly 0.0, pass = 0.7+) maximized advantage magnitude
- starting from `sft-lora-v2-native` left room to climb

Full analysis: `docs/GRPO_V21_SUCCESS_ANALYSIS.md` in `bobber/lex-fridman-interviewer-project`.

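The contrast argument above can be illustrated with the group-relative advantage used by GRPO-style methods, where each completion's reward is normalized by the mean and standard deviation of its group. This is a minimal sketch, not the project's training code: the function name `groupAdvantages`, the `eps` value, the group size of 4, and the reward values are all assumptions chosen for illustration.

```js
// Group-relative advantage, as commonly defined for GRPO-style training:
//   A_i = (r_i - mean(r)) / (std(r) + eps)
// `eps` is an assumed stabiliser; the value used in the project is not stated.
function groupAdvantages(rewards, eps = 1e-4) {
  const n = rewards.length;
  const mean = rewards.reduce((sum, r) => sum + r, 0) / n;
  const variance = rewards.reduce((sum, r) => sum + (r - mean) ** 2, 0) / n;
  const std = Math.sqrt(variance);
  return rewards.map((r) => (r - mean) / (std + eps));
}

// Hypothetical groups of 4 completions scored by a hard-gated reward
// (fail = exactly 0.0, pass = 0.7 or higher).
const allPass = [0.75, 0.75, 0.75, 0.75]; // no contrast inside the group
const oneFail = [0.0, 0.75, 0.75, 0.75];  // one clear zero

console.log('all pass:', groupAdvantages(allPass));
console.log('one fail:', groupAdvantages(oneFail));
```

When every completion in a group passes with the same score, all advantages are zero and the step carries no learning signal. A single hard zero gives the failed completion a strongly negative advantage and the passing ones a positive advantage, which is the intra-group contrast the list above describes.
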

---

## ONNX export details

Built using the LoRA-only patching strategy from the project retrospective:

- **Reference base:** `onnx-community/NVIDIA-Nemotron-3-Nano-4B-BF16-ONNX` (Q4 format)
- **Patched layers:** only the 50 LoRA target weight groups (`q/k/v/o_proj`, `up/down/gate_proj`)
- **Preserved from reference:** all Mamba layers, embedding, lm_head (prevents WebGPU precision regression)
- **Quantization:** asymmetric uint4 block quantization (MatMulNBits, block_size=32)

Scripts: `scripts/merge_lora_v21.py`, `scripts/patch_q4_loraonly.py` in the project repo.

---

## Usage (transformers.js)

```js
import { pipeline } from '@huggingface/transformers';

// Load the Q4 ONNX weights and run them on the WebGPU backend.
// Top-level await requires an ES module context.
const interviewer = await pipeline(
  'text-generation',
  'bobber/lex-interviewer-nemotron-4b-grpo-v21',
  { dtype: 'q4', device: 'webgpu' }
);

// The system prompt carries the interviewer instructions and the guest name.
const messages = [
  { role: 'system', content: 'You are an expert podcast interviewer...\n\nGuest: Andrej Karpathy' },
  { role: 'user', content: 'What is your next question?' }
];

// Sample with thinking enabled so the model reasons before asking its question.
const result = await interviewer(messages, {
  max_new_tokens: 800,
  do_sample: true,
  temperature: 0.7,
  chat_template_kwargs: { enable_thinking: true }
});
```

---

## Live demo

[bobber/lex-interviewer-chat](https://huggingface.co/spaces/bobber/lex-interviewer-chat): runs entirely in your browser via WebGPU.

---

## Related

- Project repo & docs: `bobber/lex-fridman-interviewer-project`
- GRPO v21 success analysis: `docs/GRPO_V21_SUCCESS_ANALYSIS.md`
- ONNX retrospective: `docs/ONNX_RETROSPECTIVE.md`