---
language:
- en
license: other
tags:
- nemotron
- onnx
- webgpu
- interview
- lex-fridman
- grpo
- transformers.js
pipeline_tag: text-generation
---

# lex-interviewer-nemotron-4b-grpo-v21

**Nemotron-3-Nano-4B** fine-tuned with GRPO to conduct Lex Fridman–style interviews.  
Deployed as a **WebGPU Q4 ONNX model** for in-browser inference via [transformers.js](https://huggingface.co/docs/transformers.js).

---

## Checkpoint: GRPO v21

This is the best-performing checkpoint from a series of GRPO experiments on the Lex Fridman interviewer task.

| Metric | Value |
|---|---|
| Thinking-enabled functional eval | **0.867 ± 0.231** |
| `on_topic` | 84% |
| `uses_guest` | 80% |
| `probing` | 96% |

Outperforms the base Nemotron-3-Nano-4B model (0.760 → 0.867, a gain of 0.107) and all prior fine-tuned checkpoints.

---

## What this model does

Given a guest's statement, the model asks one focused, incisive follow-up question that:
- uses the guest's specific vocabulary
- probes the reasoning or implication behind what they said
- ends with exactly one question mark

It uses **Nemotron's extended thinking** (`enable_thinking: true`) to reason before generating the question.
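The formatting part of this contract (exactly one question mark, at the end) is easy to check mechanically. Below is a minimal sketch in plain JavaScript. It assumes the thinking block has already been removed from the text, and `isSingleQuestion` is a hypothetical helper, not part of the project's eval code.

```js
// Hypothetical helper: checks the "ends with exactly one question mark" rule.
// Assumes `text` is the final answer with any thinking block already stripped.
function isSingleQuestion(text) {
  const trimmed = text.trim();
  // Count every '?' in the answer; the rule allows exactly one, at the very end.
  const questionMarks = (trimmed.match(/\?/g) || []).length;
  return questionMarks === 1 && trimmed.endsWith('?');
}

console.log(isSingleQuestion('What made you leave Tesla to focus on education?')); // true
console.log(isSingleQuestion('Why? And how?')); // false
```

The vocabulary and probing criteria are semantic and are not captured by a check like this.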

---

## Why GRPO v21 succeeded

Based on the v21, v22, v23, and v24 experiments, success roughly factors as:

```
GRPO_success = P(at least 1 zero per group) ≈ 0.25–0.35
             × hard binary reward gate (clear zeros vs. 0.7+ goods)
             × starting below the reward optimum
```
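The first factor can be related to a per-completion failure rate. Assume each of the G completions in a group independently fails (scores exactly 0.0) with probability p. Then the chance that a group contains at least one zero is `1 - (1 - p)^G`. The sketch below is illustrative only. Both the independence assumption and the example values of p and G are hypothetical, not the v21 training configuration.

```js
// Probability that a group of `groupSize` completions contains at least one
// zero-reward completion, assuming independent failures with rate `failRate`.
function probAtLeastOneZero(failRate, groupSize) {
  return 1 - Math.pow(1 - failRate, groupSize);
}

// Inverse: the per-completion failure rate needed to reach a target probability.
function failRateForTarget(targetProb, groupSize) {
  return 1 - Math.pow(1 - targetProb, 1 / groupSize);
}

// Hypothetical example values, not the actual v21 settings.
console.log(probAtLeastOneZero(0.1, 4).toFixed(3)); // "0.344"
console.log(failRateForTarget(0.3, 8).toFixed(3));
```

Under these assumptions, a group size of 4 with roughly a 10% failure rate already lands inside the 0.25–0.35 band. Larger groups need a lower per-completion failure rate to stay in that band.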

GRPO learns from **contrast** between completions in the same group, not from absolute correctness. v21 hit the Goldilocks zone:
- ~32% of training steps had at least one clipped/failed completion → high intra-group reward std
- `reward_v12`'s hard gate (fail = exactly 0.0, pass = 0.7+) maximized the magnitude of the group-relative advantages
- starting from `sft-lora-v2-native` left headroom below the reward optimum
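The contrast argument follows from how GRPO normalizes rewards within a group. Each completion's advantage is its reward minus the group mean, divided by the group standard deviation. A group where every completion scores the same therefore contributes no gradient signal. Below is a minimal sketch of that normalization. The `eps` guard, the choice of population (rather than sample) standard deviation, and the example rewards are illustrative assumptions, not taken from the v21 training code.

```js
// Group-relative advantages as in GRPO: A_i = (r_i - mean(r)) / (std(r) + eps).
// Uses the population standard deviation; some trainers use the sample estimate.
function groupAdvantages(rewards, eps = 1e-4) {
  const n = rewards.length;
  const mean = rewards.reduce((sum, r) => sum + r, 0) / n;
  const variance = rewards.reduce((sum, r) => sum + (r - mean) ** 2, 0) / n;
  const std = Math.sqrt(variance);
  return rewards.map((r) => (r - mean) / (std + eps));
}

// Identical rewards: zero std, so every advantage is 0 and the group teaches nothing.
console.log(groupAdvantages([0.75, 0.75, 0.75, 0.75]));
// A hard-gated failure (0.0) among passes (0.7+): the failure gets a large negative
// advantage and each pass gets a positive one.
console.log(groupAdvantages([0.0, 0.8, 0.75, 0.9]));
```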

Full analysis: `docs/GRPO_V21_SUCCESS_ANALYSIS.md` in `bobber/lex-fridman-interviewer-project`.

---

## ONNX export details

Built with the LoRA-only patching strategy described in the project's ONNX retrospective:

- **Reference base:** `onnx-community/NVIDIA-Nemotron-3-Nano-4B-BF16-ONNX` (Q4 format)
- **Patched layers:** only the 50 LoRA target weight groups (`q/k/v/o_proj`, `up/down/gate_proj`)
- **Preserved from reference:** all Mamba layers, the embedding, and `lm_head` (keeping these avoids a WebGPU precision regression)
- **Quantization:** asymmetric uint4 block quantization (MatMulNBits, block_size=32)
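Asymmetric uint4 block quantization stores a float scale and an integer zero point for each block of 32 consecutive weights, and maps each weight to an integer code in [0, 15]. Conceptually, the kernel reconstructs each weight as `(q - zeroPoint) * scale`. The sketch below shows that round trip for a single block. It is an illustration, not ONNX Runtime's quantizer: it omits nibble packing and the tensor layout `MatMulNBits` expects. Including 0 in the range (so that zero is exactly representable) is a design choice of this sketch.

```js
// Conceptual asymmetric uint4 quantization of one block (e.g. 32 weights).
// Not ONNX Runtime's implementation: no nibble packing, no tensor layout.
function quantizeBlockUint4(block) {
  const min = Math.min(0, ...block); // include 0 so it maps to an exact code
  const max = Math.max(0, ...block);
  const scale = (max - min) / 15 || 1; // guard against an all-zero block
  const zeroPoint = Math.min(15, Math.max(0, Math.round(-min / scale)));
  const q = block.map((w) =>
    Math.min(15, Math.max(0, Math.round(w / scale) + zeroPoint))
  );
  return { q, scale, zeroPoint };
}

// Dequantize: w ≈ (q - zeroPoint) * scale.
function dequantizeBlockUint4({ q, scale, zeroPoint }) {
  return q.map((v) => (v - zeroPoint) * scale);
}

// Example: one block of 32 small synthetic weights.
const weights = Array.from({ length: 32 }, (_, i) => Math.sin(i) * 0.05);
const encoded = quantizeBlockUint4(weights);
const restored = dequantizeBlockUint4(encoded);
const maxError = Math.max(...weights.map((w, i) => Math.abs(w - restored[i])));
console.log(encoded.zeroPoint, encoded.scale, maxError);
```

The per-block error is bounded by half the block's scale, so small blocks (32 weights) keep the scale tight even when other parts of the tensor contain outliers.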

Scripts: `scripts/merge_lora_v21.py`, `scripts/patch_q4_loraonly.py` in the project repo.
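The LoRA-only strategy has two steps. First, merge the adapter into each targeted projection using the standard LoRA update `W' = W + (alpha / r) · B · A`. Second, overwrite only those tensors in the reference model, leaving everything else as the reference had it. Below is a toy sketch of that logic on small dense arrays. The tensor names, the regex, and the `mergeLora` / `patchLoraOnly` helpers are hypothetical and do not reproduce the actual scripts. Re-quantizing the merged tensors to Q4 is also omitted.

```js
// Hypothetical name filter for the LoRA target projections listed above.
const LORA_TARGETS = /\.(q_proj|k_proj|v_proj|o_proj|up_proj|down_proj|gate_proj)\.weight$/;

// Standard LoRA merge: W' = W + (alpha / r) * (B @ A),
// with W: [out][in], B: [out][r], A: [r][in].
function mergeLora(W, A, B, alpha) {
  const r = A.length;
  const scale = alpha / r;
  return W.map((row, i) =>
    row.map((w, j) => {
      let delta = 0;
      for (let k = 0; k < r; k++) delta += B[i][k] * A[k][j];
      return w + scale * delta;
    })
  );
}

// Hypothetical helper: start from the reference tensors and swap in merged
// weights only for LoRA targets. Mamba layers, embedding, and lm_head keep
// their reference values even if the merged model has its own copies.
function patchLoraOnly(referenceWeights, mergedWeights) {
  const patched = new Map();
  for (const [name, refTensor] of referenceWeights) {
    const useMerged = LORA_TARGETS.test(name) && mergedWeights.has(name);
    patched.set(name, useMerged ? mergedWeights.get(name) : refTensor);
  }
  return patched;
}
```

Filtering by name against a fixed target list, rather than copying every tensor that changed, is what keeps the untouched layers identical to the reference export.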

---

## Usage (transformers.js)

```js
import { pipeline } from '@huggingface/transformers';

const interviewer = await pipeline(
  'text-generation',
  'bobber/lex-interviewer-nemotron-4b-grpo-v21',
  { dtype: 'q4', device: 'webgpu' }
);

const messages = [
  { role: 'system', content: 'You are an expert podcast interviewer...\n\nGuest: Andrej Karpathy' },
  { role: 'user', content: 'What is your next question?' }
];

const result = await interviewer(messages, {
  max_new_tokens: 800,
  do_sample: true,
  temperature: 0.7,
  chat_template_kwargs: { enable_thinking: true }
});
```
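For chat-style input, the transformers.js text-generation pipeline returns the conversation with the new assistant turn appended, so the generated text is the `content` of the last message. With thinking enabled, that content may also include the model's reasoning. Below is a minimal post-processing sketch. It assumes the reasoning ends with a `</think>` tag that survives decoding (worth confirming against the model's tokenizer config). `extractQuestion` is a hypothetical helper.

```js
// Hypothetical helper: pull the final question out of a pipeline result.
// Assumes `result[0].generated_text` is the chat history with the new assistant
// message last, and that any reasoning is terminated by a `</think>` tag.
function extractQuestion(result) {
  const content = result[0].generated_text.at(-1).content;
  const end = content.lastIndexOf('</think>');
  // Keep only the text after the last closing tag; if there is none, keep everything.
  const answer = end === -1 ? content : content.slice(end + '</think>'.length);
  return answer.trim();
}

// Usage with the `result` from the example above:
// console.log(extractQuestion(result));
```

Splitting on the closing tag alone also covers chat templates that place the opening `<think>` in the prompt, so only `</think>` appears in the generated text.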

---

## Live demo

[bobber/lex-interviewer-chat](https://huggingface.co/spaces/bobber/lex-interviewer-chat) — runs entirely in your browser via WebGPU.

---

## Related

- Project repo & docs: `bobber/lex-fridman-interviewer-project`
- GRPO v21 success analysis: `docs/GRPO_V21_SUCCESS_ANALYSIS.md` (in the project repo)
- ONNX retrospective: `docs/ONNX_RETROSPECTIVE.md` (in the project repo)