# Reproducing the Best Checkpoint (HSS=0.382)

## Quick Start

The `checkpoint.pt` in this repo is the final model. To run inference:

```bash
python script.py
```

To reproduce from scratch (~3hr on 1x RTX 4090):

```bash
bash reproduce.sh
```

## Exact Recipe

Architecture (unchanged across all 3 steps):
```
Perceiver: hidden=256, ff=1024, latent_tokens=256, latent_layers=7
  encoder_layers=4, decoder_layers=3, cross_attn_interval=4
  num_heads=4, kv_heads_cross=2, kv_heads_self=2
  qk_norm=True (L2), rms_norm=True, dropout=0.1
  segments=64, segment_param=midpoint_dir_len, segment_conf=True
  behind_emb_dim=8, vote_features=True, activation=gelu
```

All shared config lives in `configs/base.json`.
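
As a quick sanity check on the listing above, the hyperparameters can be written down as a small Python dataclass. This is only a sketch: `PerceiverConfig` and its helper properties are hypothetical, the field names mirror the listing rather than the actual keys in `configs/base.json`, and it assumes the `kv_heads_*` values mean that several query heads share one key/value head.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerceiverConfig:
    """Hypothetical container; field names follow the listing, not base.json."""
    hidden: int = 256
    ff: int = 1024
    latent_tokens: int = 256
    latent_layers: int = 7
    encoder_layers: int = 4
    decoder_layers: int = 3
    num_heads: int = 4
    kv_heads_cross: int = 2
    kv_heads_self: int = 2
    segments: int = 64
    dropout: float = 0.1

    @property
    def head_dim(self) -> int:
        # Per-head width; hidden must divide evenly across the heads.
        assert self.hidden % self.num_heads == 0
        return self.hidden // self.num_heads

    @property
    def queries_per_kv_head(self) -> int:
        # Assumption: fewer KV heads than query heads means KV sharing.
        assert self.num_heads % self.kv_heads_self == 0
        return self.num_heads // self.kv_heads_self


cfg = PerceiverConfig()
print(cfg.head_dim, cfg.queries_per_kv_head, cfg.ff // cfg.hidden)
```

Under these assumptions each head is 64 wide, two query heads share each KV head, and the feed-forward expansion factor is 4.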

### Step 1: 2048 Phase 1 (from scratch) — ~1.5hr

```
Data:       hf://usm3d/s23dr-2026-sampled_2048_v2:train (16,508 samples)
Steps:      0 -> 125,000 (242 epochs)
LR:         3e-4, warmup=10,000
Batch size: 32
Optimizer:  AdamW, betas=(0.9, 0.95), weight_decay=0.01
Sinkhorn:   eps=0.1, iters=20, dustbin=0.3
Conf:       weight=0.1, mode=sinkhorn, head_wd=0.1
Endpoint:   OFF
Aug:        rotate=True, flip=True
Seed:       353
```
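
The epoch count in the block above follows from steps, batch size, and dataset size. A minimal check, assuming every step consumes one full batch and that the sample count is the one listed (the `epochs` helper is illustrative, not part of the repo):

```python
def epochs(steps: int, batch_size: int, num_samples: int) -> float:
    """Passes over the dataset, assuming each step consumes one full batch."""
    return steps * batch_size / num_samples


# Step 1: 125,000 steps at batch size 32 over 16,508 samples.
step1_epochs = epochs(125_000, 32, 16_508)
print(f"Step 1: {step1_epochs:.1f} epochs")
```

This gives roughly 242 epochs, matching the figure in the recipe.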

Trains the Perceiver from random initialization on 2048-point samples. The
Sinkhorn optimal-transport loss softly matches predicted segments to
ground-truth segments.
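
To illustrate the matching idea, here is a plain-Python Sinkhorn sketch using the `eps=0.1` and `iters=20` values from the recipe. It is a simplified stand-in, not the repo's loss: it assumes uniform marginals, omits the dustbin entirely, and uses a made-up 3x3 cost matrix; `sinkhorn_plan` is a hypothetical name.

```python
import math
from typing import List


def sinkhorn_plan(cost: List[List[float]], eps: float = 0.1, iters: int = 20) -> List[List[float]]:
    """Entropic OT plan between uniform marginals (no dustbin in this sketch)."""
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: low cost -> large affinity.
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    row_mass, col_mass = 1.0 / n, 1.0 / m
    for _ in range(iters):
        # Alternately rescale rows and columns toward the target marginals.
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Made-up costs: predicted segment i is closest to ground-truth segment i.
cost = [
    [0.05, 0.90, 0.80],
    [0.70, 0.10, 0.90],
    [0.85, 0.80, 0.02],
]
plan = sinkhorn_plan(cost)
matches = [max(range(len(row)), key=lambda j: row[j]) for row in plan]
print(matches)
```

Each row of `plan` is a soft assignment of one predicted segment over the ground-truth segments; taking the row-wise argmax recovers a hard match.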

**Why 2048 first:** Training directly on 4096 overfits (1.47x train/val ratio
vs 1.19x for 2048). The 2048 model learns better-generalized representations.

**Output:** HSS ~0.28.

### Step 2: 4096 finetune (constant LR) — ~15min

```
Resume:     Step 1 -> step125000.pt
Data:       hf://usm3d/s23dr-2026-sampled_4096_v2:train (15,892 samples)
Steps:      125,001 -> 135,000 (10k steps)
LR:         3e-5 (constant, no cooldown)
Batch size: 64
Endpoint:   OFF
```

Switches the input from 2048 to 4096 points, increasing structural coverage
from 66% to 74%. The low LR (3e-5) preserves the learned representations while
the model adapts to the extra input; a higher LR (>1e-4) causes catastrophic
forgetting.

HSS rises from 0.28 to 0.35 within ~5k steps and plateaus by 10k steps.

**Output:** HSS ~0.35.

### Step 3: Cooldown with endpoint loss — ~1hr

```
Resume:     Step 2 -> step135000.pt
Data:       hf://usm3d/s23dr-2026-sampled_4096_v2:train
Steps:      135,001 -> 170,000 (35k steps)
LR:         3e-5, cooldown_start=150,000, cooldown_steps=20,000
            (constant 3e-5 for 15k steps, then linear decay to ~0 over 20k)
Batch size: 64
Endpoint:   weight=0.1
```
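
The Step 3 learning-rate schedule described above can be sketched as a small function. This assumes the decay is exactly linear and reaches zero at the final step; the recipe only says "~0", so the real endpoint may differ slightly. `step3_lr` is a hypothetical helper, not a function from the training code.

```python
def step3_lr(
    step: int,
    base_lr: float = 3e-5,
    cooldown_start: int = 150_000,
    cooldown_steps: int = 20_000,
) -> float:
    """Constant LR until cooldown_start, then linear decay to 0 (assumed endpoint)."""
    if step <= cooldown_start:
        return base_lr
    progress = min((step - cooldown_start) / cooldown_steps, 1.0)
    return base_lr * (1.0 - progress)


for s in (135_001, 150_000, 160_000, 170_000):
    print(s, step3_lr(s))
```

The constant phase covers steps 135,001 to 150,000 (15k steps) and the decay covers 150,000 to 170,000 (20k steps), which together account for the 35k steps of Step 3.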

Adds symmetric endpoint L1 loss (using detached sinkhorn assignment) to
tighten vertex precision. The sinkhorn loss alone operates on segment
midpoint/direction/length and doesn't directly penalize endpoint position error.
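
A minimal sketch of what a symmetric endpoint L1 term can look like for one matched pair of segments. It assumes "symmetric" means taking the cheaper of the two endpoint orderings and that the direction vector is unit-norm; the function names are hypothetical, and the real loss would apply this over all pairs selected by the detached Sinkhorn assignment.

```python
from typing import Sequence, Tuple

Point = Tuple[float, ...]
Segment = Tuple[Point, Point]


def endpoints(mid: Sequence[float], direction: Sequence[float], length: float) -> Segment:
    """Convert midpoint/direction/length to endpoints (direction assumed unit-norm)."""
    half = 0.5 * length
    start = tuple(m - half * d for m, d in zip(mid, direction))
    end = tuple(m + half * d for m, d in zip(mid, direction))
    return start, end


def l1(p: Point, q: Point) -> float:
    return sum(abs(a - b) for a, b in zip(p, q))


def symmetric_endpoint_l1(pred: Segment, gt: Segment) -> float:
    """L1 endpoint error, invariant to which end of a segment is listed first."""
    (p1, p2), (g1, g2) = pred, gt
    return min(l1(p1, g1) + l1(p2, g2), l1(p1, g2) + l1(p2, g1))


gt_seg = endpoints((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 2.0)
flipped = endpoints((0.0, 0.0, 0.0), (-1.0, 0.0, 0.0), 2.0)  # same segment, reversed
shifted = endpoints((0.1, 0.0, 0.0), (1.0, 0.0, 0.0), 2.0)   # midpoint off by 0.1
print(symmetric_endpoint_l1(flipped, gt_seg), symmetric_endpoint_l1(shifted, gt_seg))
```

The reversed segment incurs no penalty, while the shifted one is penalized at both endpoints, which is the direct vertex-position signal the midpoint/direction/length parametrization lacks.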

**Output:** HSS=0.382, F1=0.414.

### Key Numbers

| Stage | Steps | HSS | F1 | What changed |
|-------|-------|-----|-----|-------------|
| After Step 1 | 125k | 0.281 | 0.156 | Learned geometry from 2048 pts |
| After Step 2 | 135k | 0.351 | 0.190 | Coverage 66% -> 74% from 4096 pts |
| After Step 3 | 170k | **0.382** | **0.414** | Vertex precision from endpoint loss |

## Why This Works

1. **2048 training has low overfitting** (1.19x train/val ratio) — the model
   learns good representations without memorizing training samples.

2. **4096 data has higher coverage ceiling** (74% vs 66% structural points) —
   more of the building surface is observed, improving vertex recall.

3. **Gentle finetuning preserves representations** — at lr=3e-5, the model
   keeps its learned geometry understanding while adapting to the extra input.

4. **Endpoint loss tightens vertices** — the symmetric endpoint distance
   directly penalizes vertex position errors, which sinkhorn loss alone
   doesn't do (it operates on midpoint/direction/length parametrization).

## What Doesn't Work

- **Training 4096 from scratch:** overfits (1.47x train/val ratio), peaks at HSS 0.346
- **BuildingWorld pretraining:** representations are orthogonal to S23DR (cosine sim = 0.05)
- **Mixed BW+S23DR training:** BW data hurts due to domain gap
- **High dropout / weight decay:** prevents overfitting but causes underfitting
- **High finetune LR (>1e-4):** catastrophic forgetting of 2048 representations
- **Steeper cooldown (1e-5, a 30x drop from the Step 1 LR of 3e-4):** slightly worse than 3e-5 for this checkpoint

## Reproduction Results

### End-to-end reproductions

| Model | HSS | F1 | IoU | Notes |
|-------|-----|-----|-----|-------|
| Original | 0.382 | 0.414 | 0.370 | Shipped checkpoint |
| E2E repro #4 | 0.379 | 0.409 | 0.369 | Closest E2E, `repro_runs/e2e_repro4_hss379/` |
| Compiled repro (from submission codebase) | 0.376 | — | — | Best compiled repro from this codebase, `repro_runs/compiled_repro_hss376/` |
| E2E repro #3 | 0.375 | 0.404 | 0.367 | |
| Deterministic E2E | 0.372 | 0.398 | 0.368 | Bit-reproducible, `repro_runs/deterministic_hss372/` |
| E2E repro #5 | 0.349 | 0.373 | — | Outlier (early compile divergence) |

### Partial reproductions (isolating pipeline stages)

| Test | Starting from | HSS | Gap to original |
|------|--------------|-----|-----------------|
| Step 3 from orig Step 2 (run A) | Original step135000.pt | 0.382 | 0.000 |
| Step 3 from orig Step 2 (run B) | Original step135000.pt | 0.384 | +0.002 |
| Step 2+3 from orig Step 1 | Original step125000.pt | 0.377 | -0.005 |
| Step 1 from orig step 100k | Original step100000.pt | 0.285 (Step 1 HSS) | +0.004 vs 0.281 |

Step 3 from the same checkpoint reproduces to within 0.002 HSS. The E2E spread
(0.349-0.379) is dominated by torch.compile nondeterminism in Step 1.
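
The spread and per-run gaps can be recomputed from the end-to-end table. The values below are copied from that table; the dictionary and variable names are just for illustration.

```python
# HSS values from the "End-to-end reproductions" table.
e2e_hss = {
    "E2E repro #4": 0.379,
    "Compiled repro": 0.376,
    "E2E repro #3": 0.375,
    "Deterministic E2E": 0.372,
    "E2E repro #5": 0.349,
}
original_hss = 0.382

# Full range across all reproductions, and the range without the outlier.
spread = round(max(e2e_hss.values()) - min(e2e_hss.values()), 3)
no_outlier = [v for k, v in e2e_hss.items() if k != "E2E repro #5"]
spread_no_outlier = round(max(no_outlier) - min(no_outlier), 3)

# Gap of each reproduction to the shipped checkpoint.
gaps = {name: round(hss - original_hss, 3) for name, hss in e2e_hss.items()}

print(spread, spread_no_outlier)
for name, gap in gaps.items():
    print(f"{name}: {gap:+.3f}")
```

Excluding the outlier run, the remaining reproductions fall within 0.007 HSS of each other.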

### All benchmarks

| Model | Input | HSS | F1 | IoU | Notes |
|-------|-------|-----|-----|-----|-------|
| Handcrafted baseline | raw views | 0.307 | 0.404 | 0.260 | |
| h256+qk+ep (submitted) | 2048 | 0.365 | 0.388 | 0.360 | HSS=0.427 on test |
| Original 3-step | 2048 | 0.373 | 0.404 | 0.363 | |
| Original 3-step | 4096 | 0.382 | 0.414 | 0.370 | Best ever |
| Step3 repro from orig S2 | 4096 | 0.384 | 0.414 | — | Near-exact repro |
| E2E repro #4 | 4096 | 0.379 | 0.409 | 0.369 | |
| Compiled repro (submission codebase) | 4096 | 0.376 | — | — | Best compiled from this exact codebase |
| E2E repro #3 | 4096 | 0.375 | 0.404 | 0.367 | |
| Deterministic E2E | 4096 | 0.372 | 0.398 | 0.368 | Bit-reproducible |

## Code Equivalence Verification

| Test | Result |
|------|--------|
| Forward pass (same checkpoint, same input) | Bit-identical (0.00 diff) |
| Loss computation | Bit-identical (0.00 diff) |
| Gradient computation | 5e-8 max diff |
| Training from same seed | Bit-identical steps 1-44 |
| Step 3 from same checkpoint (2 runs) | HSS=0.382, 0.384 |
| Deterministic mode (2 runs) | Bit-identical (0.00 diff) |

## Reproducibility Notes

**Default mode** (`reproduce.sh`): Uses torch.compile (~3x faster). Each run
gets different Triton kernels, causing ~1e-8 floating-point divergence at a
random step (31-45). The divergence grows through chaotic training dynamics,
giving an HSS spread of ~0.03 across runs: E2E reproductions land in the
0.349-0.379 range.
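
One way to locate the step where two runs first diverge is to compare their per-step losses bit for bit. The sketch below assumes each run's losses are available as a list of Python floats (the logging format is not specified here), and `first_divergence` is a hypothetical helper.

```python
import struct
from typing import Optional, Sequence


def first_divergence(run_a: Sequence[float], run_b: Sequence[float]) -> Optional[int]:
    """Return the 1-based step of the first bitwise-different value, or None."""
    for step, (a, b) in enumerate(zip(run_a, run_b), start=1):
        # Compare raw IEEE-754 bytes so tiny differences are not hidden by rounding.
        if struct.pack("<d", a) != struct.pack("<d", b):
            return step
    return None


# Toy example: identical for two steps, then a ~1e-8 difference at step 3.
losses_a = [1.0, 0.5, 0.25, 0.125]
losses_b = [1.0, 0.5, 0.25 + 1e-8, 0.126]
print(first_divergence(losses_a, losses_b))
print(first_divergence(losses_a, losses_a))
```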

**Deterministic mode** (`--deterministic` flag): Disables torch.compile.
Bit-identical across runs with the same seed. HSS=0.372 (slightly lower than
compiled mode because eager-mode kernels follow a different numerical path).

**bad_samples.txt**: The shipped file has 156 entries, matching the original
training run. (Note: `wc -l` reports 155 because the last line lacks a trailing
newline.) Two further bad samples (`47b0e0ce19b`, `4b2d56eb3ef`) were
discovered after the original training run and are not in the shipped file.
They are genuinely bad (misaligned GT) but were part of the original training
data. Adding them to the file changes the batch iteration order and costs
~0.005 HSS in deterministic mode (0.372 -> 0.367) and ~0.04 in compiled mode
due to compounded torch.compile variance. Participants training from scratch
may wish to add these 2 entries for cleaner training data, but should expect
slightly different scores due to the changed iteration order.
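
The `wc -l` discrepancy comes from counting newline characters rather than entries. A small illustration with synthetic IDs (the real file contents are not reproduced here, and the helper names are made up):

```python
def count_entries(data: bytes) -> int:
    """Count non-empty lines, regardless of a trailing newline."""
    return sum(1 for line in data.splitlines() if line.strip())


def wc_l(data: bytes) -> int:
    """Mimic `wc -l`, which counts newline characters."""
    return data.count(b"\n")


# 156 synthetic entries joined by newlines, with no newline after the last one.
sample = b"\n".join(f"id{i:03d}".encode() for i in range(156))
print(wc_l(sample), count_entries(sample))
```

To check the shipped file the same way, pass it the bytes read from `bad_samples.txt`.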

The shipped `checkpoint.pt` is from the original training run (HSS=0.382).