---
license: apache-2.0
tags:
  - geometric-deep-learning
  - vision
  - multi-expert
  - patchwork
  - hypersphere
  - from-scratch
---

# GeoLIP ViT Base x3

Geometric vision system: 3-expert consensus soup + from-scratch ViT encoder.

## Components

### 1. Base Tier Soup (teacher)

An 800K-parameter geometric fusion of three pretrained vision experts on a 128-d hypersphere.

| Expert | Architecture | Training | Dim |
|--------|-------------|----------|-----|
| clip_l14_openai | ViT-L/14 | Text-supervised (CLIP) | 768 |
| dinov2_b14 | ViT-B/14 | Self-supervised (DINO) | 768 |
| siglip_b16_384 | ViT-B/16 | Sigmoid contrastive (SigLIP) | 768 |

**Pipeline:** GPA alignment at 768-d → PCA to 128-d → per-expert whitened Procrustes calibration → Procrustes-initialized projectors → geometric autograd training.
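
The alignment steps in this pipeline build on the orthogonal Procrustes problem. The sketch below shows only that basic building block on synthetic data, using NumPy. It is not the soup training script: the GPA and whitened variants are not reproduced here, and the function name and the random stand-in features are assumptions made for illustration.

```python
import numpy as np

def orthogonal_procrustes(source, target):
    """Return the orthogonal matrix R minimizing ||source @ R - target||_F."""
    # The SVD of the cross-covariance gives the closed-form solution R = U @ Vt.
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

rng = np.random.default_rng(0)
target = rng.standard_normal((1000, 128))                    # synthetic reference features
true_rot, _ = np.linalg.qr(rng.standard_normal((128, 128)))  # random orthogonal matrix
source = target @ true_rot.T                                 # same geometry, different basis

rot = orthogonal_procrustes(source, target)
aligned = source @ rot
max_err = float(np.abs(aligned - target).max())
print(f"max alignment error: {max_err:.2e}")
```

Because the two synthetic feature sets differ only by an orthogonal change of basis, the recovered map aligns them up to floating-point error. Real expert features would leave a residual.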

| Metric | Value |
|--------|-------|
| mAP (COCO) | 0.837 |
| Parameters | 799,952 |
| Anchors | 256 × 128-d |
| Consensus CV (768-d) | 0.2793 |
| Consensus CV (128-d) | 0.2731 |
| Optimizer | Adam, no weight decay |

### 2. From-Scratch ViT Encoder (student)

An 11M-parameter ViT trained from Xavier initialization against the soup's consensus targets. The encoder itself uses no pretrained weights; only the frozen teacher soup draws on pretrained experts. It follows the same architecture pattern as CaptionBERT.

| Config | Value |
|--------|-------|
| Layers | 6 |
| Hidden dim | 384 |
| Heads | 6 |
| FFN dim | 1536 |
| Patch size | 16 |
| Image size | 224 |
| Output dim | 128 (on hypersphere) |
| Parameters | 11,216,768 |
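
As a quick check on the configuration above, the following arithmetic derives the patch grid and per-head width from the listed values. It assumes 3-channel RGB input and non-overlapping patches. Whether a class token is prepended is not stated in this card, so the sequence length is left as the patch count.

```python
image_size = 224
patch_size = 16
hidden_dim = 384
num_heads = 6

patches_per_side = image_size // patch_size  # patches along each image edge
num_patches = patches_per_side ** 2          # tokens produced by the patch embedding
patch_input_dim = 3 * patch_size ** 2        # raw RGB values per patch (assumes 3 channels)
head_dim = hidden_dim // num_heads           # width of each attention head

print(patches_per_side, num_patches, patch_input_dim, head_dim)
```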

**Training:** Raw COCO images → encoder → 128-d embedding → frozen soup pipeline (constellation + patchwork + classifier) → BCE loss. Additional losses: InfoNCE + MSE against consensus targets, whitened Procrustes alignment, pentachoron CV (calibrated to measured consensus).
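
For readers unfamiliar with the InfoNCE term, here is a minimal NumPy sketch of a one-directional in-batch InfoNCE loss between student embeddings and consensus targets, along with the in-batch top-1 accuracy that a metric such as `nce_acc` plausibly tracks. The temperature of 0.07, the function name, and the exact accuracy definition are assumptions, not values taken from the training script.

```python
import numpy as np

def info_nce(student, targets, temperature=0.07):
    """One-directional InfoNCE over a batch of L2-normalized rows.

    Row i of `student` should match row i of `targets`; all other rows
    in the batch act as negatives. The temperature is an assumed value.
    """
    logits = student @ targets.T / temperature            # (B, B) scaled cosines
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(student))
    loss = -log_probs[idx, idx].mean()
    accuracy = (logits.argmax(axis=1) == idx).mean()      # in-batch top-1 accuracy
    return float(loss), float(accuracy)

rng = np.random.default_rng(0)
targets = rng.standard_normal((8, 128))
targets /= np.linalg.norm(targets, axis=1, keepdims=True)

# A student that reproduces the targets exactly gets a near-zero loss.
loss, acc = info_nce(targets.copy(), targets)
print(f"loss={loss:.4f} acc={acc:.2f}")
```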

#### Results (20 epochs, still converging)

| Metric | E1 | E10 | E20 |
|--------|-----|------|------|
| nce_acc | 0.340 | 0.887 | 0.972 |
| cos→consensus | 0.325 | 0.557 | 0.599 |
| R@1 (5K) | 0.032 | 0.254 | 0.323 |
| mAP | 0.151 | 0.380 | 0.429 |
| F1 | 0.162 | 0.361 | 0.418 |
| Active anchors | 95 | 96 | 94 |

All metrics were still climbing at E20. The model is expected to need 60-90 epochs to converge fully, based on the trajectory of CaptionBERT's text encoder.

## Architecture
```
Training (soup as teacher):
  3 expert features → Procrustes projectors → mean → L2-norm → 128-d consensus targets
  Raw images → from-scratch ViT → 128-d embedding
  Losses: InfoNCE + MSE + CV + BCE(through frozen soup) + Procrustes alignment
  Geometric autograd: tangential=0.01, separation=1.0

Inference (standalone):
  Raw image → ViT encoder → 128-d embedding (on hypersphere)
  No experts needed. Geometry is baked in.
```
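
The "mean → L2-norm" step in the training diagram can be sketched as follows. This assumes the three experts have already been projected to 128-d. The array layout and function name are illustrative, and random data stands in for real expert features.

```python
import numpy as np

def consensus_targets(projected):
    """Average per-expert embeddings and renormalize onto the unit hypersphere.

    projected: array of shape (num_experts, num_images, dim), assumed to be
    the outputs of the per-expert projectors.
    """
    mean = projected.mean(axis=0)
    return mean / np.linalg.norm(mean, axis=1, keepdims=True)

rng = np.random.default_rng(0)
projected = rng.standard_normal((3, 4, 128))  # synthetic stand-in for 3 experts
targets = consensus_targets(projected)

# On the unit sphere, cosine similarity reduces to a plain dot product.
student = targets.copy()
cos_to_consensus = float((student * targets).sum(axis=1).mean())
print(targets.shape, round(cos_to_consensus, 6))
```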

## Key Findings

- The 800K-parameter soup (0.837 mAP) beats both the 81.7M-parameter 34-expert soup (0.732 mAP) and the 75.6M-parameter 34-expert bank (0.782 mAP)
- Proper calibration (GPA + whitened Procrustes + measured CV target) is essential: without it, the constellation collapses to 1 of 256 active anchors
- From-scratch ViT learns the 3-expert consensus representation from raw pixels with the same convergence dynamics as CaptionBERT on text
- Cross-model weight cosine is 0.000 but activation Procrustes is 0.999 — the models encode identical geometry through completely different weight configurations
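
To make the anchor-collapse finding concrete, the sketch below counts "active" anchors under one plausible definition: an anchor is active if it is the nearest anchor, by cosine similarity, for at least one embedding. That definition, the function name, and the synthetic data are assumptions. The training scripts may measure activity differently.

```python
import numpy as np

def count_active_anchors(embeddings, anchors):
    """Count anchors that are the nearest anchor for at least one embedding."""
    nearest = (embeddings @ anchors.T).argmax(axis=1)  # rows are unit-norm
    return int(np.unique(nearest).size)

rng = np.random.default_rng(0)
anchors = rng.standard_normal((256, 128))
anchors /= np.linalg.norm(anchors, axis=1, keepdims=True)

# Collapsed case: every embedding is the same point, so one anchor wins.
collapsed = np.tile(anchors[:1], (1000, 1))
# Spread case: embeddings cover the anchors, so every anchor is used.
spread = anchors.copy()

n_collapsed = count_active_anchors(collapsed, anchors)
n_spread = count_active_anchors(spread, anchors)
print(n_collapsed, n_spread)
```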

## Files

- `base_tier_soup_calibrated.pt` — Trained soup (teacher)
- `geolip_vit_encoder_e20.pt` — ViT encoder at epoch 20
- `base_tier_soup_calibrated.py` — Soup training script
- `vit_encoder_from_scratch.py` — Encoder training script
- `runs/` — TensorBoard logs

## Data

- Training features: [AbstractPhil/bulk-coco-features](https://huggingface.co/datasets/AbstractPhil/bulk-coco-features)
- Images: COCO 2017 (118K train, 5K val)

## Usage
```python
import torch

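# Load the encoder checkpoint onto CPU.
# weights_only=False unpickles arbitrary Python objects, so only load files you trust.
ckpt = torch.load("geolip_vit_encoder_e20.pt", map_location="cpu", weights_only=False)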
# ckpt["encoder_state_dict"] — model weights
# ckpt["config"] — architecture config
# ckpt["mAP"], ckpt["cos"], ckpt["r1"] — metrics
```