---
license: apache-2.0
tags:
- geometric-deep-learning
- vision
- multi-expert
- patchwork
- hypersphere
- from-scratch
---

# GeoLIP ViT Base x3

Geometric vision system: 3-expert consensus soup + from-scratch ViT encoder.

## Components

### 1. Base Tier Soup (teacher)

800K parameter geometric fusion of 3 pretrained vision experts on a 128-d hypersphere.

| Expert | Architecture | Training | Dim |
|--------|-------------|----------|-----|
| clip_l14_openai | ViT-L/14 | Text-supervised (CLIP) | 768 |
| dinov2_b14 | ViT-B/14 | Self-supervised (DINO) | 768 |
| siglip_b16_384 | ViT-B/16 | Sigmoid contrastive (SigLIP) | 768 |

**Pipeline:** GPA alignment at 768-d → PCA to 128-d → per-expert whitened Procrustes calibration → Procrustes-initialized projectors → geometric autograd training.

| Metric | Value |
|--------|-------|
| mAP (COCO) | 0.837 |
| Parameters | 799,952 |
| Anchors | 256 × 128-d |
| Consensus CV (768-d) | 0.2793 |
| Consensus CV (128-d) | 0.2731 |
| Optimizer | Adam, no weight decay |

### 2. From-Scratch ViT Encoder (student)

11M parameter ViT trained from Xavier initialization against the soup's consensus targets. No pretrained weights anywhere. Same architecture pattern as CaptionBERT.

| Config | Value |
|--------|-------|
| Layers | 6 |
| Hidden dim | 384 |
| Heads | 6 |
| FFN dim | 1536 |
| Patch size | 16 |
| Image size | 224 |
| Output dim | 128 (on hypersphere) |
| Parameters | 11,216,768 |

**Training:** Raw COCO images → encoder → 128-d embedding → frozen soup pipeline (constellation + patchwork + classifier) → BCE loss. Additional losses: InfoNCE + MSE against consensus targets, whitened Procrustes alignment, pentachoron CV (calibrated to measured consensus).
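
As a sanity check on the encoder's 11,216,768 parameter figure, the count can be reproduced arithmetically from the config values above. This is a sketch under assumptions, not a description of the actual module: the class token, learned position embeddings, final LayerNorm, and the Linear → LayerNorm → Linear output head are a hypothetical layout chosen because it reproduces the stated total. `vit_encoder_from_scratch.py` remains the authority on the real structure.

```python
def linear(n_in, n_out):
    """Parameters in a dense layer with bias."""
    return n_in * n_out + n_out


def layer_norm(dim):
    """Parameters in a LayerNorm: one scale and one shift vector."""
    return 2 * dim


def vit_param_count(dim=384, ffn=1536, layers=6, patch=16, image=224,
                    out_dim=128, channels=3):
    """Count parameters for a ViT with the config from the table above.

    Everything marked "assumed" is a guess about the layout, not taken
    from the model card.
    """
    tokens = (image // patch) ** 2 + 1             # 196 patches + class token (assumed)
    embed = linear(patch * patch * channels, dim)  # patch projection with bias
    embed += dim                                   # class token (assumed)
    embed += tokens * dim                          # learned position embeddings (assumed)

    # One pre-norm transformer block: attention (QKV + output projection) and FFN.
    block = (layer_norm(dim) + linear(dim, 3 * dim) + linear(dim, dim)
             + layer_norm(dim) + linear(dim, ffn) + linear(ffn, dim))

    # Assumed output path: final LayerNorm, then Linear -> LayerNorm -> Linear.
    head = (layer_norm(dim) + linear(dim, dim)
            + layer_norm(dim) + linear(dim, out_dim))

    return embed + layers * block + head


total = vit_param_count()
print(f"{total:,}")
```

The number of attention heads does not appear in the calculation because splitting the 384-d hidden state across 6 heads does not change the size of the projection matrices.
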

#### Results (20 epochs, still converging)

| Metric | E1 | E10 | E20 |
|--------|-----|------|------|
| nce_acc | 0.340 | 0.887 | 0.972 |
| cos→consensus | 0.325 | 0.557 | 0.599 |
| R@1 (5K) | 0.032 | 0.254 | 0.323 |
| mAP | 0.151 | 0.380 | 0.429 |
| F1 | 0.162 | 0.361 | 0.418 |
| Active anchors | 95 | 96 | 94 |

All metrics except the active-anchor count are still climbing at E20. Based on CaptionBERT's text encoder trajectory, the model likely needs 60-90 epochs to fully converge.

## Architecture

```
Training (soup as teacher):
  3 expert features → Procrustes projectors → mean → L2-norm → 128-d consensus targets
  Raw images → from-scratch ViT → 128-d embedding
  Losses: InfoNCE + MSE + CV + BCE(through frozen soup) + Procrustes alignment
  Geometric autograd: tangential=0.01, separation=1.0

Inference (standalone):
  Raw image → ViT encoder → 128-d embedding (on hypersphere)
  No experts needed. Geometry is baked in.
```

## Key Findings

- The 800K-parameter soup (0.837 mAP) beats both an 81.7M-parameter 34-expert soup (0.732 mAP) and a 75.6M-parameter 34-expert bank (0.782 mAP).
- Proper calibration (GPA + whitened Procrustes + measured CV target) is essential: without it, the constellation collapses to 1 of 256 active anchors.
- The from-scratch ViT learns the 3-expert consensus representation from raw pixels with the same convergence dynamics as CaptionBERT on text.
- Cross-model weight cosine is 0.000 but activation Procrustes is 0.999: the models encode identical geometry through completely different weight configurations.

## Files

- `base_tier_soup_calibrated.pt`: Trained soup (teacher)
- `geolip_vit_encoder_e20.pt`: ViT encoder at epoch 20
- `base_tier_soup_calibrated.py`: Soup training script
- `vit_encoder_from_scratch.py`: Encoder training script
- `runs/`: Tensorboard logs

## Data

- Training features: [AbstractPhil/bulk-coco-features](https://huggingface.co/datasets/AbstractPhil/bulk-coco-features)
- Images: COCO 2017 (118K train, 5K val)

## Usage

```python
import torch

# Load the encoder checkpoint. weights_only=False unpickles arbitrary Python
# objects, so only load checkpoint files from a source you trust.
ckpt = torch.load("geolip_vit_encoder_e20.pt", map_location="cpu", weights_only=False)

# Checkpoint contents:
# ckpt["encoder_state_dict"]: model weights
# ckpt["config"]: architecture config
# ckpt["mAP"], ckpt["cos"], ckpt["r1"]: metrics
```
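
Once embeddings are extracted, the consensus-target step from the Architecture section (mean of the projected expert vectors, then L2 normalization) and the two alignment metrics in the results table can be illustrated in a few lines. The sketch below uses plain Python lists and 3-d toy vectors in place of 128-d tensors so it runs without dependencies. The function names are hypothetical, and it assumes that R@1 counts a hit when a student embedding's nearest consensus target is its own paired target, which this card does not state.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length so it lies on the hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def consensus_target(expert_vecs):
    """Average the per-expert projected vectors, then L2-normalize the mean."""
    dim = len(expert_vecs[0])
    mean = [sum(vec[i] for vec in expert_vecs) / len(expert_vecs)
            for i in range(dim)]
    return l2_normalize(mean)


def cosine(a, b):
    """Cosine similarity; for unit vectors this equals the dot product."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def recall_at_1(queries, gallery):
    """Fraction of queries whose most similar gallery item shares their index."""
    hits = 0
    for i, query in enumerate(queries):
        sims = [cosine(query, item) for item in gallery]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(best == i)
    return hits / len(queries)


# Toy stand-ins: three projected expert vectors per image, two images.
experts_a = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1], [1.0, 0.0, 0.0]]
experts_b = [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0], [0.0, 1.0, 0.0]]
targets = [consensus_target(experts_a), consensus_target(experts_b)]

# Hypothetical student outputs for the same two images.
student = [l2_normalize([0.8, 0.2, 0.0]), l2_normalize([0.1, 0.7, 0.2])]

mean_cos = sum(cosine(s, t) for s, t in zip(student, targets)) / len(student)
print("cos to consensus:", round(mean_cos, 3))
print("R@1:", recall_at_1(student, targets))
```

In practice the same computation would be done with batched tensor operations: a single matrix product between the normalized student and target matrices gives all pairwise similarities at once.
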