AbstractPhil/geolip-vit-base-x3

@@ -1,198 +1,111 @@
 ---
 license: apache-2.0
 ---
-# Experiment 2.5 Update: COCO convergence is slow but steady.
-BCE loss isn't the best catalyst for geometry but it does work to funnel through an aligned transformer.
-I underestimated the complexity of associative cross-modal differences, but it is converging. Shared space is a very tricky
-catalyst to teach as an associative connection. Routing is easy, distilling is not as easy with multimodal structures and multiple
-adjacent representations used as loss learning targets.
-If this fails to meet direct expectations, I'll form a proper hub and teach using the bertenstein method. Bertenstein works because it's
-always expecting to hear from the experts and there is always one anchored expert in charge.
-The expert/student distillation process requires skilled teachers with similar utility, which is different than simply funneling
-information through a route and pooling it.
-geolip-captionbert-8192 accepts this pooled, funneled information and produces useful output because the shared
-expert information has similar access utility.
-In either case, geolip-captionbert-8192 was trained from scratch and so is this model. They are not inheriting weights from any large-scale training run;
-they are inheriting the geometry and structure through distillation in order to represent complex structure that quite simply should not exist
-in the smaller model by direct implicit learning.
-geolip-vit-x3 must learn to predict the pixel data using the output of the experts as markers for loss, which means it can never get a full picture of anything outside of its own tools.
-This model is exceptionally small, absurdly small even by ViT standards, because even at this size it is too much. The model cannot
-overfit if it uses every tool at its disposal; it will train indefinitely unless a cascade overflow happens, a math continuity corruption
-occurs, or the substructure collapses to a simpler shortcut-centric behavior that would require scrambling.
-The anchors are strong enough and tuned to the experts, the external losses are tuned to teach the expert responses, the expert data is used
-as loss methods of attenuation, and the structure conforms to those losses specifically because it's required to teach the model to be
-standalone and compliant without requiring the experts later.
-I gave the model everything I could geometrically, and it must discover the way to connect them.
-I'm teaching SigLIP, DINOv2 and CLIP ViT to communicate on the same manifold. They are essentially speaking three offshoot dialects
-that evolved, a thousand years later, from the same Roman tongue.
-The fact that this works at all is a testament to the hypersphere attenuation.
-```
-=================================================================
-GEOLIP VISION ENCODER — FROM SCRATCH
-  ViT: 6L/384d/6h, patch16
-  196 patches + CLS → 128-d output
-  Device: cuda
-=================================================================
-  Loading soup...
-  Soup: mAP=0.837 CV_target=0.2731
-  train: loaded cached targets (118,287)
-  val: loaded cached targets (5,000)
-  Caching train images (118,287)...
-=================================================================
-BUILD ENCODER
-=================================================================
-  Architecture: 6L/384d/6h, patch16
-  Input: 224×224 → 196 patches
-  Output: 128-d (on hypersphere)
-  Parameters: 11,216,768
-=================================================================
-TRAINING
-  20 epochs, lr=0.0003, batch=48
-  Losses: InfoNCE + MSE + CV + BCE + Procrustes alignment
-  CV target: 0.2731
-  Images: train=118,287 val=5,000 (cached as tensors)
-=================================================================
-E 1/20 train: 100%|██████████| 2465/2465 [02:44<00:00, 14.97batch/s, cos=0.258, loss=2.6911, nce_acc=0.339, ordered=1]
-  E1 train: 165s loss=2.6891 nce=2.2529 mse=0.0120 bce=0.1963 nce_acc=0.340
-  E1 val:   mAP=0.151 F1=0.162 R@1=0.032 cos=0.325 cv=0.2663 anchors=95/256 seen=5000/5000 ★
-E 2/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.32batch/s, cos=0.368, loss=1.7954, nce_acc=0.553, ordered=1]
-  E2 train: 161s loss=1.7948 nce=1.4297 mse=0.0099 bce=0.1473 nce_acc=0.553
-  E2 val:   mAP=0.206 F1=0.197 R@1=0.062 cos=0.390 cv=0.2552 anchors=99/256 seen=5000/5000 ★
-E 3/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.37batch/s, cos=0.416, loss=1.4860, nce_acc=0.641, ordered=1]
-  E3 train: 160s loss=1.4854 nce=1.1484 mse=0.0092 bce=0.1338 nce_acc=0.641
-  E3 val:   mAP=0.246 F1=0.244 R@1=0.091 cos=0.427 cv=0.2234 anchors=98/256 seen=5000/5000 ★
-E 4/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.40batch/s, cos=0.448, loss=1.2913, nce_acc=0.695, ordered=1]
-  E4 train: 160s loss=1.2910 nce=0.9727 mse=0.0087 bce=0.1265 nce_acc=0.695
-  E4 val:   mAP=0.272 F1=0.266 R@1=0.113 cos=0.453 cv=0.2078 anchors=99/256 seen=5000/5000 ★
-E 5/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.40batch/s, cos=0.475, loss=1.1334, nce_acc=0.743, ordered=1]
-  E5 train: 160s loss=1.1331 nce=0.8303 mse=0.0083 bce=0.1205 nce_acc=0.743
-  E5 val:   mAP=0.296 F1=0.292 R@1=0.139 cos=0.473 cv=0.2133 anchors=98/256 seen=5000/5000 ★
-E 6/20 train: 100%|██████████| 2465/2465 [02:37<00:00, 15.63batch/s, cos=0.499, loss=1.0005, nce_acc=0.784, ordered=1]
-  E6 train: 158s loss=1.0003 nce=0.7111 mse=0.0079 bce=0.1158 nce_acc=0.784
-  E6 val:   mAP=0.317 F1=0.311 R@1=0.164 cos=0.495 cv=0.1835 anchors=98/256 seen=5000/5000 ★
-E 7/20 train: 100%|██████████| 2465/2465 [02:38<00:00, 15.60batch/s, cos=0.520, loss=0.8947, nce_acc=0.815, ordered=1]
-  E7 train: 158s loss=0.8943 nce=0.6172 mse=0.0075 bce=0.1115 nce_acc=0.815
-  E7 val:   mAP=0.337 F1=0.335 R@1=0.190 cos=0.513 cv=0.1809 anchors=96/256 seen=5000/5000 ★
-E 8/20 train: 100%|██████████| 2465/2465 [02:38<00:00, 15.59batch/s, cos=0.539, loss=0.8030, nce_acc=0.842, ordered=1]
-  E8 train: 158s loss=0.8028 nce=0.5365 mse=0.0072 bce=0.1076 nce_acc=0.843
-  E8 val:   mAP=0.344 F1=0.331 R@1=0.207 cos=0.523 cv=0.1779 anchors=95/256 seen=5000/5000 ★
-E 9/20 train: 100%|██████████| 2465/2465 [02:38<00:00, 15.58batch/s, cos=0.557, loss=0.7229, nce_acc=0.866, ordered=1]
-  E9 train: 158s loss=0.7228 nce=0.4665 mse=0.0070 bce=0.1041 nce_acc=0.866
-  E9 val:   mAP=0.361 F1=0.349 R@1=0.218 cos=0.537 cv=0.1764 anchors=95/256 seen=5000/5000 ★
-E10/20 train:  36%|███▌      | 892/2465 [00:57<01:40, 15.69batch/s, cos=0.572, loss=0.6548, nce_acc=0.887, ordered=1]
-```
-# Experiment 2.5:
-The xavier aligned and procrustes embedding array attached to a standard patch16 subset should suffice.
-I'll be training this like CaptionBERT but with a twist, the soup expert is the alignment bank for this one, and I trained it first instead of later.
-The alignment and R@1 are nearly perfect, so it should be cohesive enough through the chain of conceptualization to coalesce through the implications.
-Whether the actual patches will learn from the embedding and encoding spectrum, and how quickly I can make them learn, is another story.
-The output this encoder produces is a 128 dimensional enriched representational lookup plane on a hypersphere.
-This is more than enough information to house access to any data route that exists.
-The dimensional spectrum of a 5d object is so expansive and so enriched, that the entire spectrum of this shape requires a specific
-curation of the behavior. This is what most of the mechanisms are tasked with overall, pruning the effect of rigidity indifference preservation
-on the hypersphere represented structure.
-In other words, that 128 dimensions represents more information than I could express with words.
-# Experiment 2:
-95/256 anchors survive, emergent geometric structure formed.
-R@1 = 97.1%, not quite there but getting close. Experiment 2 was successful enough to push harder in this direction.
-Anchor collapse says it doesn't need all those anchors. It started grabbing at more by the end, which means
-the system was aligned and then started growing further on a constraint that I was unaware of.
-This drift curve needs to be controlled. Direct anchored emergence while training is risky. The bank itself
-survived so well because it was anchored post training, which gave added cohesion and complexity association
-that I have yet to discover the runtime process to train. I will be analyzing the emergence to preserve the anchoring.
 ```
-=================================================================
-PHASE 5: TRAINING
-  20 epochs, lr=0.001, CV target=0.2731
-=================================================================
-  E 1: mAP=0.788 F1=0.731 R@1=0.971 cos=0.806 cv=0.1213 anchors=226/256 nce=0.999 loss=0.1676 ★
-  E 2: mAP=0.803 F1=0.742 R@1=0.971 cos=0.809 cv=0.1178 anchors=200/256 nce=0.999 loss=0.1459 ★
-  E 3: mAP=0.810 F1=0.735 R@1=0.973 cos=0.808 cv=0.1197 anchors=161/256 nce=0.999 loss=0.1431 ★
-  E 4: mAP=0.817 F1=0.752 R@1=0.971 cos=0.811 cv=0.1262 anchors=131/256 nce=0.999 loss=0.1404 ★
-  E 5: mAP=0.823 F1=0.755 R@1=0.971 cos=0.812 cv=0.1232 anchors=113/256 nce=0.999 loss=0.1389 ★
-  E 6: mAP=0.825 F1=0.755 R@1=0.972 cos=0.815 cv=0.1105 anchors=104/256 nce=0.999 loss=0.1379 ★
-  E 7: mAP=0.827 F1=0.767 R@1=0.970 cos=0.814 cv=0.1125 anchors=101/256 nce=0.999 loss=0.1369 ★
-  E 8: mAP=0.829 F1=0.763 R@1=0.971 cos=0.815 cv=0.1239 anchors=99/256 nce=0.999 loss=0.1361 ★
-  E 9: mAP=0.832 F1=0.764 R@1=0.972 cos=0.815 cv=0.1164 anchors=98/256 nce=0.999 loss=0.1355 ★
-  E10: mAP=0.833 F1=0.765 R@1=0.968 cos=0.814 cv=0.1166 anchors=99/256 nce=0.999 loss=0.1345 ★
-  E11: mAP=0.834 F1=0.763 R@1=0.971 cos=0.814 cv=0.1214 anchors=98/256 nce=0.999 loss=0.1346 ★
-  E12: mAP=0.833 F1=0.764 R@1=0.973 cos=0.813 cv=0.1200 anchors=95/256 nce=0.999 loss=0.1343
-  E13: mAP=0.836 F1=0.761 R@1=0.972 cos=0.813 cv=0.1081 anchors=94/256 nce=0.999 loss=0.1338 ★
-  E14: mAP=0.836 F1=0.772 R@1=0.973 cos=0.812 cv=0.1170 anchors=95/256 nce=0.999 loss=0.1334
-  E15: mAP=0.835 F1=0.774 R@1=0.970 cos=0.812 cv=0.1223 anchors=95/256 nce=0.999 loss=0.1338
-  E16: mAP=0.837 F1=0.777 R@1=0.968 cos=0.812 cv=0.1225 anchors=96/256 nce=1.000 loss=0.1339 ★
-  E17: mAP=0.834 F1=0.772 R@1=0.973 cos=0.811 cv=0.1089 anchors=95/256 nce=0.999 loss=0.1327
-  E18: mAP=0.834 F1=0.770 R@1=0.973 cos=0.812 cv=0.1156 anchors=95/256 nce=0.999 loss=0.1321
-  E19: mAP=0.834 F1=0.773 R@1=0.970 cos=0.811 cv=0.1224 anchors=96/256 nce=0.999 loss=0.1328
-  E20: mAP=0.835 F1=0.770 R@1=0.971 cos=0.812 cv=0.1159 anchors=96/256 nce=0.999 loss=0.1328
-  Best mAP: 0.837
-  CV target: 0.2731
 ```
-# Experiment 1:
-Total collapse. The three models did not conform and the patchwork did not learn. The objectives are not correct.
-One anchor was defaulted to and none of the others were utilized. The memory bank solves this problem through queue assessment with the InfoNCE hub processing,
-but this model is a different form of anchoring that did not work.
-THE ENTIRE MODEL became the anchor, instead of the anchorpoints within the model. I'm thinking there wasn't enough scattering, so I'll try some additional tweaks.
-## Post
-```
-Active anchors: 1/256 (0.4%)
-Every single image → anchor 65
-Anchor entropy: 0.0000
-Anchors within cos>0.5 per image: 1.0
-Nearest anchor dist: 0.016 — next nearest: 0.665
-Effective dim: 23.6/128
-Top-20 SVs explain 99.2%
-Self-sim off-diag: 0.969
-Expert uniqueness: 0.0008–0.0011
 ```
-There is only one active anchor, which is essentially CLS. The uniqueness collapsed. The distance is fine, the entropy is dead.
-This is a shortcut bypass; additional nonlinearity must be added.
-## Assessment
-Without the centered procrustes loss the same result happened. The collapse forms around one of the earlier anchors, around the outside middlepoint of
-where all three models are simultaneously rotating around a point, which is not the direct center.
-This point has noise, invalidity, incorrect association, and additional problems based on the attention mechanisms internally to the models queried.
-## Hypothesis based on research
-The procrustes alignment must align centerwise, and it must be defined specifically to specifications.

 ---
 license: apache-2.0
+tags:
+  - geometric-deep-learning
+  - vision
+  - multi-expert
+  - patchwork
+  - hypersphere
+  - from-scratch
 ---
+# GeoLIP ViT Base x3
+Geometric vision system: 3-expert consensus soup + from-scratch ViT encoder.
+## Components
+### 1. Base Tier Soup (teacher)
+A geometric fusion of 3 pretrained vision experts on a 128-d hypersphere, with roughly 800K parameters (799,952).
+| Expert | Architecture | Training | Dim |
+|--------|-------------|----------|-----|
+| clip_l14_openai | ViT-L/14 | Text-supervised (CLIP) | 768 |
+| dinov2_b14 | ViT-B/14 | Self-supervised (DINOv2) | 768 |
+| siglip_b16_384 | ViT-B/16 | Sigmoid contrastive (SigLIP) | 768 |
+**Pipeline:** GPA alignment at 768-d → PCA to 128-d → per-expert whitened Procrustes calibration → Procrustes-initialized projectors → geometric autograd training.
+| Metric | Value |
+|--------|-------|
+| mAP (COCO) | 0.837 |
+| Parameters | 799,952 |
+| Anchors | 256 × 128-d |
+| Consensus CV (768-d) | 0.2793 |
+| Consensus CV (128-d) | 0.2731 |
+| Optimizer | Adam, no weight decay |
+### 2. From-Scratch ViT Encoder (student)
+An 11M-parameter ViT trained from Xavier initialization against the soup's consensus targets. The encoder itself inherits no pretrained weights; the pretrained experts appear only as distillation targets. Same architecture pattern as CaptionBERT.
+| Config | Value |
+|--------|-------|
+| Layers | 6 |
+| Hidden dim | 384 |
+| Heads | 6 |
+| FFN dim | 1536 |
+| Patch size | 16 |
+| Image size | 224 |
+| Output dim | 128 (on hypersphere) |
+| Parameters | 11,216,768 |
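As a sanity check on the config table, the arithmetic below recomputes the patch count and the parameter total. It is a rough sketch: the backbone terms follow the standard ViT layout, while the projection head (`Linear(384, 384)`, `LayerNorm`, `Linear(384, 128)`) is a hypothetical layout picked only because it reproduces the reported 11,216,768. It is not taken from `vit_encoder_from_scratch.py`.

```python
# Parameter arithmetic for a 6-layer, 384-d ViT (patch 16, 224x224 RGB input).
# ASSUMPTION: the head layout below is a guess that matches the reported total.

def linear(n_in, n_out):
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out

def layer_norm(dim):
    """Scale plus shift of a LayerNorm."""
    return 2 * dim

def vit_param_count(layers=6, dim=384, ffn=1536, patch=16, image=224, out_dim=128):
    n_patches = (image // patch) ** 2              # 14 * 14 = 196
    patch_embed = linear(3 * patch * patch, dim)   # conv over 16x16x3 patches
    cls_token = dim
    pos_embed = (n_patches + 1) * dim              # 196 patches + CLS
    block = (
        linear(dim, 3 * dim)      # fused q, k, v projection
        + linear(dim, dim)        # attention output projection
        + linear(dim, ffn)        # FFN up-projection
        + linear(ffn, dim)        # FFN down-projection
        + 2 * layer_norm(dim)     # one norm before attention, one before the FFN
    )
    final_norm = layer_norm(dim)
    backbone = patch_embed + cls_token + pos_embed + layers * block + final_norm
    # Hypothetical head: Linear(dim, dim) -> LayerNorm(dim) -> Linear(dim, out_dim)
    head = linear(dim, dim) + layer_norm(dim) + linear(dim, out_dim)
    return n_patches, backbone, head

n_patches, backbone, head = vit_param_count()
print(n_patches, backbone, head, backbone + head)
```

The number of attention heads does not enter the count, since splitting 384 dimensions across 6 heads only reshapes the same projection matrices.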
+**Training:** Raw COCO images → encoder → 128-d embedding → frozen soup pipeline (constellation + patchwork + classifier) → BCE loss. Additional losses: InfoNCE + MSE against consensus targets, whitened Procrustes alignment, pentachoron CV (calibrated to measured consensus).
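The two target-matching losses named above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the training script: the temperature of 0.07, the function names, and the random stand-in data are all hypothetical, and the BCE, CV, and Procrustes terms are left out.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Project rows onto the unit hypersphere."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def infonce_and_mse(student, target, temperature=0.07):
    """student, target: (batch, dim); row i of each comes from the same image.

    ASSUMPTION: temperature=0.07 is a common default, not a value from the source.
    """
    s = l2_normalize(student)
    t = l2_normalize(target)
    logits = s @ t.T / temperature                        # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(s))
    nce = float(-log_probs[idx, idx].mean())              # matching pairs are the positives
    mse = float(((s - t) ** 2).mean())
    return nce, mse

rng = np.random.default_rng(0)
target = l2_normalize(rng.normal(size=(48, 128)))         # batch of 48, 128-d targets
student = target + 0.05 * rng.normal(size=target.shape)   # noisy stand-in for encoder output

nce_noisy, mse_noisy = infonce_and_mse(student, target)
nce_clean, mse_clean = infonce_and_mse(target, target)
print(nce_noisy, mse_noisy, nce_clean, mse_clean)
```

InfoNCE only asks that each embedding rank its own target above the other targets in the batch, while MSE pulls it toward the exact target point, which is presumably why the two are combined.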
+#### Results (20 epochs, still converging)
+| Metric | E1 | E10 | E20 |
+|--------|-----|------|------|
+| nce_acc | 0.340 | 0.887 | 0.972 |
+| cos→consensus | 0.325 | 0.557 | 0.599 |
+| R@1 (5K) | 0.032 | 0.254 | 0.323 |
+| mAP | 0.151 | 0.380 | 0.429 |
+| F1 | 0.162 | 0.361 | 0.418 |
+| Active anchors | 95 | 96 | 94 |
+All accuracy metrics were still climbing at E20, while the active-anchor count stayed flat at roughly 95. The model is expected to need 60-90 epochs to fully converge (matching CaptionBERT's text encoder trajectory).
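For readers who want to reproduce numbers like R@1 and cos→consensus on their own embeddings, here is a minimal sketch. It assumes R@1 means "the nearest consensus target, by cosine, is the one belonging to the same image"; the evaluation code may define it differently. The random arrays stand in for real embeddings.

```python
import numpy as np

def retrieval_metrics(embeddings, targets):
    """R@1 and mean cosine between row-aligned embeddings and consensus targets.

    ASSUMPTION: row i of `embeddings` and row i of `targets` describe the same image.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    sims = e @ t.T                                   # (n, n) cosine similarity matrix
    r_at_1 = float((sims.argmax(axis=1) == np.arange(len(e))).mean())
    mean_cos = float(np.diag(sims).mean())
    return r_at_1, mean_cos

rng = np.random.default_rng(0)
targets = rng.normal(size=(1000, 128))
noisy = targets + rng.normal(size=targets.shape)     # stand-in for an imperfect encoder

r1_noisy, cos_noisy = retrieval_metrics(noisy, targets)
r1_exact, cos_exact = retrieval_metrics(targets, targets)
print(r1_noisy, cos_noisy, r1_exact, cos_exact)
```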
+## Architecture
 ```
+Training (soup as teacher):
+  3 expert features → Procrustes projectors → mean → L2-norm → 128-d consensus targets
+  Raw images → from-scratch ViT → 128-d embedding
+  Losses: InfoNCE + MSE + CV + BCE(through frozen soup) + Procrustes alignment
+  Geometric autograd: tangential=0.01, separation=1.0
+Inference (standalone):
+  Raw image → ViT encoder → 128-d embedding (on hypersphere)
+  No experts needed. Geometry is baked in.
 ```
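The "Procrustes projectors → mean → L2-norm" step can be illustrated with plain orthogonal Procrustes. This is a simplified sketch: it skips the GPA, PCA, and whitening stages and the learned projectors described earlier, and the function names and synthetic experts are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def procrustes_rotation(source, reference):
    """Orthogonal R minimizing ||source @ R - reference||_F (classic SVD solution)."""
    u, _, vt = np.linalg.svd(source.T @ reference)
    return u @ vt

def consensus_targets(expert_feats, reference):
    """Rotate every expert into the reference frame, average, and renormalize."""
    aligned = [f @ procrustes_rotation(f, reference) for f in expert_feats]
    return l2_normalize(np.mean(aligned, axis=0))

rng = np.random.default_rng(0)
shared = l2_normalize(rng.normal(size=(512, 128)))     # stand-in for the common geometry

# Three synthetic "experts": the same geometry seen through different random rotations.
experts = []
for _ in range(3):
    q, _ = np.linalg.qr(rng.normal(size=(128, 128)))   # random orthogonal matrix
    experts.append(shared @ q)

targets = consensus_targets(experts, reference=experts[0])
print(np.abs(targets - experts[0]).max())              # alignment error in the reference frame
```

In this toy setting the experts differ only by a rotation, so alignment recovers the shared geometry almost exactly; real experts would leave a residual that the averaging step smooths over.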
+## Key Findings
+- The 800K-parameter soup (0.837 mAP) beats both the 81.7M-parameter 34-expert soup (0.732 mAP) and the 75.6M-parameter 34-expert bank (0.782 mAP)
+- Proper calibration (GPA + whitened Procrustes + measured CV target) is essential: without it, the constellation collapses to 1/256 active anchors
+- From-scratch ViT learns the 3-expert consensus representation from raw pixels with the same convergence dynamics as CaptionBERT on text
+- Cross-model weight cosine is 0.000 but activation Procrustes is 0.999 — the models encode identical geometry through completely different weight configurations
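Anchor collapse of the kind mentioned above can be checked with a short diagnostic. The sketch below assumes an anchor counts as "active" when at least one embedding has it as its nearest anchor by cosine similarity. The source does not spell out its definition, so treat this as one plausible reading; the synthetic data is purely illustrative.

```python
import numpy as np

def anchor_usage(embeddings, anchors):
    """Return (active anchor count, assignment entropy in nats).

    ASSUMPTION: each embedding is assigned to its nearest anchor by cosine similarity.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    nearest = (e @ a.T).argmax(axis=1)
    counts = np.bincount(nearest, minlength=len(a))
    probs = counts[counts > 0] / counts.sum()
    entropy = abs(float((probs * np.log(probs)).sum()))   # 0 means total collapse
    return int((counts > 0).sum()), entropy

rng = np.random.default_rng(0)
anchors = rng.normal(size=(256, 128))                     # 256 anchors in 128-d

spread = rng.normal(size=(5000, 128))                     # embeddings using the whole sphere
collapsed = anchors[65] + 0.01 * rng.normal(size=(5000, 128))  # everything near one anchor

active_spread, entropy_spread = anchor_usage(spread, anchors)
active_collapsed, entropy_collapsed = anchor_usage(collapsed, anchors)
print(active_spread, entropy_spread, active_collapsed, entropy_collapsed)
```

Tracking both numbers per epoch is useful because the count alone hides how unevenly the surviving anchors are used.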
+## Files
+- `base_tier_soup_calibrated.pt` — Trained soup (teacher)
+- `geolip_vit_encoder_e20.pt` — ViT encoder at epoch 20
+- `base_tier_soup_calibrated.py` — Soup training script
+- `vit_encoder_from_scratch.py` — Encoder training script
+- `runs/` — Tensorboard logs
+## Data
+- Training features: [AbstractPhil/bulk-coco-features](https://huggingface.co/datasets/AbstractPhil/bulk-coco-features)
+- Images: COCO 2017 (118K train, 5K val)
+## Usage
+```python
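+import torch
+
+# Load the epoch-20 encoder checkpoint onto the CPU.
+# weights_only=False unpickles arbitrary Python objects, so only use it on files you trust.
+ckpt = torch.load("geolip_vit_encoder_e20.pt", map_location="cpu", weights_only=False)
+
+state_dict = ckpt["encoder_state_dict"]  # model weights
+config = ckpt["config"]                  # architecture config
+metrics = {k: ckpt[k] for k in ("mAP", "cos", "r1")}  # validation metrics stored with the checkpoint
+print(config, metrics)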
 ```