| --- |
| license: apache-2.0 |
| tags: |
| - geometric-deep-learning |
| - vision |
| - multi-expert |
| - patchwork |
| - hypersphere |
| - from-scratch |
| --- |
| |
| # GeoLIP ViT Base x3 |
|
|
| Geometric vision system: 3-expert consensus soup + from-scratch ViT encoder. |
|
|
| ## Components |
|
|
| ### 1. Base Tier Soup (teacher) |
|
|
| 800K parameter geometric fusion of 3 pretrained vision experts on a 128-d hypersphere. |
|
|
| | Expert | Architecture | Training | Dim | |
| |--------|-------------|----------|-----| |
| | clip_l14_openai | ViT-L/14 | Text-supervised (CLIP) | 768 | |
| dinov2_b14 | ViT-B/14 | Self-supervised (DINOv2) | 768 | |
| | siglip_b16_384 | ViT-B/16 | Sigmoid contrastive (SigLIP) | 768 | |
| |
| **Pipeline:** GPA alignment at 768-d → PCA to 128-d → per-expert whitened Procrustes calibration → Procrustes-initialized projectors → geometric autograd training. |
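
The first pipeline step can be illustrated with a minimal NumPy sketch of generalized Procrustes alignment (GPA). This is not the training script's code: `procrustes_rotation`, `gpa_align`, the toy 16-d data, and the fixed iteration count are all assumptions made for illustration, and the PCA and whitening steps are left out.

```python
import numpy as np

def procrustes_rotation(source, target):
    """Orthogonal matrix R minimising ||source @ R - target||_F."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

def gpa_align(feature_sets, n_iter=10):
    """Generalized Procrustes: rotate every set toward the running mean shape."""
    aligned = []
    for x in feature_sets:
        x = x - x.mean(axis=0, keepdims=True)    # centre each expert
        aligned.append(x / np.linalg.norm(x))    # unit Frobenius norm
    reference = aligned[0]
    for _ in range(n_iter):
        aligned = [x @ procrustes_rotation(x, reference) for x in aligned]
        reference = np.mean(aligned, axis=0)     # updated consensus shape
    return aligned, reference

# Toy stand-in for three experts: rotated, noisy copies of one point cloud.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 16))
experts = []
for _ in range(3):
    q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
    experts.append(base @ q + 0.01 * rng.normal(size=base.shape))

aligned, reference = gpa_align(experts)
pairwise_gap = max(np.linalg.norm(aligned[0] - a) for a in aligned[1:])
```

After alignment the three toy "experts" nearly coincide, which is the property the later PCA and calibration steps rely on.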
| |
| | Metric | Value | |
| |--------|-------| |
| | mAP (COCO) | 0.837 | |
| | Parameters | 799,952 | |
| Anchors | 256 × 128-d | |
| | Consensus CV (768-d) | 0.2793 | |
| | Consensus CV (128-d) | 0.2731 | |
| | Optimizer | Adam, no weight decay | |
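
The card does not define the CV metric. The sketch below assumes it is the coefficient of variation (std / mean) of the volumes of randomly sampled pentachora (5-point simplices) in embedding space; that reading, the sample count, and both function names are assumptions for illustration only.

```python
import math
import numpy as np

def simplex_volume(points):
    """Volume of the simplex spanned by k+1 points, via the Gram determinant."""
    edges = points[1:] - points[0]     # (k, d) edge vectors from vertex 0
    gram = edges @ edges.T             # (k, k) Gram matrix
    k = edges.shape[0]
    return math.sqrt(max(np.linalg.det(gram), 0.0)) / math.factorial(k)

def pentachoron_cv(embeddings, n_samples=500, seed=0):
    """Coefficient of variation of volumes of random 5-point simplices (assumed definition)."""
    rng = np.random.default_rng(seed)
    volumes = []
    for _ in range(n_samples):
        idx = rng.choice(len(embeddings), size=5, replace=False)
        volumes.append(simplex_volume(embeddings[idx]))
    volumes = np.asarray(volumes)
    return float(volumes.std() / volumes.mean())

# Random unit vectors stand in for real 128-d embeddings.
rng = np.random.default_rng(1)
points = rng.normal(size=(1000, 128))
points /= np.linalg.norm(points, axis=1, keepdims=True)
cv = pentachoron_cv(points)
```

Under this reading, a CV target such as 0.2731 would be measured once on the consensus embeddings and then used as the value the student's embeddings are pushed toward.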
| |
| ### 2. From-Scratch ViT Encoder (student) |
| |
| 11M parameter ViT trained from Xavier initialization against the soup's consensus targets. No pretrained weights anywhere. Same architecture pattern as CaptionBERT. |
| |
| | Config | Value | |
| |--------|-------| |
| | Layers | 6 | |
| | Hidden dim | 384 | |
| | Heads | 6 | |
| | FFN dim | 1536 | |
| | Patch size | 16 | |
| | Image size | 224 | |
| | Output dim | 128 (on hypersphere) | |
| | Parameters | 11,216,768 | |
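
The configuration above accounts for most of the parameter budget. The count below assumes a textbook ViT layout (biased linear layers, two LayerNorms per block, learned position embeddings, one CLS token, a final LayerNorm, and a single linear head to 128-d). That layout and the function name are assumptions, not taken from the training script.

```python
def vit_param_count(layers=6, dim=384, ffn=1536, patch=16, image=224,
                    channels=3, out_dim=128):
    """Parameter count for a textbook ViT encoder with a single linear head."""
    n_patches = (image // patch) ** 2
    patch_embed = channels * patch * patch * dim + dim
    tokens = dim + (n_patches + 1) * dim                        # CLS token + position embeddings
    attention = (dim * 3 * dim + 3 * dim) + (dim * dim + dim)   # QKV + output projection
    mlp = (dim * ffn + ffn) + (ffn * dim + dim)
    norms = 2 * (2 * dim)                                       # two LayerNorms per block
    block = attention + mlp + norms
    final_norm = 2 * dim
    head = dim * out_dim + out_dim
    total = patch_embed + tokens + layers * block + final_norm + head
    return n_patches, total

n_patches, total = vit_param_count()
print(n_patches, f"{total:,}")
```

Under these assumptions the count is 11,068,160 with 196 patches. That is 148,608 short of the reported 11,216,768, so the actual encoder presumably has additional projection or normalization layers that the table does not list.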
| |
| **Training:** Raw COCO images → encoder → 128-d embedding → frozen soup pipeline (constellation + patchwork + classifier) → BCE loss. Additional losses: InfoNCE + MSE against consensus targets, whitened Procrustes alignment, pentachoron CV (calibrated to measured consensus). |
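
The InfoNCE and MSE terms against the consensus targets can be sketched as follows. The temperature of 0.07, the unweighted sum, and the helper names are assumptions. The BCE, CV, and Procrustes terms are omitted, and the real training code presumably uses PyTorch rather than NumPy.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce(student, target, temperature=0.07):
    """Cross-entropy over cosine similarities; row i should match target i."""
    logits = l2_normalize(student) @ l2_normalize(target).T / temperature
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -np.mean(np.diag(log_probs))
    accuracy = np.mean(logits.argmax(axis=1) == np.arange(len(logits)))
    return float(loss), float(accuracy)

def mse(student, target):
    """Mean squared error between unit-normalised embeddings."""
    return float(np.mean((l2_normalize(student) - l2_normalize(target)) ** 2))

rng = np.random.default_rng(0)
targets = l2_normalize(rng.normal(size=(48, 128)))           # stand-in consensus targets
student = targets + 0.05 * rng.normal(size=targets.shape)    # stand-in encoder outputs
nce_loss, nce_acc = info_nce(student, targets)
total = nce_loss + mse(student, targets)
```

In this sketch `nce_acc` is the in-batch fraction of embeddings whose most similar target is their own, which is one plausible reading of the `nce_acc` column reported below.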
| |
| #### Results (20 epochs, still converging) |
| |
| | Metric | E1 | E10 | E20 | |
| |--------|-----|------|------| |
| | nce_acc | 0.340 | 0.887 | 0.972 | |
| cos→consensus | 0.325 | 0.557 | 0.599 | |
| | R@1 (5K) | 0.032 | 0.254 | 0.323 | |
| | mAP | 0.151 | 0.380 | 0.429 | |
| | F1 | 0.162 | 0.361 | 0.418 | |
| | Active anchors | 95 | 96 | 94 | |
|
|
| All metrics are still climbing at E20. The model likely needs 60-90 epochs to converge fully, matching the trajectory of CaptionBERT's text encoder. |
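
For reference, here is a minimal sketch of how a multi-label mAP like the one in the table can be computed. It assumes mAP means the per-class average precision over COCO categories, averaged across classes that have at least one positive; the evaluation code may differ in details such as tie handling.

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: mean precision at the rank of each positive."""
    order = np.argsort(-scores)
    hits = labels[order].astype(float)
    precision_at_k = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

def mean_average_precision(scores, labels):
    """Average AP over classes that have at least one positive label."""
    classes = [c for c in range(labels.shape[1]) if labels[:, c].any()]
    return float(np.mean([average_precision(scores[:, c], labels[:, c])
                          for c in classes]))

# Tiny example: 3 images, 2 classes.
scores = np.array([[0.9, 0.2], [0.8, 0.7], [0.1, 0.4]])
labels = np.array([[1, 0], [0, 1], [1, 0]])
value = mean_average_precision(scores, labels)
```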
|
|
| ## Architecture |
| ``` |
| Training (soup as teacher): |
| 3 expert features → Procrustes projectors → mean → L2-norm → 128-d consensus targets |
| Raw images → from-scratch ViT → 128-d embedding |
| Losses: InfoNCE + MSE + CV + BCE(through frozen soup) + Procrustes alignment |
| Geometric autograd: tangential=0.01, separation=1.0 |
| |
| Inference (standalone): |
| Raw image → ViT encoder → 128-d embedding (on hypersphere) |
| No experts needed. Geometry is baked in. |
| ``` |
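
The consensus-target line of the diagram (project, average, L2-normalize) can be sketched in a few lines. The projectors are treated here as plain 768→128 linear maps with random stand-in weights, which is an assumption; the trained soup's projectors are Procrustes-initialized and may include more than a single matrix.

```python
import numpy as np

def consensus_targets(expert_features, projectors):
    """Project each expert to the shared space, average, and L2-normalise."""
    projected = [x @ w for x, w in zip(expert_features, projectors)]
    mean = np.mean(projected, axis=0)
    return mean / np.linalg.norm(mean, axis=1, keepdims=True)

rng = np.random.default_rng(0)
features = [rng.normal(size=(8, 768)) for _ in range(3)]            # stand-in expert features
projectors = [rng.normal(size=(768, 128)) / np.sqrt(768)            # stand-in projector weights
              for _ in range(3)]
targets = consensus_targets(features, projectors)
```

Each row of `targets` lies on the unit hypersphere in 128-d, which is the space the student encoder is trained to reproduce.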
|
|
| ## Key Findings |
|
|
| - The 800K-parameter soup (0.837 mAP) beats both an 81.7M-parameter 34-expert soup (0.732 mAP) and a 75.6M-parameter 34-expert bank (0.782 mAP) |
| - Proper calibration (GPA + whitened Procrustes + measured CV target) is essential; without it, the constellation collapses to 1/256 active anchors |
| - The from-scratch ViT learns the 3-expert consensus representation from raw pixels with the same convergence dynamics as CaptionBERT on text |
| - Cross-model weight cosine is 0.000 but activation Procrustes is 0.999: the models encode near-identical geometry through completely different weight configurations |
|
|
| ## Files |
|
|
| - `base_tier_soup_calibrated.pt`: Trained soup (teacher) |
| - `geolip_vit_encoder_e20.pt`: ViT encoder at epoch 20 |
| - `base_tier_soup_calibrated.py`: Soup training script |
| - `vit_encoder_from_scratch.py`: Encoder training script |
| - `runs/`: Tensorboard logs |
|
|
| ## Data |
|
|
| - Training features: [AbstractPhil/bulk-coco-features](https://huggingface.co/datasets/AbstractPhil/bulk-coco-features) |
| - Images: COCO 2017 (118K train, 5K val) |
|
|
| ## Usage |
| ```python |
| import torch |
| |
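| # Load the encoder checkpoint on CPU. weights_only=False unpickles arbitrary |
| # Python objects, so only load checkpoints from a source you trust. |
| ckpt = torch.load("geolip_vit_encoder_e20.pt", map_location="cpu", weights_only=False) |
| state_dict = ckpt["encoder_state_dict"]  # model weights |
| config = ckpt["config"]  # architecture config |
| metrics = {k: ckpt[k] for k in ("mAP", "cos", "r1")}  # validation metrics |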
| ``` |
|
|
|
|
|
|
| # Experiment 2.5 Update: COCO convergence is slow but steady. |
|
|
| BCE loss isn't the best catalyst for geometry but it does work to funnel through an aligned transformer. |
|
|
| I underestimated the complexity of associative cross-modal differences, but it is converging. A shared space is a very tricky |
| thing to teach as an associative connection. Routing is easy; distilling is harder with multimodal structures and multiple |
| adjacent representations used as loss targets. |
|
|
| If this fails to meet direct expectations, I'll form a proper hub and teach using the bertenstein method. Bertenstein works because it's |
| always expecting to hear from the experts and there is always one anchored expert in charge. |
|
|
| The expert/student distillation process requires skilled teachers with similar utility, which is different than simply funneling |
| information through a route and pooling it. |
|
|
| geolip-captionbert-8192 accepts this pooled, funneled information and produces useful output because the shared |
| expert information has similar access utility. |
|
|
| In either case, geolip-captionbert-8192 was trained from scratch and so is this model. Neither inherits weights from any large-scale training run; |
| they inherit geometry and structure through distillation, in order to represent complex structure that quite simply should not exist |
| in a model this small through direct implicit learning. |
|
|
| geolip-vit-x3 must learn to predict the pixel data using the output of the experts as markers for loss, which means it can never get a full picture of anything outside of its own tools. |
|
|
| This model is exceptionally small, absurdly small even by ViT standards, because even this size is more than enough. The model cannot |
| overfit while it has to use every tool at its disposal; it will train indefinitely unless a cascade overflow happens, a math continuity corruption |
| occurs, or the substructure collapses to a simpler shortcut-centric behavior that would require scrambling. |
|
|
| The anchors are strong enough and tuned to the experts, the external losses are tuned to teach the expert responses, the expert data is used |
| as loss methods of attenuation, and the structure conforms to those losses specifically because it's required to teach the model to be |
| standalone and compliant without requiring the experts later. |
|
|
| I gave the model everything I could geometrically, and it must discover the way to connect them. |
|
|
| I'm teaching SigLIP, DINOv2 and CLIP ViT to communicate on the same manifold. They are essentially speaking three dialects of a foreign |
| offshoot of Latin that evolved a thousand years later. |
|
|
| The fact that this works at all is a testament to the hypersphere attenuation. |
|
|
| ``` |
| ================================================================= |
| GEOLIP VISION ENCODER - FROM SCRATCH |
| ViT: 6L/384d/6h, patch16 |
| 196 patches + CLS → 128-d output |
| Device: cuda |
| ================================================================= |
|
|
| Loading soup... |
| Soup: mAP=0.837 CV_target=0.2731 |
| train: loaded cached targets (118,287) |
| val: loaded cached targets (5,000) |
| Caching train images (118,287)... |
| |
| ================================================================= |
| BUILD ENCODER |
| ================================================================= |
| Architecture: 6L/384d/6h, patch16 |
| Input: 224×224 → 196 patches |
| Output: 128-d (on hypersphere) |
| Parameters: 11,216,768 |
| |
| ================================================================= |
| TRAINING |
| 20 epochs, lr=0.0003, batch=48 |
| Losses: InfoNCE + MSE + CV + BCE + Procrustes alignment |
| CV target: 0.2731 |
| Images: train=118,287 val=5,000 (cached as tensors) |
| ================================================================= |
| E 1/20 train: 100%|██████████| 2465/2465 [02:44<00:00, 14.97batch/s, cos=0.258, loss=2.6911, nce_acc=0.339, ordered=1] |
| E1 train: 165s loss=2.6891 nce=2.2529 mse=0.0120 bce=0.1963 nce_acc=0.340 |
| E1 val: mAP=0.151 F1=0.162 R@1=0.032 cos=0.325 cv=0.2663 anchors=95/256 seen=5000/5000 ★ |
| E 2/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.32batch/s, cos=0.368, loss=1.7954, nce_acc=0.553, ordered=1] |
| E2 train: 161s loss=1.7948 nce=1.4297 mse=0.0099 bce=0.1473 nce_acc=0.553 |
| E2 val: mAP=0.206 F1=0.197 R@1=0.062 cos=0.390 cv=0.2552 anchors=99/256 seen=5000/5000 ★ |
| E 3/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.37batch/s, cos=0.416, loss=1.4860, nce_acc=0.641, ordered=1] |
| E3 train: 160s loss=1.4854 nce=1.1484 mse=0.0092 bce=0.1338 nce_acc=0.641 |
| E3 val: mAP=0.246 F1=0.244 R@1=0.091 cos=0.427 cv=0.2234 anchors=98/256 seen=5000/5000 ★ |
| E 4/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.40batch/s, cos=0.448, loss=1.2913, nce_acc=0.695, ordered=1] |
| E4 train: 160s loss=1.2910 nce=0.9727 mse=0.0087 bce=0.1265 nce_acc=0.695 |
| E4 val: mAP=0.272 F1=0.266 R@1=0.113 cos=0.453 cv=0.2078 anchors=99/256 seen=5000/5000 ★ |
| E 5/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.40batch/s, cos=0.475, loss=1.1334, nce_acc=0.743, ordered=1] |
| E5 train: 160s loss=1.1331 nce=0.8303 mse=0.0083 bce=0.1205 nce_acc=0.743 |
| E5 val: mAP=0.296 F1=0.292 R@1=0.139 cos=0.473 cv=0.2133 anchors=98/256 seen=5000/5000 ★ |
| E 6/20 train: 100%|██████████| 2465/2465 [02:37<00:00, 15.63batch/s, cos=0.499, loss=1.0005, nce_acc=0.784, ordered=1] |
| E6 train: 158s loss=1.0003 nce=0.7111 mse=0.0079 bce=0.1158 nce_acc=0.784 |
| E6 val: mAP=0.317 F1=0.311 R@1=0.164 cos=0.495 cv=0.1835 anchors=98/256 seen=5000/5000 ★ |
| E 7/20 train: 100%|██████████| 2465/2465 [02:38<00:00, 15.60batch/s, cos=0.520, loss=0.8947, nce_acc=0.815, ordered=1] |
| E7 train: 158s loss=0.8943 nce=0.6172 mse=0.0075 bce=0.1115 nce_acc=0.815 |
| E7 val: mAP=0.337 F1=0.335 R@1=0.190 cos=0.513 cv=0.1809 anchors=96/256 seen=5000/5000 ★ |
| E 8/20 train: 100%|██████████| 2465/2465 [02:38<00:00, 15.59batch/s, cos=0.539, loss=0.8030, nce_acc=0.842, ordered=1] |
| E8 train: 158s loss=0.8028 nce=0.5365 mse=0.0072 bce=0.1076 nce_acc=0.843 |
| E8 val: mAP=0.344 F1=0.331 R@1=0.207 cos=0.523 cv=0.1779 anchors=95/256 seen=5000/5000 ★ |
| E 9/20 train: 100%|██████████| 2465/2465 [02:38<00:00, 15.58batch/s, cos=0.557, loss=0.7229, nce_acc=0.866, ordered=1] |
| E9 train: 158s loss=0.7228 nce=0.4665 mse=0.0070 bce=0.1041 nce_acc=0.866 |
| E9 val: mAP=0.361 F1=0.349 R@1=0.218 cos=0.537 cv=0.1764 anchors=95/256 seen=5000/5000 ★ |
| E10/20 train: 36%|███▌      | 892/2465 [00:57<01:40, 15.69batch/s, cos=0.572, loss=0.6548, nce_acc=0.887, ordered=1] |
| ``` |
| |
| |
| # Experiment 2.5: |
| The Xavier-aligned, Procrustes embedding array attached to a standard patch16 subset should suffice. |
| |
| I'll be training this like CaptionBERT but with a twist: the soup expert is the alignment bank for this one, and I trained it first instead of later. |
| |
| The alignment and R@1 are nearly perfect, so it should be cohesive enough through the chain of conceptualization to coalesce through the implications. |
| |
| Whether the actual patches will learn from the embedding and encoding spectrum, and how quickly I can make them learn, is another story. |
| |
| The output this encoder produces is a 128 dimensional enriched representational lookup plane on a hypersphere. |
| This is more than enough information to house access to any data route that exists. |
| |
| The dimensional spectrum of a 5d object is so expansive and so enriched, that the entire spectrum of this shape requires a specific |
| curation of the behavior. This is what most of the mechanisms are tasked with overall, pruning the effect of rigidity indifference preservation |
| on the hypersphere represented structure. |
| |
| In other words, those 128 dimensions represent more information than I could express with words. |
| |
| # Experiment 2: |
| 95/256 anchors survive, emergent geometric structure formed. |
| |
| R@1= 97.1%, not quite but getting there. Experiment 2 was successful enough to push harder in this direction. |
| |
| Anchor collapse says it doesn't need all those anchors. It started grabbing at more by the end, which means |
| the system was aligned and then started growing further on a constraint that I was unaware of. |
| |
| This drift curve needs to be controlled. Direct anchored emergence while training is risky. The bank itself |
| survived so well because it was anchored post training, which gave added cohesion and complexity association |
| that I have yet to discover the runtime process to train. I will be analyzing the emergence to preserve the anchoring. |
| |
| ``` |
| ================================================================= |
| PHASE 5: TRAINING |
| 20 epochs, lr=0.001, CV target=0.2731 |
| ================================================================= |
| E 1: mAP=0.788 F1=0.731 R@1=0.971 cos=0.806 cv=0.1213 anchors=226/256 nce=0.999 loss=0.1676 ★ |
| E 2: mAP=0.803 F1=0.742 R@1=0.971 cos=0.809 cv=0.1178 anchors=200/256 nce=0.999 loss=0.1459 ★ |
| E 3: mAP=0.810 F1=0.735 R@1=0.973 cos=0.808 cv=0.1197 anchors=161/256 nce=0.999 loss=0.1431 ★ |
| E 4: mAP=0.817 F1=0.752 R@1=0.971 cos=0.811 cv=0.1262 anchors=131/256 nce=0.999 loss=0.1404 ★ |
| E 5: mAP=0.823 F1=0.755 R@1=0.971 cos=0.812 cv=0.1232 anchors=113/256 nce=0.999 loss=0.1389 ★ |
| E 6: mAP=0.825 F1=0.755 R@1=0.972 cos=0.815 cv=0.1105 anchors=104/256 nce=0.999 loss=0.1379 ★ |
| E 7: mAP=0.827 F1=0.767 R@1=0.970 cos=0.814 cv=0.1125 anchors=101/256 nce=0.999 loss=0.1369 ★ |
| E 8: mAP=0.829 F1=0.763 R@1=0.971 cos=0.815 cv=0.1239 anchors=99/256 nce=0.999 loss=0.1361 ★ |
| E 9: mAP=0.832 F1=0.764 R@1=0.972 cos=0.815 cv=0.1164 anchors=98/256 nce=0.999 loss=0.1355 ★ |
| E10: mAP=0.833 F1=0.765 R@1=0.968 cos=0.814 cv=0.1166 anchors=99/256 nce=0.999 loss=0.1345 ★ |
| E11: mAP=0.834 F1=0.763 R@1=0.971 cos=0.814 cv=0.1214 anchors=98/256 nce=0.999 loss=0.1346 ★ |
| E12: mAP=0.833 F1=0.764 R@1=0.973 cos=0.813 cv=0.1200 anchors=95/256 nce=0.999 loss=0.1343 |
| E13: mAP=0.836 F1=0.761 R@1=0.972 cos=0.813 cv=0.1081 anchors=94/256 nce=0.999 loss=0.1338 ★ |
| E14: mAP=0.836 F1=0.772 R@1=0.973 cos=0.812 cv=0.1170 anchors=95/256 nce=0.999 loss=0.1334 |
| E15: mAP=0.835 F1=0.774 R@1=0.970 cos=0.812 cv=0.1223 anchors=95/256 nce=0.999 loss=0.1338 |
| E16: mAP=0.837 F1=0.777 R@1=0.968 cos=0.812 cv=0.1225 anchors=96/256 nce=1.000 loss=0.1339 ★ |
| E17: mAP=0.834 F1=0.772 R@1=0.973 cos=0.811 cv=0.1089 anchors=95/256 nce=0.999 loss=0.1327 |
| E18: mAP=0.834 F1=0.770 R@1=0.973 cos=0.812 cv=0.1156 anchors=95/256 nce=0.999 loss=0.1321 |
| E19: mAP=0.834 F1=0.773 R@1=0.970 cos=0.811 cv=0.1224 anchors=96/256 nce=0.999 loss=0.1328 |
| E20: mAP=0.835 F1=0.770 R@1=0.971 cos=0.812 cv=0.1159 anchors=96/256 nce=0.999 loss=0.1328 |
|
|
| Best mAP: 0.837 |
| CV target: 0.2731 |
| ``` |
| |
| |
| # Experiment 1: |
| |
| Total collapse. The three models did not conform and the patchwork did not learn. The objectives are not correct. |
| |
| One anchor was defaulted to; none of the others were used. The memory bank solves this problem through queue assessment with the InfoNCE hub processing, |
| but this model uses a different form of anchoring that did not work. |
| |
| THE ENTIRE MODEL became the anchor, instead of the anchor points within the model. I'm thinking there wasn't enough scattering, so I'll try some additional tweaks. |
| |
| ## Post |
| ``` |
| Active anchors: 1/256 (0.4%) |
| Every single image → anchor 65 |
| Anchor entropy: 0.0000 |
| Anchors within cos>0.5 per image: 1.0 |
| Nearest anchor dist: 0.016 → next nearest: 0.665 |
|
|
| Effective dim: 23.6/128 |
| Top-20 SVs explain 99.2% |
| Self-sim off-diag: 0.969 |
|
|
| Expert uniqueness: 0.0008–0.0011 |
| ``` |
| There is only one active anchor, which is essentially CLS. The uniqueness collapsed. The distance is fine, the entropy is dead. |
| |
| This is a shortcut bypass; additional nonlinearity must be added. |
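
Diagnostics like the ones in the Post block can be reproduced with a short sketch. It assumes anchors are assigned by maximum cosine similarity and that "effective dim" is the participation ratio of the squared singular values. The card does not state either definition, so treat both, and the function names, as assumptions.

```python
import numpy as np

def anchor_usage(embeddings, anchors):
    """Assign each embedding to its most similar anchor; return (active count, entropy)."""
    assignment = (embeddings @ anchors.T).argmax(axis=1)
    counts = np.bincount(assignment, minlength=len(anchors))
    probs = counts[counts > 0] / counts.sum()
    entropy = float(-(probs * np.log(probs)).sum())
    return int((counts > 0).sum()), entropy

def effective_dim(embeddings):
    """Participation ratio of the squared singular values (one common definition)."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    s2 = np.linalg.svd(centered, compute_uv=False) ** 2
    return float(s2.sum() ** 2 / (s2 ** 2).sum())

rng = np.random.default_rng(0)
anchors = rng.normal(size=(256, 128))
anchors /= np.linalg.norm(anchors, axis=1, keepdims=True)

# Collapsed case: every embedding sits next to a single anchor.
collapsed = anchors[65] + 0.01 * rng.normal(size=(2000, 128))
collapsed /= np.linalg.norm(collapsed, axis=1, keepdims=True)
collapsed_active, collapsed_entropy = anchor_usage(collapsed, anchors)

# Healthy case: embeddings spread across all anchors.
spread = anchors[rng.integers(0, 256, size=2000)]
spread_active, spread_entropy = anchor_usage(spread, anchors)
```

The collapsed case reports one active anchor and zero entropy, the same signature as the failed run, while the spread case uses nearly all 256 anchors.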
| |
| ## Assessment |
| Without the centered Procrustes loss the same result happened. The collapse forms around one of the earlier anchors, around the outside midpoint of |
| where all three models are simultaneously rotating around a point, which is not the direct center. |
| |
| This point has noise, invalidity, incorrect association, and additional problems based on the attention mechanisms internal to the models queried. |
| |
| ## Hypothesis based on research |
| The Procrustes alignment must align center-wise (each expert centered before it is rotated), and it must be specified precisely. |
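
A toy example shows why centering matters for this hypothesis. It compares orthogonal Procrustes with and without mean subtraction on a rotated point cloud that also carries a constant offset. The data, the offset, and the function name are illustrative assumptions, not the experiment's actual setup.

```python
import numpy as np

def procrustes(source, target, center=True):
    """Orthogonal Procrustes; returns the rotation and the relative residual."""
    if center:
        source = source - source.mean(axis=0, keepdims=True)
        target = target - target.mean(axis=0, keepdims=True)
    u, _, vt = np.linalg.svd(source.T @ target)
    rotation = u @ vt
    residual = np.linalg.norm(source @ rotation - target) / np.linalg.norm(target)
    return rotation, float(residual)

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 32))
true_rotation, _ = np.linalg.qr(rng.normal(size=(32, 32)))
y = x @ true_rotation + 3.0      # rotated copy with a constant offset

_, centered_residual = procrustes(x, y, center=True)
_, raw_residual = procrustes(x, y, center=False)
```

With centering the rotation is recovered almost exactly. Without it, the solver tries to explain the offset with a rotation and leaves a large residual, which is consistent with the off-center collapse point described in the assessment.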