	---
	license: apache-2.0
	tags:
	- geometric-deep-learning
	- vision
	- multi-expert
	- patchwork
	- hypersphere
	- from-scratch
	---

	# GeoLIP ViT Base x3

	Geometric vision system: 3-expert consensus soup + from-scratch ViT encoder.

	## Components

	### 1. Base Tier Soup (teacher)

	An 800K-parameter geometric fusion of 3 pretrained vision experts on a 128-d hypersphere.

	| Expert | Architecture | Training | Dim |
	|--------|--------------|----------|-----|
	| clip_l14_openai | ViT-L/14 | Text-supervised (CLIP) | 768 |
	| dinov2_b14 | ViT-B/14 | Self-supervised (DINO) | 768 |
	| siglip_b16_384 | ViT-B/16 | Sigmoid contrastive (SigLIP) | 768 |

	Pipeline: GPA alignment at 768-d → PCA to 128-d → per-expert whitened Procrustes calibration → Procrustes-initialized projectors → geometric autograd training.
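The Procrustes step in that pipeline can be sketched roughly as below. This is a minimal NumPy illustration assuming plain orthogonal Procrustes on mean-centered features; the GPA and whitening stages are left out, and the function name `orthogonal_procrustes` and the toy data are hypothetical, not taken from the training script.

```python
import numpy as np

def orthogonal_procrustes(source, target):
    """Return the orthogonal map R minimizing ||src_centered @ R - tgt_centered||_F."""
    src = source - source.mean(axis=0, keepdims=True)
    tgt = target - target.mean(axis=0, keepdims=True)
    # The SVD of the cross-covariance yields the optimal orthogonal map.
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 8))               # toy features from one "expert"
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))    # hidden orthogonal transform
rotated = feats @ q                             # same geometry, different basis

R = orthogonal_procrustes(feats, rotated)
aligned = (feats - feats.mean(axis=0)) @ R
residual = np.abs(aligned - (rotated - rotated.mean(axis=0))).max()
print(f"max residual after alignment: {residual:.2e}")
```

When two feature sets differ only by an orthogonal transform, the recovered `R` maps one onto the other up to floating-point error. Real expert features will leave a nonzero residual.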

	| Metric | Value |
	|--------|-------|
	| mAP (COCO) | 0.837 |
	| Parameters | 799,952 |
	| Anchors | 256 × 128-d |
	| Consensus CV (768-d) | 0.2793 |
	| Consensus CV (128-d) | 0.2731 |
	| Optimizer | Adam, no weight decay |

	### 2. From-Scratch ViT Encoder (student)

	An 11M-parameter ViT trained from Xavier initialization against the soup's consensus targets. The encoder itself loads no pretrained weights; the pretrained experts only supply the targets. Same architecture pattern as CaptionBERT.

	| Config | Value |
	|--------|-------|
	| Layers | 6 |
	| Hidden dim | 384 |
	| Heads | 6 |
	| FFN dim | 1536 |
	| Patch size | 16 |
	| Image size | 224 |
	| Output dim | 128 (on hypersphere) |
	| Parameters | 11,216,768 |
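As a sanity check on the config, the parameter count can be reproduced by hand. The sketch below assumes a standard pre-norm ViT layout (biased linear layers, learned CLS token and position embeddings, a final LayerNorm) plus a guessed output head of LayerNorm → Linear(384, 384) → Linear(384, 128). The head layout is an assumption, chosen only because it makes the arithmetic land on the reported total; the actual script may be organized differently.

```python
def vit_param_count(layers=6, dim=384, ffn=1536, patch=16, image=224,
                    channels=3, out_dim=128):
    tokens = (image // patch) ** 2 + 1                 # 196 patches + CLS
    patch_embed = channels * patch * patch * dim + dim
    cls_and_pos = dim + tokens * dim
    attn = (dim * 3 * dim + 3 * dim) + (dim * dim + dim)   # QKV + output projection
    mlp = (dim * ffn + ffn) + (ffn * dim + dim)
    norms = 2 * (2 * dim)                              # two LayerNorms per block
    block = attn + mlp + norms
    final_norm = 2 * dim
    # Assumed head: LayerNorm -> Linear(dim, dim) -> Linear(dim, out_dim)
    head = 2 * dim + (dim * dim + dim) + (dim * out_dim + out_dim)
    return patch_embed + cls_and_pos + layers * block + final_norm + head

total = vit_param_count()
print(f"{total:,}")  # 11,216,768 under the assumptions above
```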

	Training: Raw COCO images → encoder → 128-d embedding → frozen soup pipeline (constellation + patchwork + classifier) → BCE loss. Additional losses: InfoNCE + MSE against consensus targets, whitened Procrustes alignment, pentachoron CV (calibrated to measured consensus).
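The InfoNCE and MSE terms against the consensus targets might look roughly like this. It is a NumPy sketch rather than the training code: the temperature of 0.07, the use of in-batch negatives, and the function names are all assumptions for illustration.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def infonce_and_mse(student, targets, temperature=0.07):
    """InfoNCE over in-batch negatives plus MSE; both inputs are (batch, dim)."""
    student, targets = l2_normalize(student), l2_normalize(targets)
    logits = student @ targets.T / temperature            # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nce = -np.mean(np.diag(log_probs))                    # positives sit on the diagonal
    nce_acc = np.mean(logits.argmax(axis=1) == np.arange(len(logits)))
    mse = np.mean((student - targets) ** 2)
    return nce, nce_acc, mse

rng = np.random.default_rng(0)
targets = rng.normal(size=(16, 128))                      # stand-in consensus targets
student = targets + 0.1 * rng.normal(size=(16, 128))      # a nearly converged student
nce, nce_acc, mse = infonce_and_mse(student, targets)
print(f"nce={nce:.4f} nce_acc={nce_acc:.3f} mse={mse:.6f}")
```

Here `nce_acc` is the fraction of rows whose best-matching target is their own, which is how I read the `nce_acc` column in the results.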

	#### Results (20 epochs, still converging)

	| Metric | E1 | E10 | E20 |
	|--------|-----|------|------|
	| nce_acc | 0.340 | 0.887 | 0.972 |
	| cos→consensus | 0.325 | 0.557 | 0.599 |
	| R@1 (5K) | 0.032 | 0.254 | 0.323 |
	| mAP | 0.151 | 0.380 | 0.429 |
	| F1 | 0.162 | 0.361 | 0.418 |
	| Active anchors | 95 | 96 | 94 |

	All metrics are still climbing at E20. The model likely needs 60-90 epochs to fully converge, judging by the trajectory of CaptionBERT's text encoder.
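For reference, a retrieval metric like R@1 can be computed as below. This sketch assumes "R@1 (5K)" means each of the 5,000 validation embeddings must retrieve its own consensus target as the nearest neighbour; the README does not spell this out, and the toy vectors are hypothetical.

```python
def recall_at_1(queries, gallery):
    """Fraction of queries whose highest-dot-product gallery row shares their index."""
    hits = 0
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, g)) for g in gallery]
        if max(range(len(gallery)), key=scores.__getitem__) == i:
            hits += 1
    return hits / len(queries)

# Toy unit vectors: the third query is closer to the wrong gallery row.
gallery = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
queries = [[0.8, 0.6], [0.6, 0.8], [0.0, 1.0]]
r1 = recall_at_1(queries, gallery)
print(f"R@1 = {r1:.3f}")  # 2 of 3 queries retrieve their own row
```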

	## Architecture
	```
	Training (soup as teacher):
	3 expert features → Procrustes projectors → mean → L2-norm → 128-d consensus targets
	Raw images → from-scratch ViT → 128-d embedding
	Losses: InfoNCE + MSE + CV + BCE(through frozen soup) + Procrustes alignment
	Geometric autograd: tangential=0.01, separation=1.0

	Inference (standalone):
	Raw image → ViT encoder → 128-d embedding (on hypersphere)
	No experts needed. Geometry is baked in.
	```
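The "mean → L2-norm" step that produces the consensus targets can be illustrated in a few lines. This is a plain-Python sketch with toy 3-d vectors standing in for the 128-d projections; the helper names are hypothetical.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def consensus_target(projected):
    """Average the per-expert projections, then project back onto the unit sphere."""
    dim = len(projected[0])
    mean = [sum(p[i] for p in projected) / len(projected) for i in range(dim)]
    return l2_normalize(mean)

# Toy 3-d stand-ins for the three experts' projections of one image.
experts = [
    l2_normalize([1.0, 0.1, 0.0]),
    l2_normalize([0.9, 0.0, 0.2]),
    l2_normalize([1.0, -0.1, 0.1]),
]
target = consensus_target(experts)
norm = math.sqrt(sum(t * t for t in target))
print(target, norm)
```

The mean of several unit vectors has norm below 1 unless they all agree, so the re-normalization is what keeps the target on the hypersphere.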

	## Key Findings

	- The 800K-parameter soup (0.837 mAP) beats an 81.7M-parameter 34-expert soup (0.732 mAP) and a 75.6M-parameter 34-expert bank (0.782 mAP)
	- Proper calibration (GPA + whitened Procrustes + measured CV target) is essential: without it, the constellation collapses to 1/256 active anchors
	- From-scratch ViT learns the 3-expert consensus representation from raw pixels with the same convergence dynamics as CaptionBERT on text
	- Cross-model weight cosine is 0.000 but activation Procrustes is 0.999 — the models encode identical geometry through completely different weight configurations

	## Files

	- `base_tier_soup_calibrated.pt` — Trained soup (teacher)
	- `geolip_vit_encoder_e20.pt` — ViT encoder at epoch 20
	- `base_tier_soup_calibrated.py` — Soup training script
	- `vit_encoder_from_scratch.py` — Encoder training script
	- `runs/` — Tensorboard logs

	## Data

	- Training features: [AbstractPhil/bulk-coco-features](https://huggingface.co/datasets/AbstractPhil/bulk-coco-features)
	- Images: COCO 2017 (118K train, 5K val)

	## Usage
	```python
	import torch

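	# Load the encoder checkpoint onto CPU. weights_only=False unpickles
	# arbitrary Python objects, so only load checkpoint files you trust.
	ckpt = torch.load("geolip_vit_encoder_e20.pt", map_location="cpu", weights_only=False)
	# ckpt["encoder_state_dict"]: model weights
	# ckpt["config"]: architecture config
	# ckpt["mAP"], ckpt["cos"], ckpt["r1"]: metrics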
	```



	# Experiment 2.5 Update: COCO convergence is slow but steady.

	BCE loss isn't the best catalyst for geometry, but it does work when funneled through an aligned transformer.

	I underestimated the complexity of associative cross-modal differences, but it is converging. Shared space is a very tricky
	thing to teach as an associative connection. Routing is easy; distilling is harder when multimodal structures and multiple
	adjacent representations are used as loss targets.

	If this falls short of expectations, I'll form a proper hub and teach using the bertenstein method. Bertenstein works because it
	always expects to hear from the experts, and there is always one anchored expert in charge.

	The expert/student distillation process requires skilled teachers with similar utility, which is different from simply funneling
	information through a route and pooling it.

	geolip-captionbert-8192 accepts this pooled, funneled information and produces useful output because the shared
	expert information has similar access utility.

	In either case, geolip-captionbert-8192 was trained from scratch and so is this model. They do not inherit weights from any large pretrained model;
	they inherit geometry and structure through distillation, in order to represent complex structure that quite simply should not emerge
	in a model this small through direct learning alone.

	geolip-vit-x3 must learn to predict from the pixel data using the output of the experts as markers for loss, which means it can never get a full picture of anything outside of its own tools.

	This model is exceptionally small, absurdly small even by ViT standards. That is deliberate: even at this size, it has more capacity than it needs.
	The model cannot overfit as long as it uses every tool at its disposal. It will train indefinitely unless a cascade overflow happens, a numerical
	continuity corruption occurs, or the substructure collapses to a simpler shortcut-centric behavior that would require scrambling.

	The anchors are strong enough and tuned to the experts, the external losses are tuned to teach the expert responses, the expert data is used
	as a loss-based means of attenuation, and the structure conforms to those losses specifically because that is required to teach the model to be
	standalone and compliant without requiring the experts later.

	I gave the model everything I could geometrically, and it must discover the way to connect them.

	I'm teaching SigLIP, DINOv2 and CLIP ViT to communicate on the same manifold. They are essentially speaking three dialects of a
	language that branched off from Latin a thousand years ago.

	The fact that this works at all is a testament to the hypersphere attenuation.

	```
	=================================================================
	GEOLIP VISION ENCODER — FROM SCRATCH
	ViT: 6L/384d/6h, patch16
	196 patches + CLS → 128-d output
	Device: cuda
	=================================================================

	Loading soup...
	Soup: mAP=0.837 CV_target=0.2731
	train: loaded cached targets (118,287)
	val: loaded cached targets (5,000)
	Caching train images (118,287)...

	=================================================================
	BUILD ENCODER
	=================================================================
	Architecture: 6L/384d/6h, patch16
	Input: 224×224 → 196 patches
	Output: 128-d (on hypersphere)
	Parameters: 11,216,768

	=================================================================
	TRAINING
	20 epochs, lr=0.0003, batch=48
	Losses: InfoNCE + MSE + CV + BCE + Procrustes alignment
	CV target: 0.2731
	Images: train=118,287 val=5,000 (cached as tensors)
	=================================================================
	E 1/20 train: 100%|██████████| 2465/2465 [02:44<00:00, 14.97batch/s, cos=0.258, loss=2.6911, nce_acc=0.339, ordered=1]
	E1 train: 165s loss=2.6891 nce=2.2529 mse=0.0120 bce=0.1963 nce_acc=0.340
	E1 val: mAP=0.151 F1=0.162 R@1=0.032 cos=0.325 cv=0.2663 anchors=95/256 seen=5000/5000 ★
	E 2/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.32batch/s, cos=0.368, loss=1.7954, nce_acc=0.553, ordered=1]
	E2 train: 161s loss=1.7948 nce=1.4297 mse=0.0099 bce=0.1473 nce_acc=0.553
	E2 val: mAP=0.206 F1=0.197 R@1=0.062 cos=0.390 cv=0.2552 anchors=99/256 seen=5000/5000 ★
	E 3/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.37batch/s, cos=0.416, loss=1.4860, nce_acc=0.641, ordered=1]
	E3 train: 160s loss=1.4854 nce=1.1484 mse=0.0092 bce=0.1338 nce_acc=0.641
	E3 val: mAP=0.246 F1=0.244 R@1=0.091 cos=0.427 cv=0.2234 anchors=98/256 seen=5000/5000 ★
	E 4/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.40batch/s, cos=0.448, loss=1.2913, nce_acc=0.695, ordered=1]
	E4 train: 160s loss=1.2910 nce=0.9727 mse=0.0087 bce=0.1265 nce_acc=0.695
	E4 val: mAP=0.272 F1=0.266 R@1=0.113 cos=0.453 cv=0.2078 anchors=99/256 seen=5000/5000 ★
	E 5/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.40batch/s, cos=0.475, loss=1.1334, nce_acc=0.743, ordered=1]
	E5 train: 160s loss=1.1331 nce=0.8303 mse=0.0083 bce=0.1205 nce_acc=0.743
	E5 val: mAP=0.296 F1=0.292 R@1=0.139 cos=0.473 cv=0.2133 anchors=98/256 seen=5000/5000 ★
	E 6/20 train: 100%|██████████| 2465/2465 [02:37<00:00, 15.63batch/s, cos=0.499, loss=1.0005, nce_acc=0.784, ordered=1]
	E6 train: 158s loss=1.0003 nce=0.7111 mse=0.0079 bce=0.1158 nce_acc=0.784
	E6 val: mAP=0.317 F1=0.311 R@1=0.164 cos=0.495 cv=0.1835 anchors=98/256 seen=5000/5000 ★
	E 7/20 train: 100%|██████████| 2465/2465 [02:38<00:00, 15.60batch/s, cos=0.520, loss=0.8947, nce_acc=0.815, ordered=1]
	E7 train: 158s loss=0.8943 nce=0.6172 mse=0.0075 bce=0.1115 nce_acc=0.815
	E7 val: mAP=0.337 F1=0.335 R@1=0.190 cos=0.513 cv=0.1809 anchors=96/256 seen=5000/5000 ★
	E 8/20 train: 100%|██████████| 2465/2465 [02:38<00:00, 15.59batch/s, cos=0.539, loss=0.8030, nce_acc=0.842, ordered=1]
	E8 train: 158s loss=0.8028 nce=0.5365 mse=0.0072 bce=0.1076 nce_acc=0.843
	E8 val: mAP=0.344 F1=0.331 R@1=0.207 cos=0.523 cv=0.1779 anchors=95/256 seen=5000/5000 ★
	E 9/20 train: 100%|██████████| 2465/2465 [02:38<00:00, 15.58batch/s, cos=0.557, loss=0.7229, nce_acc=0.866, ordered=1]
	E9 train: 158s loss=0.7228 nce=0.4665 mse=0.0070 bce=0.1041 nce_acc=0.866
	E9 val: mAP=0.361 F1=0.349 R@1=0.218 cos=0.537 cv=0.1764 anchors=95/256 seen=5000/5000 ★
	E10/20 train: 36%|███▌ | 892/2465 [00:57<01:40, 15.69batch/s, cos=0.572, loss=0.6548, nce_acc=0.887, ordered=1]
	```
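The `cv` value in the log above is never defined in this README. One plausible reading, and it is only an assumption, is the coefficient of variation of pentachoron (4-simplex) volumes over random 5-point subsets of the embeddings. Under that assumption it could be measured roughly as follows; the function names, sample count, and random toy data are all hypothetical.

```python
import math
import numpy as np

def simplex_volume(points):
    """Volume of the simplex spanned by k+1 points, via the Gram determinant."""
    edges = points[1:] - points[0]              # k edge vectors from vertex 0
    gram = edges @ edges.T
    k = len(edges)
    return np.sqrt(max(np.linalg.det(gram), 0.0)) / math.factorial(k)

def pentachoron_cv(embeddings, samples=256, seed=0):
    """Coefficient of variation (std / mean) of volumes over random 5-point subsets."""
    rng = np.random.default_rng(seed)
    vols = np.array([
        simplex_volume(embeddings[rng.choice(len(embeddings), 5, replace=False)])
        for _ in range(samples)
    ])
    return vols.std() / vols.mean()

# Sanity check: the unit 4-simplex (origin plus 4 basis vectors) has volume 1/24.
unit_volume = simplex_volume(np.vstack([np.zeros(4), np.eye(4)]))

rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 128))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # random points on the hypersphere
cv = pentachoron_cv(emb)
print(f"unit simplex volume={unit_volume:.6f} cv={cv:.4f}")
```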


	# Experiment 2.5:
	The Xavier-initialized, Procrustes-aligned embedding array attached to a standard patch16 subset should suffice.

	I'll be training this like CaptionBERT but with a twist: the soup expert is the alignment bank for this one, and I trained it first instead of later.

	The alignment and R@1 are nearly perfect, so it should be cohesive enough through the chain of conceptualization to coalesce through the implications.

	Whether the actual patches will learn from the embedding and encoding spectrum, and how quickly I can make them learn, is another story.

	The output this encoder produces is a 128-dimensional enriched representational lookup plane on a hypersphere.
	This is more than enough information to house access to any data route that exists.

	The dimensional spectrum of a 5d object is so expansive and so enriched, that the entire spectrum of this shape requires a specific
	curation of the behavior. This is what most of the mechanisms are tasked with overall, pruning the effect of rigidity indifference preservation
	on the hypersphere represented structure.

	In other words, those 128 dimensions represent more information than I could express with words.

	# Experiment 2:
	95/256 anchors survive, emergent geometric structure formed.

	R@1 = 97.1%. Not quite there, but close. Experiment 2 was successful enough to push harder in this direction.

	Anchor collapse says it doesn't need all those anchors. It started grabbing at more by the end, which means
	the system was aligned and then started growing further on a constraint that I was unaware of.

	This drift curve needs to be controlled. Direct anchored emergence while training is risky. The bank itself
	survived so well because it was anchored post training, which gave added cohesion and complexity association
	that I have yet to discover the runtime process to train. I will be analyzing the emergence to preserve the anchoring.

	```
	=================================================================
	PHASE 5: TRAINING
	20 epochs, lr=0.001, CV target=0.2731
	=================================================================
	E 1: mAP=0.788 F1=0.731 R@1=0.971 cos=0.806 cv=0.1213 anchors=226/256 nce=0.999 loss=0.1676 ★
	E 2: mAP=0.803 F1=0.742 R@1=0.971 cos=0.809 cv=0.1178 anchors=200/256 nce=0.999 loss=0.1459 ★
	E 3: mAP=0.810 F1=0.735 R@1=0.973 cos=0.808 cv=0.1197 anchors=161/256 nce=0.999 loss=0.1431 ★
	E 4: mAP=0.817 F1=0.752 R@1=0.971 cos=0.811 cv=0.1262 anchors=131/256 nce=0.999 loss=0.1404 ★
	E 5: mAP=0.823 F1=0.755 R@1=0.971 cos=0.812 cv=0.1232 anchors=113/256 nce=0.999 loss=0.1389 ★
	E 6: mAP=0.825 F1=0.755 R@1=0.972 cos=0.815 cv=0.1105 anchors=104/256 nce=0.999 loss=0.1379 ★
	E 7: mAP=0.827 F1=0.767 R@1=0.970 cos=0.814 cv=0.1125 anchors=101/256 nce=0.999 loss=0.1369 ★
	E 8: mAP=0.829 F1=0.763 R@1=0.971 cos=0.815 cv=0.1239 anchors=99/256 nce=0.999 loss=0.1361 ★
	E 9: mAP=0.832 F1=0.764 R@1=0.972 cos=0.815 cv=0.1164 anchors=98/256 nce=0.999 loss=0.1355 ★
	E10: mAP=0.833 F1=0.765 R@1=0.968 cos=0.814 cv=0.1166 anchors=99/256 nce=0.999 loss=0.1345 ★
	E11: mAP=0.834 F1=0.763 R@1=0.971 cos=0.814 cv=0.1214 anchors=98/256 nce=0.999 loss=0.1346 ★
	E12: mAP=0.833 F1=0.764 R@1=0.973 cos=0.813 cv=0.1200 anchors=95/256 nce=0.999 loss=0.1343
	E13: mAP=0.836 F1=0.761 R@1=0.972 cos=0.813 cv=0.1081 anchors=94/256 nce=0.999 loss=0.1338 ★
	E14: mAP=0.836 F1=0.772 R@1=0.973 cos=0.812 cv=0.1170 anchors=95/256 nce=0.999 loss=0.1334
	E15: mAP=0.835 F1=0.774 R@1=0.970 cos=0.812 cv=0.1223 anchors=95/256 nce=0.999 loss=0.1338
	E16: mAP=0.837 F1=0.777 R@1=0.968 cos=0.812 cv=0.1225 anchors=96/256 nce=1.000 loss=0.1339 ★
	E17: mAP=0.834 F1=0.772 R@1=0.973 cos=0.811 cv=0.1089 anchors=95/256 nce=0.999 loss=0.1327
	E18: mAP=0.834 F1=0.770 R@1=0.973 cos=0.812 cv=0.1156 anchors=95/256 nce=0.999 loss=0.1321
	E19: mAP=0.834 F1=0.773 R@1=0.970 cos=0.811 cv=0.1224 anchors=96/256 nce=0.999 loss=0.1328
	E20: mAP=0.835 F1=0.770 R@1=0.971 cos=0.812 cv=0.1159 anchors=96/256 nce=0.999 loss=0.1328

	Best mAP: 0.837
	CV target: 0.2731
	```


	# Experiment 1:

	Total collapse. The three models did not conform and the patchwork did not learn. The objectives are not correct.

	The model defaulted to one anchor and used none of the others. The memory bank solves this problem through queue assessment with the InfoNCE hub processing,
	but this model is a different form of anchoring that did not work.

	THE ENTIRE MODEL became the anchor, instead of the anchor points within the model. I'm thinking there wasn't enough scattering, so I'll try some additional tweaks.

	## Post
	```
	Active anchors: 1/256 (0.4%)
	Every single image → anchor 65
	Anchor entropy: 0.0000
	Anchors within cos>0.5 per image: 1.0
	Nearest anchor dist: 0.016 — next nearest: 0.665

	Effective dim: 23.6/128
	Top-20 SVs explain 99.2%
	Self-sim off-diag: 0.969

	Expert uniqueness: 0.0008–0.0011
	```
	There is only one active anchor, which is essentially CLS. The uniqueness collapsed. The distance is fine, the entropy is dead.
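Diagnostics like the active-anchor count and anchor entropy above can be reproduced with a short script. This plain-Python sketch assumes an anchor counts as "active" when it is the highest-similarity anchor for at least one image, and that entropy is taken over the assignment histogram; neither definition is stated in this README, and the toy vectors are hypothetical.

```python
import math
from collections import Counter

def anchor_usage(embeddings, anchors):
    """Assign each embedding to its highest-dot-product anchor; return (active, entropy)."""
    assignments = [
        max(range(len(anchors)),
            key=lambda j: sum(a * b for a, b in zip(emb, anchors[j])))
        for emb in embeddings
    ]
    counts = Counter(assignments)
    total = len(assignments)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return len(counts), entropy

anchors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
collapsed = [[0.9, 0.1], [1.0, 0.0], [0.8, 0.2]]              # everything near anchor 0
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]  # one image per anchor

active_c, entropy_c = anchor_usage(collapsed, anchors)
active_s, entropy_s = anchor_usage(spread, anchors)
print(active_c, entropy_c, active_s, entropy_s)
```

A fully collapsed constellation gives one active anchor and zero entropy, which matches the failure signature reported here; an evenly used one approaches `log(num_anchors)`.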

	This is a shortcut bypass; additional nonlinearity must be added.

	## Assessment
	Without the centered Procrustes loss, the same result occurred. The collapse forms around one of the earlier anchors, near an off-center midpoint
	that all three models are simultaneously rotating around, which is not the direct center.

	This point has noise, invalidity, incorrect association, and additional problems based on the attention mechanisms internally to the models queried.

	## Hypothesis based on research
	The Procrustes alignment must be centered, and it must be defined precisely.