AbstractPhil committed on
Commit
496d7d3
·
verified ·
1 Parent(s): 87f65ba

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +81 -168
README.md CHANGED
@@ -1,198 +1,111 @@
1
  ---
2
  license: apache-2.0
3
  ---
4
 
5
- # Experiment 2.5 Update: COCO convergence is slow but steady.
6
 
7
- BCE loss isn't the best catalyst for geometry but it does work to funnel through an aligned transformer.
8
 
9
- I underestimated the complexity of associative cross-modal differences, but it is converging. Shared space is a very tricky
10
- catalyst to teach as an associative connection. Routing is easy, distilling is not as easy with multimodal structures and multiple
11
- adjacent representations used as loss learning targets.
12
 
13
- If this fails to meet direct expectations, I'll form a proper hub and teach using the bertenstein method. Bertenstein works because it's
14
- always expecting to hear from the experts and there is always one anchored expert in charge.
15
 
16
- The expert/student distillation process requires skilled teachers with similar utility, which is different than simply funneling
17
- information through a route and pooling it.
18
 
19
- geolip-captionbert-8192 accepts this pooled funneled information and produces useful output due to the shared
20
- expert information having similar access utilities.
 
 
 
21
 
22
- In either case, geolip-captionbert-8192 was trained from scratch and so is this model. They are not inheriting weights from any large-training,
23
- they are inheriting the geometry and structure through distillation in order to represent complex structure that quite simply should not exist
24
- in the smaller model by direct implicit learning.
25
 
26
- geolip-vit-x3 must learn to predict the pixel data using the output of the experts as markers for loss, which means it can never get a full picture of anything outside of its own tools.
27
 
28
- This model is exceptionally small, absurdly small even by vit standards. This is because even at this size, this is too much. The model cannot
29
- overfit if the model uses every tool at the expense, this model will train indefinitely unless a cascade overflow happens, a math continuity corruption
30
- occurs, or the substructure collapses to a simpler shortcut-centric behavior that would require scrambling.
31
 
32
- The anchors are strong enough and tuned to the experts, the external losses are tuned to teach the expert responses, the expert data is used
33
- as loss methods of attenuation, and the structure conforms to those losses specifically because it's required to teach the model to be
34
- standalone and compliant without requiring the experts later.
35
 
36
- I gave the model everything I could geometrically, and it must discover the way to connect them.
37
 
38
- I'm teaching siglip, dinovit and clip-vit to communicate on the same manifold. They are essentially speaking three dialects of a foreign offshoot
39
- language that evolved from Roman a thousand years later.
40
 
41
- The fact that this works at all is a testament to the hypersphere attenuation.
42
 
43
- ```
44
- =================================================================
45
- GEOLIP VISION ENCODER — FROM SCRATCH
46
- ViT: 6L/384d/6h, patch16
47
- 196 patches + CLS → 128-d output
48
- Device: cuda
49
- =================================================================
50
-
51
- Loading soup...
52
- Soup: mAP=0.837 CV_target=0.2731
53
- train: loaded cached targets (118,287)
54
- val: loaded cached targets (5,000)
55
- Caching train images (118,287)...
56
-
57
- =================================================================
58
- BUILD ENCODER
59
- =================================================================
60
- Architecture: 6L/384d/6h, patch16
61
- Input: 224×224 → 196 patches
62
- Output: 128-d (on hypersphere)
63
- Parameters: 11,216,768
64
-
65
- =================================================================
66
- TRAINING
67
- 20 epochs, lr=0.0003, batch=48
68
- Losses: InfoNCE + MSE + CV + BCE + Procrustes alignment
69
- CV target: 0.2731
70
- Images: train=118,287 val=5,000 (cached as tensors)
71
- =================================================================
72
- E 1/20 train: 100%|██████████| 2465/2465 [02:44<00:00, 14.97batch/s, cos=0.258, loss=2.6911, nce_acc=0.339, ordered=1]
73
- E1 train: 165s loss=2.6891 nce=2.2529 mse=0.0120 bce=0.1963 nce_acc=0.340
74
- E1 val: mAP=0.151 F1=0.162 R@1=0.032 cos=0.325 cv=0.2663 anchors=95/256 seen=5000/5000 ★
75
- E 2/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.32batch/s, cos=0.368, loss=1.7954, nce_acc=0.553, ordered=1]
76
- E2 train: 161s loss=1.7948 nce=1.4297 mse=0.0099 bce=0.1473 nce_acc=0.553
77
- E2 val: mAP=0.206 F1=0.197 R@1=0.062 cos=0.390 cv=0.2552 anchors=99/256 seen=5000/5000 ★
78
- E 3/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.37batch/s, cos=0.416, loss=1.4860, nce_acc=0.641, ordered=1]
79
- E3 train: 160s loss=1.4854 nce=1.1484 mse=0.0092 bce=0.1338 nce_acc=0.641
80
- E3 val: mAP=0.246 F1=0.244 R@1=0.091 cos=0.427 cv=0.2234 anchors=98/256 seen=5000/5000 ★
81
- E 4/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.40batch/s, cos=0.448, loss=1.2913, nce_acc=0.695, ordered=1]
82
- E4 train: 160s loss=1.2910 nce=0.9727 mse=0.0087 bce=0.1265 nce_acc=0.695
83
- E4 val: mAP=0.272 F1=0.266 R@1=0.113 cos=0.453 cv=0.2078 anchors=99/256 seen=5000/5000 ★
84
- E 5/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.40batch/s, cos=0.475, loss=1.1334, nce_acc=0.743, ordered=1]
85
- E5 train: 160s loss=1.1331 nce=0.8303 mse=0.0083 bce=0.1205 nce_acc=0.743
86
- E5 val: mAP=0.296 F1=0.292 R@1=0.139 cos=0.473 cv=0.2133 anchors=98/256 seen=5000/5000 ★
87
- E 6/20 train: 100%|██████████| 2465/2465 [02:37<00:00, 15.63batch/s, cos=0.499, loss=1.0005, nce_acc=0.784, ordered=1]
88
- E6 train: 158s loss=1.0003 nce=0.7111 mse=0.0079 bce=0.1158 nce_acc=0.784
89
- E6 val: mAP=0.317 F1=0.311 R@1=0.164 cos=0.495 cv=0.1835 anchors=98/256 seen=5000/5000 ★
90
- E 7/20 train: 100%|██████████| 2465/2465 [02:38<00:00, 15.60batch/s, cos=0.520, loss=0.8947, nce_acc=0.815, ordered=1]
91
- E7 train: 158s loss=0.8943 nce=0.6172 mse=0.0075 bce=0.1115 nce_acc=0.815
92
- E7 val: mAP=0.337 F1=0.335 R@1=0.190 cos=0.513 cv=0.1809 anchors=96/256 seen=5000/5000 ★
93
- E 8/20 train: 100%|██████████| 2465/2465 [02:38<00:00, 15.59batch/s, cos=0.539, loss=0.8030, nce_acc=0.842, ordered=1]
94
- E8 train: 158s loss=0.8028 nce=0.5365 mse=0.0072 bce=0.1076 nce_acc=0.843
95
- E8 val: mAP=0.344 F1=0.331 R@1=0.207 cos=0.523 cv=0.1779 anchors=95/256 seen=5000/5000 ★
96
- E 9/20 train: 100%|██████████| 2465/2465 [02:38<00:00, 15.58batch/s, cos=0.557, loss=0.7229, nce_acc=0.866, ordered=1]
97
- E9 train: 158s loss=0.7228 nce=0.4665 mse=0.0070 bce=0.1041 nce_acc=0.866
98
- E9 val: mAP=0.361 F1=0.349 R@1=0.218 cos=0.537 cv=0.1764 anchors=95/256 seen=5000/5000 ★
99
- E10/20 train: 36%|███▌ | 892/2465 [00:57<01:40, 15.69batch/s, cos=0.572, loss=0.6548, nce_acc=0.887, ordered=1]
100
- ```
101
-
102
-
103
- # Experiment 2.5:
104
- The xavier aligned and procrustes embedding array attached to a standard patch16 subset should suffice.
105
-
106
- I'll be training this like CaptionBERT but with a twist, the soup expert is the alignment bank for this one, and I trained it first instead of later.
107
-
108
- The alignment and R1 is nearly perfect, so it should be cohesive enough through the chain of conceptualization to coalesce through the implications.
109
-
110
- Now it's another story, if the actual patches will learn based on the embedding and encoding spectrum, and how quickly I can make them learn.
111
-
112
- The output this encoder produces is a 128 dimensional enriched representational lookup plane on a hypersphere.
113
- This is more than enough information to house access to any data route that exists.
114
-
115
- The dimensional spectrum of a 5d object is so expansive and so enriched, that the entire spectrum of this shape requires a specific
116
- curation of the behavior. This is what most of the mechanisms are tasked with overall, pruning the effect of rigidity indifference preservation
117
- on the hypersphere represented structure.
118
 
119
- In other words, that 128 dimensions represents more information than I could express with words.
120
-
121
- # Experiment 2:
122
- 95/256 anchors survive, emergent geometric structure formed.
123
-
124
- R@1= 97.1%, not quite but getting there. Experiment 2 was successful enough to push harder in this direction.
125
-
126
- Anchor collapse says it doesn't need all those anchors. It started grabbing at more by the end, which means
127
- the system was aligned and then started growing further on a constraint that I was unaware of.
128
-
129
- This drift curve needs to be controlled. Direct anchored emergence while training is risky. The bank itself
130
- survived so well because it was anchored post training, which gave added cohesion and complexity association
131
- that I have yet to discover the runtime process to train. I will be analyzing the emergence to preserve the anchoring.
132
 
 
133
  ```
134
- =================================================================
135
- PHASE 5: TRAINING
136
- 20 epochs, lr=0.001, CV target=0.2731
137
- =================================================================
138
- E 1: mAP=0.788 F1=0.731 R@1=0.971 cos=0.806 cv=0.1213 anchors=226/256 nce=0.999 loss=0.1676 ★
139
- E 2: mAP=0.803 F1=0.742 R@1=0.971 cos=0.809 cv=0.1178 anchors=200/256 nce=0.999 loss=0.1459 ★
140
- E 3: mAP=0.810 F1=0.735 R@1=0.973 cos=0.808 cv=0.1197 anchors=161/256 nce=0.999 loss=0.1431 ★
141
- E 4: mAP=0.817 F1=0.752 R@1=0.971 cos=0.811 cv=0.1262 anchors=131/256 nce=0.999 loss=0.1404 ★
142
- E 5: mAP=0.823 F1=0.755 R@1=0.971 cos=0.812 cv=0.1232 anchors=113/256 nce=0.999 loss=0.1389 ★
143
- E 6: mAP=0.825 F1=0.755 R@1=0.972 cos=0.815 cv=0.1105 anchors=104/256 nce=0.999 loss=0.1379 ★
144
- E 7: mAP=0.827 F1=0.767 R@1=0.970 cos=0.814 cv=0.1125 anchors=101/256 nce=0.999 loss=0.1369 ★
145
- E 8: mAP=0.829 F1=0.763 R@1=0.971 cos=0.815 cv=0.1239 anchors=99/256 nce=0.999 loss=0.1361 ★
146
- E 9: mAP=0.832 F1=0.764 R@1=0.972 cos=0.815 cv=0.1164 anchors=98/256 nce=0.999 loss=0.1355 ★
147
- E10: mAP=0.833 F1=0.765 R@1=0.968 cos=0.814 cv=0.1166 anchors=99/256 nce=0.999 loss=0.1345 ★
148
- E11: mAP=0.834 F1=0.763 R@1=0.971 cos=0.814 cv=0.1214 anchors=98/256 nce=0.999 loss=0.1346 ★
149
- E12: mAP=0.833 F1=0.764 R@1=0.973 cos=0.813 cv=0.1200 anchors=95/256 nce=0.999 loss=0.1343
150
- E13: mAP=0.836 F1=0.761 R@1=0.972 cos=0.813 cv=0.1081 anchors=94/256 nce=0.999 loss=0.1338 ★
151
- E14: mAP=0.836 F1=0.772 R@1=0.973 cos=0.812 cv=0.1170 anchors=95/256 nce=0.999 loss=0.1334
152
- E15: mAP=0.835 F1=0.774 R@1=0.970 cos=0.812 cv=0.1223 anchors=95/256 nce=0.999 loss=0.1338
153
- E16: mAP=0.837 F1=0.777 R@1=0.968 cos=0.812 cv=0.1225 anchors=96/256 nce=1.000 loss=0.1339 ★
154
- E17: mAP=0.834 F1=0.772 R@1=0.973 cos=0.811 cv=0.1089 anchors=95/256 nce=0.999 loss=0.1327
155
- E18: mAP=0.834 F1=0.770 R@1=0.973 cos=0.812 cv=0.1156 anchors=95/256 nce=0.999 loss=0.1321
156
- E19: mAP=0.834 F1=0.773 R@1=0.970 cos=0.811 cv=0.1224 anchors=96/256 nce=0.999 loss=0.1328
157
- E20: mAP=0.835 F1=0.770 R@1=0.971 cos=0.812 cv=0.1159 anchors=96/256 nce=0.999 loss=0.1328
158
-
159
- Best mAP: 0.837
160
- CV target: 0.2731
161
  ```
162
 
 
163
 
164
- # Experiment 1:
 
 
 
165
 
166
- Total collapse. The three models did not conform and the patchwork did not learn. The objectives are not correct.
167
 
168
- One anchor was defaulted to; none of the others were utilized. The memory bank solves this problem through queue assessment with the InfoNCE hub processing,
169
- but this model is a different form of anchoring that did not work.
 
 
 
170
 
171
- THE ENTIRE MODEL became the anchor, instead of the anchorpoints within the model. I'm thinking there wasn't enough scattering, so I'll try some additional tweaks.
172
 
173
- ## Post
174
- ```
175
- Active anchors: 1/256 (0.4%)
176
- Every single image → anchor 65
177
- Anchor entropy: 0.0000
178
- Anchors within cos>0.5 per image: 1.0
179
- Nearest anchor dist: 0.016 — next nearest: 0.665
180
 
181
- Effective dim: 23.6/128
182
- Top-20 SVs explain 99.2%
183
- Self-sim off-diag: 0.969
184
 
185
- Expert uniqueness: 0.0008–0.0011
 
 
 
 
186
  ```
187
- There is only one active anchor, which is essentially CLS. The uniqueness collapsed. The distance is fine, the entropy is dead.
188
-
189
- Shortcut bypass, additional nonlinearity must be made.
190
-
191
- ## Assessment
192
- Without the centered procrustes loss the same result happened. The collapse forms around one of the earlier anchors, around the outside middlepoint of
193
- where all three models are simultaneously rotating around a point, which is not the direct center.
194
-
195
- This point has noise, invalidity, incorrect association, and additional problems based on the attention mechanisms internally to the models queried.
196
-
197
- ## Hypothesis based on research
198
- The procrustes alignment must align centerwise, and it must be defined specifically to specifications.
 
1
  ---
2
  license: apache-2.0
3
+ tags:
4
+ - geometric-deep-learning
5
+ - vision
6
+ - multi-expert
7
+ - patchwork
8
+ - hypersphere
9
+ - from-scratch
10
  ---
11
 
12
+ # GeoLIP ViT Base x3
13
 
14
+ Geometric vision system: 3-expert consensus soup + from-scratch ViT encoder.
15
 
16
+ ## Components
 
 
17
 
18
+ ### 1. Base Tier Soup (teacher)
 
19
 
20
+ 800K parameter geometric fusion of 3 pretrained vision experts on a 128-d hypersphere.
 
21
 
22
+ | Expert | Architecture | Training | Dim |
23
+ |--------|-------------|----------|-----|
24
+ | clip_l14_openai | ViT-L/14 | Text-supervised (CLIP) | 768 |
25
+ | dinov2_b14 | ViT-B/14 | Self-supervised (DINO) | 768 |
26
+ | siglip_b16_384 | ViT-B/16 | Sigmoid contrastive (SigLIP) | 768 |
27
 
28
+ **Pipeline:** GPA alignment at 768-d → PCA to 128-d → per-expert whitened Procrustes calibration → Procrustes-initialized projectors → geometric autograd training.
 
 
29
 
30
+ | Metric | Value |
31
+ |--------|-------|
32
+ | mAP (COCO) | 0.837 |
33
+ | Parameters | 799,952 |
34
+ | Anchors | 256 × 128-d |
35
+ | Consensus CV (768-d) | 0.2793 |
36
+ | Consensus CV (128-d) | 0.2731 |
37
+ | Optimizer | Adam, no weight decay |
38
 
39
+ ### 2. From-Scratch ViT Encoder (student)
 
 
40
 
41
+ 11M parameter ViT trained from Xavier initialization against the soup's consensus targets. No pretrained weights anywhere. Same architecture pattern as CaptionBERT.
 
 
42
 
43
+ | Config | Value |
44
+ |--------|-------|
45
+ | Layers | 6 |
46
+ | Hidden dim | 384 |
47
+ | Heads | 6 |
48
+ | FFN dim | 1536 |
49
+ | Patch size | 16 |
50
+ | Image size | 224 |
51
+ | Output dim | 128 (on hypersphere) |
52
+ | Parameters | 11,216,768 |
53
 
54
+ **Training:** Raw COCO images → encoder → 128-d embedding → frozen soup pipeline (constellation + patchwork + classifier) → BCE loss. Additional losses: InfoNCE + MSE against consensus targets, whitened Procrustes alignment, pentachoron CV (calibrated to measured consensus).
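
The two target-matching terms named above (InfoNCE and MSE against the consensus targets) can be sketched in plain Python. This is only an illustration under assumptions, not the training script: the function names, the temperature of 0.07 and the equal weights are hypothetical, and the BCE, CV and Procrustes terms are left out.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so it lies on the hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def info_nce(student, targets, temperature=0.07):
    """Mean cross-entropy where student row i should match target row i.

    The temperature value is an assumption, not taken from the README.
    """
    student = [l2_normalize(s) for s in student]
    targets = [l2_normalize(t) for t in targets]
    total = 0.0
    for i, s in enumerate(student):
        logits = [sum(a * b for a, b in zip(s, t)) / temperature for t in targets]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denom - logits[i]
    return total / len(student)

def mse(student, targets):
    """Mean squared error over every coordinate of every row."""
    count = sum(len(s) for s in student)
    squared = sum((a - b) ** 2 for s, t in zip(student, targets) for a, b in zip(s, t))
    return squared / count

def combined_loss(student, targets, w_nce=1.0, w_mse=1.0):
    """Weighted sum of the two terms; the weights are hypothetical."""
    return w_nce * info_nce(student, targets) + w_mse * mse(student, targets)
```

When the embeddings already equal their targets and the targets are well separated, both terms approach zero; swapping two rows makes the InfoNCE term large because each embedding then matches the wrong target.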
 
55
 
56
+ #### Results (20 epochs, still converging)
57
 
58
+ | Metric | E1 | E10 | E20 |
59
+ |--------|-----|------|------|
60
+ | nce_acc | 0.340 | 0.887 | 0.972 |
61
+ | cos→consensus | 0.325 | 0.557 | 0.599 |
62
+ | R@1 (5K) | 0.032 | 0.254 | 0.323 |
63
+ | mAP | 0.151 | 0.380 | 0.429 |
64
+ | F1 | 0.162 | 0.361 | 0.418 |
65
+ | Active anchors | 95 | 96 | 94 |
66
 
67
+ All metrics are still climbing at E20. The model likely needs 60-90 epochs to fully converge (matching CaptionBERT's text encoder trajectory).
68
 
69
+ ## Architecture
70
  ```
71
+ Training (soup as teacher):
72
+ 3 expert features → Procrustes projectors → mean → L2-norm → 128-d consensus targets
73
+ Raw images → from-scratch ViT → 128-d embedding
74
+ Losses: InfoNCE + MSE + CV + BCE(through frozen soup) + Procrustes alignment
75
+ Geometric autograd: tangential=0.01, separation=1.0
76
+
77
+ Inference (standalone):
78
+ Raw image → ViT encoder → 128-d embedding (on hypersphere)
79
+ No experts needed. Geometry is baked in.
80
  ```
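
The "mean → L2-norm" step of the training path can be illustrated with a small sketch. It assumes the three expert vectors have already been passed through their projectors into the shared space; `consensus_target` is a hypothetical name, and the real pipeline presumably operates on batched tensors rather than Python lists.

```python
import math

def consensus_target(projected):
    """Average already-projected expert vectors, then L2-normalize the mean."""
    dim = len(projected[0])
    mean = [sum(vec[i] for vec in projected) / len(projected) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    if norm == 0.0:
        # Opposing experts can cancel exactly; there is no direction to keep.
        raise ValueError("expert vectors cancel out; consensus is undefined")
    return [x / norm for x in mean]

# Toy 2-d stand-ins for the three projected expert outputs.
target = consensus_target([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Normalizing after averaging keeps the target on the unit hypersphere even when the experts disagree, at the cost of discarding how strongly they agreed.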
81
 
82
+ ## Key Findings
83
 
84
+ - The 800K-parameter soup (0.837 mAP) beats both the 81.7M-parameter 34-expert soup (0.732 mAP) and the 75.6M-parameter 34-expert bank (0.782 mAP)
85
+ - Proper calibration (GPA + whitened Procrustes + measured CV target) is essential — without it, constellation collapses to 1/256 active anchors
86
+ - From-scratch ViT learns the 3-expert consensus representation from raw pixels with the same convergence dynamics as CaptionBERT on text
87
+ - Cross-model weight cosine is 0.000 but activation Procrustes is 0.999 — the models encode identical geometry through completely different weight configurations
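
A collapse such as "1/256 active anchors" can be checked with a simple usage count. The sketch below assumes an anchor counts as active when at least one embedding has it as its highest-cosine anchor; how the training scripts actually define "active" is not stated here, so treat both the definition and the function name as assumptions.

```python
import math
from collections import Counter

def _unit(vec):
    """Return the vector scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def anchor_usage(embeddings, anchors):
    """Assign each embedding to its highest-cosine anchor.

    Returns (number of anchors used at least once, assignment entropy in nats).
    """
    unit_anchors = [_unit(a) for a in anchors]
    counts = Counter()
    for emb in embeddings:
        e = _unit(emb)
        sims = [sum(x * y for x, y in zip(e, a)) for a in unit_anchors]
        counts[max(range(len(sims)), key=sims.__getitem__)] += 1
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return len(counts), entropy
```

An entropy of zero together with a single active anchor is the signature of the collapsed case; a healthy constellation spreads assignments across many anchors and the entropy rises accordingly.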
88
 
89
+ ## Files
90
 
91
+ - `base_tier_soup_calibrated.pt` — Trained soup (teacher)
92
+ - `geolip_vit_encoder_e20.pt` — ViT encoder at epoch 20
93
+ - `base_tier_soup_calibrated.py` — Soup training script
94
+ - `vit_encoder_from_scratch.py` — Encoder training script
95
+ - `runs/` — Tensorboard logs
96
 
97
+ ## Data
98
 
99
+ - Training features: [AbstractPhil/bulk-coco-features](https://huggingface.co/datasets/AbstractPhil/bulk-coco-features)
100
+ - Images: COCO 2017 (118K train, 5K val)
 
 
 
 
 
101
 
102
+ ## Usage
103
+ ```python
104
+ import torch
105
 
106
+ # Load encoder
107
+ ckpt = torch.load("geolip_vit_encoder_e20.pt", weights_only=False)
108
+ # ckpt["encoder_state_dict"] — model weights
109
+ # ckpt["config"] — architecture config
110
+ # ckpt["mAP"], ckpt["cos"], ckpt["r1"] — metrics
111
  ```
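
A small helper can make the checkpoint fields listed above easier to inspect. This is a sketch: `summarize_checkpoint` is a hypothetical helper, the stand-in dictionary only mimics the documented keys, its config key names are invented for illustration, and its metric values are placeholders copied from the E20 column of the results table.

```python
def summarize_checkpoint(ckpt):
    """Report the documented checkpoint fields, tolerating missing keys."""
    state = ckpt.get("encoder_state_dict", {})
    return {
        "num_tensors": len(state),
        "config": ckpt.get("config"),
        "metrics": {key: ckpt.get(key) for key in ("mAP", "cos", "r1")},
    }

# Stand-in dictionary so the sketch runs without PyTorch. With PyTorch
# installed, pass the result of
# torch.load("geolip_vit_encoder_e20.pt", weights_only=False) instead.
fake_ckpt = {
    "encoder_state_dict": {"layer.weight": [0.0], "layer.bias": [0.0]},
    "config": {"layers": 6, "hidden_dim": 384},  # key names are assumptions
    "mAP": 0.429,
    "cos": 0.599,
    "r1": 0.323,
}
print(summarize_checkpoint(fake_ckpt))
```

Because `weights_only=False` lets `torch.load` unpickle arbitrary Python objects, only load checkpoint files from a source you trust.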