Upload README.md with huggingface_hub
README.md
CHANGED
|
@@ -1,198 +1,111 @@
|
|
| 1 |
---
|
| 2 |
license: apache-2.0
|
| 3 |
---
|
| 4 |
|
| 5 |
-
#
|
| 6 |
|
| 7 |
-
|
| 8 |
|
| 9 |
-
|
| 10 |
-
catalyst to teach as an associative connection. Routing is easy, distilling is not as easy with multimodal structures and multiple
|
| 11 |
-
adjacent representations used as loss learning targets.
|
| 12 |
|
| 13 |
-
|
| 14 |
-
always expecting to hear from the experts and there is always one anchored expert in charge.
|
| 15 |
|
| 16 |
-
|
| 17 |
-
information through a route and pooling it.
|
| 18 |
|
| 19 |
-
|
| 20 |
-
|
| 21 |
|
| 22 |
-
|
| 23 |
-
they are inheriting the geometry and structure through distillation in order to represent complex structure that quite simply should not exist
|
| 24 |
-
in the smaller model by direct implicit learning.
|
| 25 |
|
| 26 |
-
|
| 27 |
|
| 28 |
-
|
| 29 |
-
overfit if the model uses every tool at the expense, this model will train indefinitely unless a cascade overflow happens, a math continuity corruption
|
| 30 |
-
occurs, or the substructure collapses to a simpler shortcut-centric behavior that would require scrambling.
|
| 31 |
|
| 32 |
-
|
| 33 |
-
as loss methods of attenuation, and the structure conforms to those losses specifically because it's required to teach the model to be
|
| 34 |
-
standalone and compliant without requiring the experts later.
|
| 35 |
|
| 36 |
-
|
| 37 |
|
| 38 |
-
|
| 39 |
-
evolved thousand year later Roman.
|
| 40 |
|
| 41 |
-
|
| 42 |
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
Loading soup...
|
| 52 |
-
Soup: mAP=0.837 CV_target=0.2731
|
| 53 |
-
train: loaded cached targets (118,287)
|
| 54 |
-
val: loaded cached targets (5,000)
|
| 55 |
-
Caching train images (118,287)...
|
| 56 |
-
|
| 57 |
-
=================================================================
|
| 58 |
-
BUILD ENCODER
|
| 59 |
-
=================================================================
|
| 60 |
-
Architecture: 6L/384d/6h, patch16
|
| 61 |
-
Input: 224×224 → 196 patches
|
| 62 |
-
Output: 128-d (on hypersphere)
|
| 63 |
-
Parameters: 11,216,768
|
| 64 |
-
|
| 65 |
-
=================================================================
|
| 66 |
-
TRAINING
|
| 67 |
-
20 epochs, lr=0.0003, batch=48
|
| 68 |
-
Losses: InfoNCE + MSE + CV + BCE + Procrustes alignment
|
| 69 |
-
CV target: 0.2731
|
| 70 |
-
Images: train=118,287 val=5,000 (cached as tensors)
|
| 71 |
-
=================================================================
|
| 72 |
-
E 1/20 train: 100%|██████████| 2465/2465 [02:44<00:00, 14.97batch/s, cos=0.258, loss=2.6911, nce_acc=0.339, ordered=1]
|
| 73 |
-
E1 train: 165s loss=2.6891 nce=2.2529 mse=0.0120 bce=0.1963 nce_acc=0.340
|
| 74 |
-
E1 val: mAP=0.151 F1=0.162 R@1=0.032 cos=0.325 cv=0.2663 anchors=95/256 seen=5000/5000 ★
|
| 75 |
-
E 2/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.32batch/s, cos=0.368, loss=1.7954, nce_acc=0.553, ordered=1]
|
| 76 |
-
E2 train: 161s loss=1.7948 nce=1.4297 mse=0.0099 bce=0.1473 nce_acc=0.553
|
| 77 |
-
E2 val: mAP=0.206 F1=0.197 R@1=0.062 cos=0.390 cv=0.2552 anchors=99/256 seen=5000/5000 ★
|
| 78 |
-
E 3/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.37batch/s, cos=0.416, loss=1.4860, nce_acc=0.641, ordered=1]
|
| 79 |
-
E3 train: 160s loss=1.4854 nce=1.1484 mse=0.0092 bce=0.1338 nce_acc=0.641
|
| 80 |
-
E3 val: mAP=0.246 F1=0.244 R@1=0.091 cos=0.427 cv=0.2234 anchors=98/256 seen=5000/5000 ★
|
| 81 |
-
E 4/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.40batch/s, cos=0.448, loss=1.2913, nce_acc=0.695, ordered=1]
|
| 82 |
-
E4 train: 160s loss=1.2910 nce=0.9727 mse=0.0087 bce=0.1265 nce_acc=0.695
|
| 83 |
-
E4 val: mAP=0.272 F1=0.266 R@1=0.113 cos=0.453 cv=0.2078 anchors=99/256 seen=5000/5000 ★
|
| 84 |
-
E 5/20 train: 100%|██████████| 2465/2465 [02:40<00:00, 15.40batch/s, cos=0.475, loss=1.1334, nce_acc=0.743, ordered=1]
|
| 85 |
-
E5 train: 160s loss=1.1331 nce=0.8303 mse=0.0083 bce=0.1205 nce_acc=0.743
|
| 86 |
-
E5 val: mAP=0.296 F1=0.292 R@1=0.139 cos=0.473 cv=0.2133 anchors=98/256 seen=5000/5000 ★
|
| 87 |
-
E 6/20 train: 100%|██████████| 2465/2465 [02:37<00:00, 15.63batch/s, cos=0.499, loss=1.0005, nce_acc=0.784, ordered=1]
|
| 88 |
-
E6 train: 158s loss=1.0003 nce=0.7111 mse=0.0079 bce=0.1158 nce_acc=0.784
|
| 89 |
-
E6 val: mAP=0.317 F1=0.311 R@1=0.164 cos=0.495 cv=0.1835 anchors=98/256 seen=5000/5000 ★
|
| 90 |
-
E 7/20 train: 100%|██████████| 2465/2465 [02:38<00:00, 15.60batch/s, cos=0.520, loss=0.8947, nce_acc=0.815, ordered=1]
|
| 91 |
-
E7 train: 158s loss=0.8943 nce=0.6172 mse=0.0075 bce=0.1115 nce_acc=0.815
|
| 92 |
-
E7 val: mAP=0.337 F1=0.335 R@1=0.190 cos=0.513 cv=0.1809 anchors=96/256 seen=5000/5000 ★
|
| 93 |
-
E 8/20 train: 100%|██████████| 2465/2465 [02:38<00:00, 15.59batch/s, cos=0.539, loss=0.8030, nce_acc=0.842, ordered=1]
|
| 94 |
-
E8 train: 158s loss=0.8028 nce=0.5365 mse=0.0072 bce=0.1076 nce_acc=0.843
|
| 95 |
-
E8 val: mAP=0.344 F1=0.331 R@1=0.207 cos=0.523 cv=0.1779 anchors=95/256 seen=5000/5000 ★
|
| 96 |
-
E 9/20 train: 100%|██████████| 2465/2465 [02:38<00:00, 15.58batch/s, cos=0.557, loss=0.7229, nce_acc=0.866, ordered=1]
|
| 97 |
-
E9 train: 158s loss=0.7228 nce=0.4665 mse=0.0070 bce=0.1041 nce_acc=0.866
|
| 98 |
-
E9 val: mAP=0.361 F1=0.349 R@1=0.218 cos=0.537 cv=0.1764 anchors=95/256 seen=5000/5000 ★
|
| 99 |
-
E10/20 train: 36%|███▌ | 892/2465 [00:57<01:40, 15.69batch/s, cos=0.572, loss=0.6548, nce_acc=0.887, ordered=1]
|
| 100 |
-
```
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
# Experiment 2.5:
|
| 104 |
-
The Xavier-aligned, Procrustes embedding array attached to a standard patch16 subset should suffice.
|
| 105 |
-
|
| 106 |
-
I'll be training this like CaptionBERT but with a twist, the soup expert is the alignment bank for this one, and I trained it first instead of later.
|
| 107 |
-
|
| 108 |
-
The alignment and R@1 are nearly perfect, so it should be cohesive enough through the chain of conceptualization to coalesce through the implications.
|
| 109 |
-
|
| 110 |
-
Now it's another story, if the actual patches will learn based on the embedding and encoding spectrum, and how quickly I can make them learn.
|
| 111 |
-
|
| 112 |
-
The output this encoder produces is a 128 dimensional enriched representational lookup plane on a hypersphere.
|
| 113 |
-
This is more than enough information to house access to any data route that exists.
|
| 114 |
-
|
| 115 |
-
The dimensional spectrum of a 5d object is so expansive and so enriched, that the entire spectrum of this shape requires a specific
|
| 116 |
-
curation of the behavior. This is what most of the mechanisms are tasked with overall, pruning the effect of rigidity indifference preservation
|
| 117 |
-
on the hypersphere represented structure.
|
| 118 |
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
# Experiment 2:
|
| 122 |
-
95/256 anchors survive, emergent geometric structure formed.
|
| 123 |
-
|
| 124 |
-
R@1= 97.1%, not quite but getting there. Experiment 2 was successful enough to push harder in this direction.
|
| 125 |
-
|
| 126 |
-
Anchor collapse says it doesn't need all those anchors. It started grabbing at more by the end, which means
|
| 127 |
-
the system was aligned and then started growing further on a constraint that I was unaware of.
|
| 128 |
-
|
| 129 |
-
This drift curve needs to be controlled. Direct anchored emergence while training is risky. The bank itself
|
| 130 |
-
survived so well because it was anchored post training, which gave added cohesion and complexity association
|
| 131 |
-
that I have yet to discover the runtime process to train. I will be analyzing the emergence to preserve the anchoring.
|
| 132 |
|
| 133 |
```
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
|
| 141 |
-
|
| 142 |
-
|
| 143 |
-
E 6: mAP=0.825 F1=0.755 R@1=0.972 cos=0.815 cv=0.1105 anchors=104/256 nce=0.999 loss=0.1379 ★
|
| 144 |
-
E 7: mAP=0.827 F1=0.767 R@1=0.970 cos=0.814 cv=0.1125 anchors=101/256 nce=0.999 loss=0.1369 ★
|
| 145 |
-
E 8: mAP=0.829 F1=0.763 R@1=0.971 cos=0.815 cv=0.1239 anchors=99/256 nce=0.999 loss=0.1361 ★
|
| 146 |
-
E 9: mAP=0.832 F1=0.764 R@1=0.972 cos=0.815 cv=0.1164 anchors=98/256 nce=0.999 loss=0.1355 ★
|
| 147 |
-
E10: mAP=0.833 F1=0.765 R@1=0.968 cos=0.814 cv=0.1166 anchors=99/256 nce=0.999 loss=0.1345 ★
|
| 148 |
-
E11: mAP=0.834 F1=0.763 R@1=0.971 cos=0.814 cv=0.1214 anchors=98/256 nce=0.999 loss=0.1346 ★
|
| 149 |
-
E12: mAP=0.833 F1=0.764 R@1=0.973 cos=0.813 cv=0.1200 anchors=95/256 nce=0.999 loss=0.1343
|
| 150 |
-
E13: mAP=0.836 F1=0.761 R@1=0.972 cos=0.813 cv=0.1081 anchors=94/256 nce=0.999 loss=0.1338 ★
|
| 151 |
-
E14: mAP=0.836 F1=0.772 R@1=0.973 cos=0.812 cv=0.1170 anchors=95/256 nce=0.999 loss=0.1334
|
| 152 |
-
E15: mAP=0.835 F1=0.774 R@1=0.970 cos=0.812 cv=0.1223 anchors=95/256 nce=0.999 loss=0.1338
|
| 153 |
-
E16: mAP=0.837 F1=0.777 R@1=0.968 cos=0.812 cv=0.1225 anchors=96/256 nce=1.000 loss=0.1339 ★
|
| 154 |
-
E17: mAP=0.834 F1=0.772 R@1=0.973 cos=0.811 cv=0.1089 anchors=95/256 nce=0.999 loss=0.1327
|
| 155 |
-
E18: mAP=0.834 F1=0.770 R@1=0.973 cos=0.812 cv=0.1156 anchors=95/256 nce=0.999 loss=0.1321
|
| 156 |
-
E19: mAP=0.834 F1=0.773 R@1=0.970 cos=0.811 cv=0.1224 anchors=96/256 nce=0.999 loss=0.1328
|
| 157 |
-
E20: mAP=0.835 F1=0.770 R@1=0.971 cos=0.812 cv=0.1159 anchors=96/256 nce=0.999 loss=0.1328
|
| 158 |
-
|
| 159 |
-
Best mAP: 0.837
|
| 160 |
-
CV target: 0.2731
|
| 161 |
```
|
| 162 |
|
| 163 |
|
| 164 |
-
|
| 165 |
|
| 166 |
-
|
| 167 |
|
| 168 |
-
|
| 169 |
-
|
| 170 |
|
| 171 |
-
|
| 172 |
|
| 173 |
-
|
| 174 |
-
|
| 175 |
-
Active anchors: 1/256 (0.4%)
|
| 176 |
-
Every single image → anchor 65
|
| 177 |
-
Anchor entropy: 0.0000
|
| 178 |
-
Anchors within cos>0.5 per image: 1.0
|
| 179 |
-
Nearest anchor dist: 0.016 — next nearest: 0.665
|
| 180 |
|
| 181 |
-
|
| 182 |
-
|
| 183 |
-
|
| 184 |
|
| 185 |
-
|
| 186 |
```
|
| 187 |
-
There is only one active anchor, which is essentially CLS. The uniqueness collapsed. The distance is fine, the entropy is dead.
|
| 188 |
-
|
| 189 |
-
Shortcut bypass, additional nonlinearity must be made.
|
| 190 |
-
|
| 191 |
-
## Assessment
|
| 192 |
-
Without the centered procrustes loss the same result happened. The collapse forms around one of the earlier anchors, around the outside middlepoint of
|
| 193 |
-
where all three models are simultaneously rotating around a point, which is not the direct center.
|
| 194 |
-
|
| 195 |
-
This point has noise, invalidity, incorrect association, and additional problems based on the attention mechanisms internally to the models queried.
|
| 196 |
-
|
| 197 |
-
## Hypothesis based on research
|
| 198 |
-
The procrustes alignment must align centerwise, and it must be defined specifically to specifications.
|
| 1 |
---
|
| 2 |
license: apache-2.0
|
| 3 |
+
tags:
|
| 4 |
+
- geometric-deep-learning
|
| 5 |
+
- vision
|
| 6 |
+
- multi-expert
|
| 7 |
+
- patchwork
|
| 8 |
+
- hypersphere
|
| 9 |
+
- from-scratch
|
| 10 |
---
|
| 11 |
|
| 12 |
+
# GeoLIP ViT Base x3
|
| 13 |
|
| 14 |
+
Geometric vision system: 3-expert consensus soup + from-scratch ViT encoder.
|
| 15 |
|
| 16 |
+
## Components
|
| 17 |
|
| 18 |
+
### 1. Base Tier Soup (teacher)
|
| 19 |
|
| 20 |
+
800K-parameter geometric fusion of 3 pretrained vision experts on a 128-d hypersphere.
|
| 21 |
|
| 22 |
+
| Expert | Architecture | Training | Dim |
|
| 23 |
+
|--------|-------------|----------|-----|
|
| 24 |
+
| clip_l14_openai | ViT-L/14 | Text-supervised (CLIP) | 768 |
|
| 25 |
+
| dinov2_b14 | ViT-B/14 | Self-supervised (DINO) | 768 |
|
| 26 |
+
| siglip_b16_384 | ViT-B/16 | Sigmoid contrastive (SigLIP) | 768 |
|
| 27 |
|
| 28 |
+
**Pipeline:** GPA (generalized Procrustes analysis) alignment at 768-d → PCA to 128-d → per-expert whitened Procrustes calibration → Procrustes-initialized projectors → geometric autograd training.
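As a rough illustration of the Procrustes step named in the pipeline, here is a minimal NumPy sketch of plain orthogonal Procrustes alignment. It is not the soup's calibration code: the GPA and whitening stages are left out, and `procrustes_rotation` and the toy data are hypothetical names chosen for this example.

```python
import numpy as np

def procrustes_rotation(source, target):
    """Return the orthogonal matrix R minimizing ||source @ R - target||_F."""
    # The SVD of the cross-covariance yields the optimal orthogonal map.
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

rng = np.random.default_rng(0)
target = rng.normal(size=(200, 8))

# Simulate a "misaligned expert": the same points under a random orthogonal map.
true_q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
source = target @ true_q.T

rot = procrustes_rotation(source, target)
aligned = source @ rot
residual = np.linalg.norm(aligned - target)
print(f"alignment residual: {residual:.2e}")
```

Because the toy source is an exact orthogonal transform of the target, the residual should be close to machine precision; real expert features would only align approximately.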
|
| 29 |
|
| 30 |
+
| Metric | Value |
|
| 31 |
+
|--------|-------|
|
| 32 |
+
| mAP (COCO) | 0.837 |
|
| 33 |
+
| Parameters | 799,952 |
|
| 34 |
+
| Anchors | 256 × 128-d |
|
| 35 |
+
| Consensus CV (768-d) | 0.2793 |
|
| 36 |
+
| Consensus CV (128-d) | 0.2731 |
|
| 37 |
+
| Optimizer | Adam, no weight decay |
|
| 38 |
|
| 39 |
+
### 2. From-Scratch ViT Encoder (student)
|
| 40 |
|
| 41 |
+
11M-parameter ViT trained from Xavier initialization against the soup's consensus targets. The encoder itself uses no pretrained weights; the pretrained experts appear only on the teacher side. Same architecture pattern as CaptionBERT.
|
| 42 |
|
| 43 |
+
| Config | Value |
|
| 44 |
+
|--------|-------|
|
| 45 |
+
| Layers | 6 |
|
| 46 |
+
| Hidden dim | 384 |
|
| 47 |
+
| Heads | 6 |
|
| 48 |
+
| FFN dim | 1536 |
|
| 49 |
+
| Patch size | 16 |
|
| 50 |
+
| Image size | 224 |
|
| 51 |
+
| Output dim | 128 (on hypersphere) |
|
| 52 |
+
| Parameters | 11,216,768 |
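The quantities implied by this config can be checked with a few lines of arithmetic. This is only a sanity-check sketch; the variable names are illustrative and the 3-channel RGB input is an assumption.

```python
image_size, patch_size = 224, 16
hidden_dim, heads, ffn_dim = 384, 6, 1536

patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2          # tokens produced by the patch embedding
head_dim = hidden_dim // heads               # width of each attention head
ffn_ratio = ffn_dim // hidden_dim            # FFN expansion factor
patch_pixels = patch_size * patch_size * 3   # flattened values per patch (assumes RGB)

print(num_patches, head_dim, ffn_ratio, patch_pixels)
```

The patch count agrees with the 196 patches reported in the earlier training log for a 224×224 input.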
|
| 53 |
|
| 54 |
+
**Training:** Raw COCO images → encoder → 128-d embedding → frozen soup pipeline (constellation + patchwork + classifier) → BCE loss. Additional losses: InfoNCE + MSE against consensus targets, whitened Procrustes alignment, pentachoron CV (calibrated to measured consensus).
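To make the contrastive term concrete, here is a minimal NumPy sketch of an InfoNCE loss between student embeddings and consensus targets. It is a sketch under stated assumptions, not the training script: the temperature of 0.07 and the function name `info_nce` are hypothetical, and the accuracy it returns is only presumably analogous to the logged `nce_acc`.

```python
import numpy as np

def info_nce(student, targets, temperature=0.07):
    """Mean InfoNCE loss and top-1 accuracy for row-aligned embedding pairs."""
    # Normalize onto the unit hypersphere so dot products are cosines.
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = (s @ t.T) / temperature                      # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(s))
    loss = -log_prob[idx, idx].mean()                     # positives sit on the diagonal
    acc = (logits.argmax(axis=1) == idx).mean()
    return float(loss), float(acc)

rng = np.random.default_rng(0)
targets = rng.normal(size=(16, 128))

# A student that matches its targets exactly versus an unrelated random student.
perfect_loss, perfect_acc = info_nce(targets, targets)
random_loss, random_acc = info_nce(rng.normal(size=(16, 128)), targets)
print(perfect_loss, perfect_acc, random_loss, random_acc)
```

Each row treats its own target as the positive and every other target in the batch as a negative, so the loss falls as the student's embedding moves toward its own consensus target and away from the others.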
|
| 55 |
|
| 56 |
+
#### Results (20 epochs, still converging)
|
| 57 |
|
| 58 |
+
| Metric | E1 | E10 | E20 |
|
| 59 |
+
|--------|-----|------|------|
|
| 60 |
+
| nce_acc | 0.340 | 0.887 | 0.972 |
|
| 61 |
+
| cos→consensus | 0.325 | 0.557 | 0.599 |
|
| 62 |
+
| R@1 (5K) | 0.032 | 0.254 | 0.323 |
|
| 63 |
+
| mAP | 0.151 | 0.380 | 0.429 |
|
| 64 |
+
| F1 | 0.162 | 0.361 | 0.418 |
|
| 65 |
+
| Active anchors | 95 | 96 | 94 |
|
| 66 |
|
| 67 |
+
All accuracy metrics were still climbing at E20, while the active anchor count stayed roughly flat. The model likely needs 60-90 epochs to fully converge, judging by CaptionBERT's text encoder trajectory.
|
| 68 |
|
| 69 |
+
## Architecture
|
| 70 |
```
|
| 71 |
+
Training (soup as teacher):
|
| 72 |
+
3 expert features → Procrustes projectors → mean → L2-norm → 128-d consensus targets
|
| 73 |
+
Raw images → from-scratch ViT → 128-d embedding
|
| 74 |
+
Losses: InfoNCE + MSE + CV + BCE(through frozen soup) + Procrustes alignment
|
| 75 |
+
Geometric autograd: tangential=0.01, separation=1.0
|
| 76 |
+
|
| 77 |
+
Inference (standalone):
|
| 78 |
+
Raw image → ViT encoder → 128-d embedding (on hypersphere)
|
| 79 |
+
No experts needed. Geometry is baked in.
|
| 80 |
```
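The teacher path in the diagram (project each expert, average, L2-normalize) can be sketched as follows. This is an illustrative NumPy version with random stand-in features and projectors; `consensus_target` is a hypothetical name, and the real projectors are Procrustes-initialized and trained rather than random.

```python
import numpy as np

def consensus_target(expert_feats, projectors):
    """Project each expert into the shared space, average, then L2-normalize."""
    projected = [f @ p for f, p in zip(expert_feats, projectors)]
    mean = np.mean(projected, axis=0)
    return mean / np.linalg.norm(mean, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# Three stand-in 768-d expert feature batches and 768 -> 128 projectors.
feats = [rng.normal(size=(4, 768)) for _ in range(3)]
projs = [rng.normal(size=(768, 128)) / np.sqrt(768) for _ in range(3)]

targets = consensus_target(feats, projs)
norms = np.linalg.norm(targets, axis=1)
print(targets.shape, norms)
```

Because every target has unit norm, cosine similarity between a student embedding and its target reduces to a dot product once the student output is normalized as well.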
|
| 81 |
|
| 82 |
+
## Key Findings
|
| 83 |
|
| 84 |
+
- The 800K-parameter soup (0.837 mAP) beats both the 81.7M-parameter 34-expert soup (0.732 mAP) and the 75.6M-parameter 34-expert bank (0.782 mAP)
|
| 85 |
+
- Proper calibration (GPA + whitened Procrustes + measured CV target) is essential: without it, the constellation collapses to 1/256 active anchors
|
| 86 |
+
- From-scratch ViT learns the 3-expert consensus representation from raw pixels with the same convergence dynamics as CaptionBERT on text
|
| 87 |
+
- Cross-model weight cosine is 0.000 but activation Procrustes is 0.999 — the models encode identical geometry through completely different weight configurations
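The anchor-collapse finding can be illustrated with a small diagnostic that counts active anchors and measures the entropy of anchor assignments. This is a hedged sketch: it assumes an anchor is "active" when it is the nearest anchor (by cosine) for at least one embedding, which may differ from the repository's own definition, and `anchor_usage` is a hypothetical helper.

```python
import numpy as np

def anchor_usage(embeddings, anchors):
    """Return (number of active anchors, entropy of nearest-anchor assignments)."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    nearest = (e @ a.T).argmax(axis=1)            # closest anchor per embedding
    counts = np.bincount(nearest, minlength=len(a))
    probs = counts[counts > 0] / counts.sum()
    entropy = float(-(probs * np.log(probs)).sum())
    return int((counts > 0).sum()), entropy

rng = np.random.default_rng(0)
anchors = rng.normal(size=(256, 128))

# Healthy case: embeddings spread over the sphere.
spread = rng.normal(size=(5000, 128))
# Collapsed case: every embedding is a tiny perturbation of a single anchor.
collapsed = anchors[65] + 1e-3 * rng.normal(size=(5000, 128))

spread_active, spread_entropy = anchor_usage(spread, anchors)
collapsed_active, collapsed_entropy = anchor_usage(collapsed, anchors)
print(spread_active, spread_entropy, collapsed_active, collapsed_entropy)
```

In the collapsed case a single anchor absorbs every embedding and the entropy drops to zero, which is the same signature as the 1/256 failure mode described above.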
|
| 88 |
|
| 89 |
+
## Files
|
| 90 |
|
| 91 |
+
- `base_tier_soup_calibrated.pt` — Trained soup (teacher)
|
| 92 |
+
- `geolip_vit_encoder_e20.pt` — ViT encoder at epoch 20
|
| 93 |
+
- `base_tier_soup_calibrated.py` — Soup training script
|
| 94 |
+
- `vit_encoder_from_scratch.py` — Encoder training script
|
| 95 |
+
- `runs/` — Tensorboard logs
|
| 96 |
|
| 97 |
+
## Data
|
| 98 |
|
| 99 |
+
- Training features: [AbstractPhil/bulk-coco-features](https://huggingface.co/datasets/AbstractPhil/bulk-coco-features)
|
| 100 |
+
- Images: COCO 2017 (118K train, 5K val)
|
| 101 |
|
| 102 |
+
## Usage
|
| 103 |
+
```python
|
| 104 |
+
import torch
|
| 105 |
|
| 106 |
+
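# Load the encoder checkpoint. weights_only=False unpickles arbitrary Python objects, so only load files you trust.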
|
| 107 |
+
ckpt = torch.load("geolip_vit_encoder_e20.pt", map_location="cpu", weights_only=False)
|
| 108 |
+
# ckpt["encoder_state_dict"] — model weights
|
| 109 |
+
# ckpt["config"] — architecture config
|
| 110 |
+
# ckpt["mAP"], ckpt["cos"], ckpt["r1"] — metrics
|
| 111 |
```