abdelstark committed on
Commit 22f843c · verified · 1 Parent(s): 8a35d91

Replace broken EUPE ONNX export with validated artifact

Files changed (1)
  1. README.md +55 -92
README.md CHANGED
@@ -8,23 +8,21 @@ tags:
  - self-supervised
  - feature-extraction
  - onnx
- - distillation
  - vit
- datasets:
- - ILSVRC/imagenet-1k
  pipeline_tag: image-feature-extraction
  library_name: onnxruntime
  ---
 
- # EUPE ViT-B/16 ONNX Export
 
- ONNX export of [facebook/EUPE-ViT-B](https://huggingface.co/facebook/EUPE-ViT-B) for use with [latent-inspector](https://github.com/AbdelStark/latent-inspector) and ONNX Runtime.
 
- ## Model
 
- **EUPE** (Efficient Universal Perception Encoder) is a ViT-B/16 vision encoder distilled from multiple domain-expert foundation models into a single compact backbone. Instead of training a model from scratch on a single objective, EUPE aggregates knowledge from specialist teachers — DINOv2 for classification, depth estimators, segmentation models — into a 1.9B-parameter proxy teacher, then distills that into an efficient 86M-parameter student.
 
- The result is a compact encoder that performs well across classification, segmentation, depth estimation, and vision-language tasks simultaneously.
 
  | Property | Value |
  |----------|-------|
@@ -35,115 +33,80 @@ The result is a compact encoder that performs well across classification, segmen
  | Patch size | 16 px |
  | Input size | 224 x 224 |
  | Output tokens | 197 (1 CLS + 196 patches) |
- | Training data | LVD-1689M |
  | Paper | [Zhu et al. 2026](https://arxiv.org/abs/2603.22387) |
- | Original repo | [facebookresearch/EUPE](https://github.com/facebookresearch/EUPE) |
- | License | [FAIR Research License](https://huggingface.co/facebook/fair-noncommercial-research-license/) (non-commercial) |
 
- ## ONNX Export Process
 
- Exported from the original `facebook/EUPE-ViT-B` checkpoint:
 
- 1. **Load** via the EUPE library's `DinoVisionTransformer` with `pos_embed_rope_dtype="fp32"` — the default BFloat16 RoPE is incompatible with the TorchScript ONNX exporter
- 2. **Wrap** with a module that calls `forward_features()` and concatenates the two output tensors into one:
-    - `x_norm_clstoken` unsqueezed to `[B, 1, 768]`
-    - `x_norm_patchtokens` as `[B, 196, 768]`
-    - Result: `[B, 197, 768]` — CLS at index 0, patches at 1-196
- 3. **Export** with PyTorch TorchScript ONNX exporter at **opset 14**
- 4. **Simplify** with [onnxsim](https://github.com/daquexian/onnx-simplifier): 2053 → **834 nodes**
- 5. **Save** with external data: `model.onnx` (graph) + `model.onnx_data` (weights)
- 6. **Verify** against PyTorch: max numerical diff = **0.000278**
 
- ## Files
 
- | File | Size | Description |
- |------|------|-------------|
- | `model.onnx` | 172 KB | ONNX graph (opset 14, 834 nodes) |
- | `model.onnx_data` | 327 MB | External weight data |
 
- ## ONNX I/O
 
- | Direction | Name | Shape | Type |
- |-----------|------|-------|------|
- | Input | `pixel_values` | `[1, 3, 224, 224]` | float32 |
- | Output | `last_hidden_state` | `[1, 197, 768]` | float32 |
 
- Output layout: index 0 is the L2-normalized CLS token, indices 1-196 are L2-normalized patch tokens on a 14x14 spatial grid.
 
- ## Usage
 
- ### With latent-inspector (Rust)
 
- ```bash
- # Auto-downloads on first use (~327 MB)
- latent-inspector inspect photo.jpg --model eupe-vit-b16
- latent-inspector compare photo.jpg --models dinov2-vit-l14,eupe-vit-b16
- ```
 
- ### With ONNX Runtime (Python)
 
  ```python
  import onnxruntime as ort
  import numpy as np
- from PIL import Image
- from torchvision import transforms
-
- transform = transforms.Compose([
-     transforms.Resize(224, interpolation=transforms.InterpolationMode.LANCZOS),
-     transforms.CenterCrop(224),
-     transforms.ToTensor(),
-     transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
- ])
-
- image = Image.open("photo.jpg").convert("RGB")
- tensor = transform(image).unsqueeze(0).numpy()
 
  session = ort.InferenceSession("model.onnx")
- output = session.run(None, {"pixel_values": tensor})[0] # [1, 197, 768]
-
- cls_token = output[0, 0, :] # Global image representation [768]
- patch_tokens = output[0, 1:, :] # Per-region features [196, 768]
  ```
 
- ### With ONNX Runtime (Rust)
-
- ```rust
- let session = ort::session::Session::builder()?
-     .with_intra_threads(4)?
-     .commit_from_file("model.onnx")?;
-
- let input = ndarray::Array4::<f32>::zeros((1, 3, 224, 224));
- // ... fill with preprocessed image ...
-
- let outputs = session.run(ort::inputs!["pixel_values" => input])?;
- let hidden = outputs["last_hidden_state"].try_extract_tensor::<f32>()?;
- // shape: [1, 197, 768]
- ```
-
- ## Representation Fingerprint
-
- Compared against other SSL models on the same image (real ONNX inference via [latent-inspector](https://github.com/AbdelStark/latent-inspector)):
-
- | Metric | EUPE ViT-B/16 | DINOv2 ViT-L/14 | I-JEPA ViT-H/14 | V-JEPA 2 ViT-L/16 |
- |--------|--------------|-----------------|------------------|--------------------|
- | Effective rank | 17/768 | 60/1024 | 44/1280 | 64/1024 |
- | Top-10 variance | 88.8% | 66.8% | 72.7% | 58.1% |
- | Patch isotropy | 0.026 | 0.796 | 0.788 | 0.678 |
- | CKA vs DINOv2 | 0.044 | 1.000 | 0.329 | 0.358 |
-
- EUPE produces a highly concentrated representation where patches are nearly all directionally aligned (isotropy 0.026). Multi-teacher distillation converges on a compact universal feature set rather than the spatially diverse representations typical of single-objective models like DINOv2.
 
  ## Citation
 
  ```bibtex
- @article{zhu2026eupe,
  title={Efficient Universal Perception Encoder},
- author={Zhu, Chenchen and Suri, Saksham and others},
- journal={arXiv preprint arXiv:2603.22387},
- year={2026}
  }
  ```
-
- ## Acknowledgments
-
- Original weights by Meta FAIR under the [FAIR Research License](https://huggingface.co/facebook/fair-noncommercial-research-license/). ONNX conversion and hosting by [@AbdelStark](https://github.com/AbdelStark) for the [latent-inspector](https://github.com/AbdelStark/latent-inspector) project.
 
  - self-supervised
  - feature-extraction
  - onnx
  - vit
+ - proxy-distillation
  pipeline_tag: image-feature-extraction
  library_name: onnxruntime
  ---
 
+ # EUPE ViT-B/16 ONNX Export
 
+ Corrected ONNX export of [facebook/EUPE-ViT-B](https://huggingface.co/facebook/EUPE-ViT-B) for use with ONNX Runtime and [latent-inspector](https://github.com/AbdelStark/latent-inspector).
 
+ This bundle supersedes the earlier broken export. The current artifact is validated against the upstream PyTorch model on 5 sample images and also passes an input-independence gate.
 
+ ## Model
 
+ EUPE (Efficient Universal Perception Encoder) is a ViT-B/16 vision encoder trained with a proxy-distillation pipeline: a compact 86M student distilled from a large proxy teacher that aggregates multiple expert perception models.
 
  | Property | Value |
  |----------|-------|
@@ -35,115 +33,80 @@
  | Patch size | 16 px |
  | Input size | 224 x 224 |
  | Output tokens | 197 (1 CLS + 196 patches) |
+ | Base checkpoint | [facebook/EUPE-ViT-B](https://huggingface.co/facebook/EUPE-ViT-B) |
  | Paper | [Zhu et al. 2026](https://arxiv.org/abs/2603.22387) |
+ | Upstream code | [facebookresearch/eupe](https://github.com/facebookresearch/eupe) |
+ | License | [FAIR Research License](https://huggingface.co/facebook/fair-noncommercial-research-license/) |
 
+ ## Export method
 
+ The corrected export path is:
 
+ 1. Download `EUPE-ViT-B.pt` from the upstream Hugging Face repo.
+ 2. Load the model through the official `facebookresearch/eupe` torch.hub entrypoint `eupe_vitb16`.
+ 3. Call `forward_features()` and concatenate:
+    - `x_norm_clstoken -> [B, 1, 768]`
+    - `x_norm_patchtokens -> [B, 196, 768]`
+    - final output `last_hidden_state -> [B, 197, 768]`
+ 4. Export with the legacy TorchScript ONNX path (`dynamo=False`).
+ 5. Save as `model.onnx` + `model.onnx_data`.
 
+ The newer `torch.export` / `dynamo=True` ONNX exporter currently fails on EUPE during decomposition, so this artifact intentionally uses the legacy exporter until the upstream exporter bug is fixed.
 
+ ## Validation
 
+ Validation report: [`export.validation.json`](./export.validation.json)
 
+ The artifact was accepted with these gates:
 
+ - CLS cosine `>= 0.995`
+ - Patch cosine `>= 0.99`
+ - CLS mean abs diff `<= 0.03`
+ - Patch mean abs diff `<= 0.05`
+ - CLS max abs diff `<= 0.5`
+ - Patch max abs diff `<= 5.0`
+ - Input-independence cosine `< 0.85`
 
+ Observed export result:
 
+ - `validation_passed = true`
+ - Worst CLS cosine across 5 images: `0.998392`
+ - Worst patch cosine across 5 images: `0.994251`
+ - Worst CLS mean abs diff: `0.022487`
+ - Worst patch mean abs diff: `0.030653`
+ - Input-independence cosine: `0.744812`
 
+ ## Files
 
+ | File | Description |
+ |------|-------------|
+ | `model.onnx` | ONNX graph |
+ | `model.onnx_data` | External tensor data |
+ | `export.validation.json` | PyTorch vs ONNX parity report for this export |
+
+ ## Usage
 
  ```python
  import onnxruntime as ort
  import numpy as np
 
  session = ort.InferenceSession("model.onnx")
+ pixel_values = np.zeros((1, 3, 224, 224), dtype=np.float32)
+ last_hidden_state = session.run(["last_hidden_state"], {"pixel_values": pixel_values})[0]
  ```
 
+ Output layout: token 0 is the CLS embedding and tokens 1-196 are patch embeddings on a 14x14 grid.
 
  ## Citation
 
  ```bibtex
+ @misc{zhu2026eupe,
  title={Efficient Universal Perception Encoder},
+ author={Zhu, Chenchen and Suri, Saksham and Jose, Cijo and Oquab, Maxime and Szafraniec, Marc and Wen, Wei and Xiong, Yunyang and Labatut, Patrick and Bojanowski, Piotr and Krishnamoorthi, Raghuraman and Chandra, Vikas},
+ year={2026},
+ eprint={2603.22387},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV},
+ url={https://arxiv.org/abs/2603.22387},
  }
  ```
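
The acceptance gates listed in the new README's Validation section can be read as plain array comparisons between a reference output and an ONNX output. The following is a minimal sketch of that idea, not the script that produced `export.validation.json`. The thresholds are copied from the README; the helper names (`cosine`, `split_tokens`, `parity_report`), the choice to compute cosine over flattened arrays, and the reading of the input-independence gate as "outputs for two different images must not be near-identical" are assumptions made for illustration. The arrays are synthetic stand-ins, not real model outputs.

```python
import numpy as np

# Thresholds copied from the gates listed in the README's Validation section.
GATES = {
    "cls_cosine_min": 0.995,
    "patch_cosine_min": 0.99,
    "cls_mean_abs_max": 0.03,
    "patch_mean_abs_max": 0.05,
    "cls_max_abs_max": 0.5,
    "patch_max_abs_max": 5.0,
    "independence_cosine_max": 0.85,
}


def cosine(a, b):
    """Cosine similarity of two arrays after flattening (an assumed definition)."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def split_tokens(hidden):
    """Split a [1, 197, 768] output into CLS [768] and patches [196, 768]."""
    return hidden[0, 0], hidden[0, 1:]


def parity_report(reference, candidate, candidate_other):
    """Compare a reference output with an ONNX output for the same image.

    `candidate_other` is the ONNX output for a different image. It feeds the
    input-independence check (assumed meaning: outputs must vary with input).
    """
    cls_ref, patch_ref = split_tokens(reference)
    cls_out, patch_out = split_tokens(candidate)
    cls_diff = np.abs(cls_ref - cls_out)
    patch_diff = np.abs(patch_ref - patch_out)
    report = {
        "cls_cosine": cosine(cls_ref, cls_out),
        "patch_cosine": cosine(patch_ref, patch_out),
        "cls_mean_abs_diff": float(cls_diff.mean()),
        "patch_mean_abs_diff": float(patch_diff.mean()),
        "cls_max_abs_diff": float(cls_diff.max()),
        "patch_max_abs_diff": float(patch_diff.max()),
        "independence_cosine": cosine(candidate, candidate_other),
    }
    report["validation_passed"] = all([
        report["cls_cosine"] >= GATES["cls_cosine_min"],
        report["patch_cosine"] >= GATES["patch_cosine_min"],
        report["cls_mean_abs_diff"] <= GATES["cls_mean_abs_max"],
        report["patch_mean_abs_diff"] <= GATES["patch_mean_abs_max"],
        report["cls_max_abs_diff"] <= GATES["cls_max_abs_max"],
        report["patch_max_abs_diff"] <= GATES["patch_max_abs_max"],
        report["independence_cosine"] < GATES["independence_cosine_max"],
    ])
    return report


# Synthetic stand-ins for real model outputs, only to exercise the checks.
rng = np.random.default_rng(0)
reference = rng.standard_normal((1, 197, 768)).astype(np.float32)
noise = rng.standard_normal(reference.shape).astype(np.float32)
candidate = reference + 1e-3 * noise
candidate_other = rng.standard_normal((1, 197, 768)).astype(np.float32)

report = parity_report(reference, candidate, candidate_other)
print(report["validation_passed"])
```

In a real check, `reference` would come from the upstream PyTorch model and `candidate` from `session.run(...)` on the same preprocessed image. Passing the same array as both `candidate` and `candidate_other` makes the independence gate fail, which is the kind of failure that gate appears designed to catch.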