Replace broken EUPE ONNX export with validated artifact
README.md
CHANGED
@@ -8,23 +8,21 @@ tags:
 - self-supervised
 - feature-extraction
 - onnx
-- distillation
 - vit
-
-- ILSVRC/imagenet-1k
 pipeline_tag: image-feature-extraction
 library_name: onnxruntime
 ---

-# EUPE ViT-B/16

-ONNX export of [facebook/EUPE-ViT-B](https://huggingface.co/facebook/EUPE-ViT-B) for use with [latent-inspector](https://github.com/AbdelStark/latent-inspector)

-

-

-

 | Property | Value |
 |----------|-------|
@@ -35,115 +33,80 @@ The result is a compact encoder that performs well across classification, segmen
 | Patch size | 16 px |
 | Input size | 224 x 224 |
 | Output tokens | 197 (1 CLS + 196 patches) |
-
 | Paper | [Zhu et al. 2026](https://arxiv.org/abs/2603.22387) |
-
-| License | [FAIR Research License](https://huggingface.co/facebook/fair-noncommercial-research-license/)

-##

-

-1.
-2.
-
-- `
--
-
-4.
-5.
-6. **Verify** against PyTorch: max numerical diff = **0.000278**

-

-
-|------|------|-------------|
-| `model.onnx` | 172 KB | ONNX graph (opset 14, 834 nodes) |
-| `model.onnx_data` | 327 MB | External weight data |

-

-
-|-----------|------|-------|------|
-| Input | `pixel_values` | `[1, 3, 224, 224]` | float32 |
-| Output | `last_hidden_state` | `[1, 197, 768]` | float32 |

-

-

-

-
-# Auto-downloads on first use (~327 MB)
-latent-inspector inspect photo.jpg --model eupe-vit-b16
-latent-inspector compare photo.jpg --models dinov2-vit-l14,eupe-vit-b16
-```

-

 ```python
 import onnxruntime as ort
 import numpy as np
-from PIL import Image
-from torchvision import transforms
-
-transform = transforms.Compose([
-    transforms.Resize(224, interpolation=transforms.InterpolationMode.LANCZOS),
-    transforms.CenterCrop(224),
-    transforms.ToTensor(),
-    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
-])
-
-image = Image.open("photo.jpg").convert("RGB")
-tensor = transform(image).unsqueeze(0).numpy()

 session = ort.InferenceSession("model.onnx")
-
-
-cls_token = output[0, 0, :]  # Global image representation [768]
-patch_tokens = output[0, 1:, :]  # Per-region features [196, 768]
 ```

-
-
-```rust
-let session = ort::session::Session::builder()?
-    .with_intra_threads(4)?
-    .commit_from_file("model.onnx")?;
-
-let input = ndarray::Array4::<f32>::zeros((1, 3, 224, 224));
-// ... fill with preprocessed image ...
-
-let outputs = session.run(ort::inputs!["pixel_values" => input])?;
-let hidden = outputs["last_hidden_state"].try_extract_tensor::<f32>()?;
-// shape: [1, 197, 768]
-```
-
-## Representation Fingerprint
-
-Compared against other SSL models on the same image (real ONNX inference via [latent-inspector](https://github.com/AbdelStark/latent-inspector)):
-
-| Metric | EUPE ViT-B/16 | DINOv2 ViT-L/14 | I-JEPA ViT-H/14 | V-JEPA 2 ViT-L/16 |
-|--------|--------------|-----------------|------------------|--------------------|
-| Effective rank | 17/768 | 60/1024 | 44/1280 | 64/1024 |
-| Top-10 variance | 88.8% | 66.8% | 72.7% | 58.1% |
-| Patch isotropy | 0.026 | 0.796 | 0.788 | 0.678 |
-| CKA vs DINOv2 | 0.044 | 1.000 | 0.329 | 0.358 |
-
-EUPE produces a highly concentrated representation where patches are nearly all directionally aligned (isotropy 0.026). Multi-teacher distillation converges on a compact universal feature set rather than the spatially diverse representations typical of single-objective models like DINOv2.

 ## Citation

 ```bibtex
-@
   title={Efficient Universal Perception Encoder},
-  author={Zhu, Chenchen and Suri, Saksham and
-
-
 }
 ```
-
-## Acknowledgments
-
-Original weights by Meta FAIR under the [FAIR Research License](https://huggingface.co/facebook/fair-noncommercial-research-license/). ONNX conversion and hosting by [@AbdelStark](https://github.com/AbdelStark) for the [latent-inspector](https://github.com/AbdelStark/latent-inspector) project.
@@ -8,23 +8,21 @@ tags:
 - self-supervised
 - feature-extraction
 - onnx
 - vit
+- proxy-distillation
 pipeline_tag: image-feature-extraction
 library_name: onnxruntime
 ---

+# EUPE ViT-B/16 ONNX Export

+Corrected ONNX export of [facebook/EUPE-ViT-B](https://huggingface.co/facebook/EUPE-ViT-B) for use with ONNX Runtime and [latent-inspector](https://github.com/AbdelStark/latent-inspector).

+This bundle supersedes the earlier broken export. The current artifact is validated against the upstream PyTorch model on 5 sample images and also passes an input-independence gate.

+## Model

+EUPE (Efficient Universal Perception Encoder) is a ViT-B/16 vision encoder trained with a proxy-distillation pipeline: a compact 86M student distilled from a large proxy teacher that aggregates multiple expert perception models.

 | Property | Value |
 |----------|-------|
@@ -35,115 +33,80 @@ The result is a compact encoder that performs well across classification, segmen
 | Patch size | 16 px |
 | Input size | 224 x 224 |
 | Output tokens | 197 (1 CLS + 196 patches) |
+| Base checkpoint | [facebook/EUPE-ViT-B](https://huggingface.co/facebook/EUPE-ViT-B) |
 | Paper | [Zhu et al. 2026](https://arxiv.org/abs/2603.22387) |
+| Upstream code | [facebookresearch/eupe](https://github.com/facebookresearch/eupe) |
+| License | [FAIR Research License](https://huggingface.co/facebook/fair-noncommercial-research-license/) |

+## Export method

+The corrected export path is:

+1. Download `EUPE-ViT-B.pt` from the upstream Hugging Face repo.
+2. Load the model through the official `facebookresearch/eupe` torch.hub entrypoint `eupe_vitb16`.
+3. Call `forward_features()` and concatenate:
+   - `x_norm_clstoken -> [B, 1, 768]`
+   - `x_norm_patchtokens -> [B, 196, 768]`
+   - final output `last_hidden_state -> [B, 197, 768]`
+4. Export with the legacy TorchScript ONNX path (`dynamo=False`).
+5. Save as `model.onnx` + `model.onnx_data`.

+The newer `torch.export` / `dynamo=True` ONNX exporter currently fails on EUPE during decomposition, so this artifact intentionally uses the legacy exporter until the upstream exporter bug is fixed.
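
The shape bookkeeping in step 3 of the export path can be sketched without any framework. This is only an illustration under assumptions: plain nested lists stand in for tensors, and `concat_tokens` and `shape` are hypothetical helper names, not part of the upstream `facebookresearch/eupe` code. A real export wrapper would presumably do the same join on tensors along the token axis.

```python
# Hypothetical, framework-free sketch of step 3: joining the CLS token and the
# patch tokens along the token axis. Nested lists stand in for tensors.

def concat_tokens(cls_tokens, patch_tokens):
    """Join [B, 1, D] CLS tokens and [B, N, D] patch tokens into [B, 1 + N, D]."""
    if len(cls_tokens) != len(patch_tokens):
        raise ValueError("batch sizes differ")
    # For each batch item, list concatenation puts the CLS row first.
    return [cls + patches for cls, patches in zip(cls_tokens, patch_tokens)]

def shape(nested):
    """Return the (B, T, D) shape of a rectangular three-level nested list."""
    return (len(nested), len(nested[0]), len(nested[0][0]))

B, N, D = 2, 196, 768
x_norm_clstoken = [[[0.0] * D] for _ in range(B)]                       # [B, 1, 768]
x_norm_patchtokens = [[[1.0] * D for _ in range(N)] for _ in range(B)]  # [B, 196, 768]

last_hidden_state = concat_tokens(x_norm_clstoken, x_norm_patchtokens)
print(shape(last_hidden_state))  # (2, 197, 768)
```

Putting the CLS row first matches the layout the README describes for `last_hidden_state`, where token 0 is the CLS embedding.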

+## Validation

+Validation report: [`export.validation.json`](./export.validation.json)

+The artifact was accepted with these gates:

+- CLS cosine `>= 0.995`
+- Patch cosine `>= 0.99`
+- CLS mean abs diff `<= 0.03`
+- Patch mean abs diff `<= 0.05`
+- CLS max abs diff `<= 0.5`
+- Patch max abs diff `<= 5.0`
+- Input-independence cosine `< 0.85`

+Observed export result:

+- `validation_passed = true`
+- Worst CLS cosine across 5 images: `0.998392`
+- Worst patch cosine across 5 images: `0.994251`
+- Worst CLS mean abs diff: `0.022487`
+- Worst patch mean abs diff: `0.030653`
+- Input-independence cosine: `0.744812`
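
A minimal sketch of how the gates above could be checked, assuming the metrics are plain floats keyed by name. The metric names, the `GATES` table, and `failed_gates` are hypothetical and do not reflect the actual layout of `export.validation.json`. The input-independence gate is read here as a strict upper bound on a cosine between outputs for different inputs; the README does not spell out how that cosine is computed. The observed max abs diff values are not listed in the README, so the demo checks only the five reported numbers.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_abs_diff(a, b):
    """Mean absolute element-wise difference."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def max_abs_diff(a, b):
    """Largest absolute element-wise difference."""
    return max(abs(x - y) for x, y in zip(a, b))

# Thresholds copied from the gate list; the metric names are hypothetical.
GATES = {
    "cls_cosine": lambda v: v >= 0.995,
    "patch_cosine": lambda v: v >= 0.99,
    "cls_mean_abs_diff": lambda v: v <= 0.03,
    "patch_mean_abs_diff": lambda v: v <= 0.05,
    "cls_max_abs_diff": lambda v: v <= 0.5,
    "patch_max_abs_diff": lambda v: v <= 5.0,
    "independence_cosine": lambda v: v < 0.85,
}

def failed_gates(metrics):
    """Return the names of the reported metrics that violate their gate."""
    return [name for name, value in metrics.items() if not GATES[name](value)]

# Worst-case values reported in the README for this export.
observed = {
    "cls_cosine": 0.998392,
    "patch_cosine": 0.994251,
    "cls_mean_abs_diff": 0.022487,
    "patch_mean_abs_diff": 0.030653,
    "independence_cosine": 0.744812,
}
print(failed_gates(observed))  # []
```

An empty list means every reported metric is inside its gate, which is consistent with `validation_passed = true`.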

+## Files

+| File | Description |
+|------|-------------|
+| `model.onnx` | ONNX graph |
+| `model.onnx_data` | External tensor data |
+| `export.validation.json` | PyTorch vs ONNX parity report for this export |
+
+## Usage

 ```python
 import onnxruntime as ort
 import numpy as np

 session = ort.InferenceSession("model.onnx")
+pixel_values = np.zeros((1, 3, 224, 224), dtype=np.float32)
+last_hidden_state = session.run(["last_hidden_state"], {"pixel_values": pixel_values})[0]
 ```

+Output layout: token 0 is the CLS embedding and tokens 1-196 are patch embeddings on a 14x14 grid.
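
The output layout can be made concrete with a small index-mapping sketch. It assumes the patch tokens are stored in row-major order (left to right, then top to bottom), which is the usual ViT convention but is not stated in the README. `token_to_grid` and `patch_pixel_box` are hypothetical helper names.

```python
PATCH_SIZE = 16
IMAGE_SIZE = 224
GRID = IMAGE_SIZE // PATCH_SIZE   # 14 patches per side
NUM_TOKENS = 1 + GRID * GRID      # 1 CLS + 196 patches = 197

def token_to_grid(token_index):
    """Map a patch token index (1..196) to (row, col), assuming row-major order."""
    if not 1 <= token_index < NUM_TOKENS:
        raise ValueError("token 0 is CLS; patch tokens are 1..196")
    return divmod(token_index - 1, GRID)

def patch_pixel_box(token_index):
    """Return (left, top, right, bottom) of the patch in the 224 x 224 input."""
    row, col = token_to_grid(token_index)
    left, top = col * PATCH_SIZE, row * PATCH_SIZE
    return (left, top, left + PATCH_SIZE, top + PATCH_SIZE)

print(NUM_TOKENS)            # 197
print(token_to_grid(1))      # (0, 0)
print(token_to_grid(15))     # (1, 0)
print(patch_pixel_box(196))  # (208, 208, 224, 224)
```

With a NumPy output array, the same mapping would correspond to taking `last_hidden_state[0, 1:]` and reshaping it to `(14, 14, 768)`, again under the row-major assumption.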

 ## Citation

 ```bibtex
+@misc{zhu2026eupe,
   title={Efficient Universal Perception Encoder},
+  author={Zhu, Chenchen and Suri, Saksham and Jose, Cijo and Oquab, Maxime and Szafraniec, Marc and Wen, Wei and Xiong, Yunyang and Labatut, Patrick and Bojanowski, Piotr and Krishnamoorthi, Raghuraman and Chandra, Vikas},
+  year={2026},
+  eprint={2603.22387},
+  archivePrefix={arXiv},
+  primaryClass={cs.CV},
+  url={https://arxiv.org/abs/2603.22387},
 }
 ```