abdelstark/eupe-vit-b16-onnx

@@ -8,23 +8,21 @@ tags:
   - self-supervised
   - feature-extraction
   - onnx
-  - distillation
   - vit
-datasets:
-  - ILSVRC/imagenet-1k
 pipeline_tag: image-feature-extraction
 library_name: onnxruntime
 ---
-# EUPE ViT-B/16 — ONNX Export
-ONNX export of [facebook/EUPE-ViT-B](https://huggingface.co/facebook/EUPE-ViT-B) for use with [latent-inspector](https://github.com/AbdelStark/latent-inspector) and ONNX Runtime.
-## Model
-**EUPE** (Efficient Universal Perception Encoder) is a ViT-B/16 vision encoder distilled from multiple domain-expert foundation models into a single compact backbone. Instead of training a model from scratch on a single objective, EUPE aggregates knowledge from specialist teachers — DINOv2 for classification, depth estimators, segmentation models — into a 1.9B-parameter proxy teacher, then distills that into an efficient 86M-parameter student.
-The result is a compact encoder that performs well across classification, segmentation, depth estimation, and vision-language tasks simultaneously.
 | Property | Value |
 |----------|-------|
@@ -35,115 +33,80 @@ The result is a compact encoder that performs well across classification, segmen
 | Patch size | 16 px |
 | Input size | 224 x 224 |
 | Output tokens | 197 (1 CLS + 196 patches) |
-| Training data | LVD-1689M |
 | Paper | [Zhu et al. 2026](https://arxiv.org/abs/2603.22387) |
-| Original repo | [facebookresearch/EUPE](https://github.com/facebookresearch/EUPE) |
-| License | [FAIR Research License](https://huggingface.co/facebook/fair-noncommercial-research-license/) (non-commercial) |
-## ONNX Export Process
-Exported from the original `facebook/EUPE-ViT-B` checkpoint:
-1. **Load** via the EUPE library's `DinoVisionTransformer` with `pos_embed_rope_dtype="fp32"` — the default BFloat16 RoPE is incompatible with the TorchScript ONNX exporter
-2. **Wrap** with a module that calls `forward_features()` and concatenates the two output tensors into one:
-   - `x_norm_clstoken` unsqueezed to `[B, 1, 768]`
-   - `x_norm_patchtokens` as `[B, 196, 768]`
-   - Result: `[B, 197, 768]` — CLS at index 0, patches at 1-196
-3. **Export** with PyTorch TorchScript ONNX exporter at **opset 14**
-4. **Simplify** with [onnxsim](https://github.com/daquexian/onnx-simplifier): 2053 → **834 nodes**
-5. **Save** with external data: `model.onnx` (graph) + `model.onnx_data` (weights)
-6. **Verify** against PyTorch: max numerical diff = **0.000278**
-## Files
-| File | Size | Description |
-|------|------|-------------|
-| `model.onnx` | 172 KB | ONNX graph (opset 14, 834 nodes) |
-| `model.onnx_data` | 327 MB | External weight data |
-## ONNX I/O
-| Direction | Name | Shape | Type |
-|-----------|------|-------|------|
-| Input | `pixel_values` | `[1, 3, 224, 224]` | float32 |
-| Output | `last_hidden_state` | `[1, 197, 768]` | float32 |
-Output layout: index 0 is the L2-normalized CLS token, indices 1-196 are L2-normalized patch tokens on a 14x14 spatial grid.
-## Usage
-### With latent-inspector (Rust)
-```bash
-# Auto-downloads on first use (~327 MB)
-latent-inspector inspect photo.jpg --model eupe-vit-b16
-latent-inspector compare photo.jpg --models dinov2-vit-l14,eupe-vit-b16
-```
-### With ONNX Runtime (Python)
 ```python
 import onnxruntime as ort
 import numpy as np
-from PIL import Image
-from torchvision import transforms
-transform = transforms.Compose([
-    transforms.Resize(224, interpolation=transforms.InterpolationMode.LANCZOS),
-    transforms.CenterCrop(224),
-    transforms.ToTensor(),
-    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
-])
-image = Image.open("photo.jpg").convert("RGB")
-tensor = transform(image).unsqueeze(0).numpy()
 session = ort.InferenceSession("model.onnx")
-output = session.run(None, {"pixel_values": tensor})[0]  # [1, 197, 768]
-cls_token = output[0, 0, :]       # Global image representation [768]
-patch_tokens = output[0, 1:, :]   # Per-region features [196, 768]
 ```
-### With ONNX Runtime (Rust)
-```rust
-let session = ort::session::Session::builder()?
-    .with_intra_threads(4)?
-    .commit_from_file("model.onnx")?;
-let input = ndarray::Array4::<f32>::zeros((1, 3, 224, 224));
-// ... fill with preprocessed image ...
-let outputs = session.run(ort::inputs!["pixel_values" => input])?;
-let hidden = outputs["last_hidden_state"].try_extract_tensor::<f32>()?;
-// shape: [1, 197, 768]
-```
-## Representation Fingerprint
-Compared against other SSL models on the same image (real ONNX inference via [latent-inspector](https://github.com/AbdelStark/latent-inspector)):
-| Metric | EUPE ViT-B/16 | DINOv2 ViT-L/14 | I-JEPA ViT-H/14 | V-JEPA 2 ViT-L/16 |
-|--------|--------------|-----------------|------------------|--------------------|
-| Effective rank | 17/768 | 60/1024 | 44/1280 | 64/1024 |
-| Top-10 variance | 88.8% | 66.8% | 72.7% | 58.1% |
-| Patch isotropy | 0.026 | 0.796 | 0.788 | 0.678 |
-| CKA vs DINOv2 | 0.044 | 1.000 | 0.329 | 0.358 |
-EUPE produces a highly concentrated representation where patches are nearly all directionally aligned (isotropy 0.026). Multi-teacher distillation converges on a compact universal feature set rather than the spatially diverse representations typical of single-objective models like DINOv2.
 ## Citation
 ```bibtex
-@article{zhu2026eupe,
   title={Efficient Universal Perception Encoder},
-  author={Zhu, Chenchen and Suri, Saksham and others},
-  journal={arXiv preprint arXiv:2603.22387},
-  year={2026}
 }
 ```
-## Acknowledgments
-Original weights by Meta FAIR under the [FAIR Research License](https://huggingface.co/facebook/fair-noncommercial-research-license/). ONNX conversion and hosting by [@AbdelStark](https://github.com/AbdelStark) for the [latent-inspector](https://github.com/AbdelStark/latent-inspector) project.

   - self-supervised
   - feature-extraction
   - onnx
   - vit
+  - proxy-distillation
 pipeline_tag: image-feature-extraction
 library_name: onnxruntime
 ---
+# EUPE ViT-B/16 ONNX Export
+Corrected ONNX export of [facebook/EUPE-ViT-B](https://huggingface.co/facebook/EUPE-ViT-B) for use with ONNX Runtime and [latent-inspector](https://github.com/AbdelStark/latent-inspector).
+This bundle supersedes the earlier broken export. The current artifact is validated against the upstream PyTorch model on 5 sample images and also passes an input-independence gate.
+## Model
+EUPE (Efficient Universal Perception Encoder) is a ViT-B/16 vision encoder trained with a proxy-distillation pipeline: a compact 86M-parameter student is distilled from a large proxy teacher that aggregates multiple expert perception models.
 | Property | Value |
 |----------|-------|
 | Patch size | 16 px |
 | Input size | 224 x 224 |
 | Output tokens | 197 (1 CLS + 196 patches) |
+| Base checkpoint | [facebook/EUPE-ViT-B](https://huggingface.co/facebook/EUPE-ViT-B) |
 | Paper | [Zhu et al. 2026](https://arxiv.org/abs/2603.22387) |
+| Upstream code | [facebookresearch/eupe](https://github.com/facebookresearch/eupe) |
+| License | [FAIR Research License](https://huggingface.co/facebook/fair-noncommercial-research-license/) |
+## Export method
+The corrected export path is:
+1. Download `EUPE-ViT-B.pt` from the upstream Hugging Face repo.
+2. Load the model through the official `facebookresearch/eupe` torch.hub entrypoint `eupe_vitb16`.
+3. Call `forward_features()` and concatenate:
+   - `x_norm_clstoken -> [B, 1, 768]`
+   - `x_norm_patchtokens -> [B, 196, 768]`
+   - final output `last_hidden_state -> [B, 197, 768]`
+4. Export with the legacy TorchScript ONNX path (`dynamo=False`).
+5. Save as `model.onnx` + `model.onnx_data`.
+The newer `torch.export` / `dynamo=True` ONNX exporter currently fails on EUPE during decomposition, so this artifact intentionally uses the legacy exporter until the upstream exporter bug is fixed.
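Steps 3 and 4 above can be sketched roughly as follows. This is a minimal illustration, not the actual export script: `StubBackbone` is a hypothetical stand-in so the snippet runs without downloading the checkpoint, `HiddenStateWrapper` and `export_onnx` are illustrative names, and it assumes `forward_features()` returns a dict keyed by `x_norm_clstoken` (`[B, 768]`) and `x_norm_patchtokens` (`[B, 196, 768]`), as in DINOv2-style backbones.

```python
import torch
from torch import nn


class StubBackbone(nn.Module):
    """Hypothetical stand-in for the EUPE backbone so the sketch runs offline."""

    def __init__(self, dim: int = 768, patch: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward_features(self, x):
        # [B, dim, 14, 14] -> [B, 196, dim]
        tokens = self.proj(x).flatten(2).transpose(1, 2)
        return {
            "x_norm_clstoken": tokens.mean(dim=1),  # [B, dim] (fake CLS token)
            "x_norm_patchtokens": tokens,           # [B, 196, dim]
        }


class HiddenStateWrapper(nn.Module):
    """Concatenates CLS and patch tokens into a single [B, 197, dim] tensor."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone

    def forward(self, pixel_values):
        feats = self.backbone.forward_features(pixel_values)
        cls = feats["x_norm_clstoken"].unsqueeze(1)  # [B, 1, dim]
        patches = feats["x_norm_patchtokens"]        # [B, 196, dim]
        return torch.cat([cls, patches], dim=1)      # [B, 197, dim]


def export_onnx(model: nn.Module, path: str) -> None:
    """Legacy TorchScript export; the `dynamo` keyword needs a recent PyTorch."""
    dummy = torch.zeros(1, 3, 224, 224)
    torch.onnx.export(
        model,
        (dummy,),
        path,
        input_names=["pixel_values"],
        output_names=["last_hidden_state"],
        dynamo=False,
    )


wrapper = HiddenStateWrapper(StubBackbone()).eval()
with torch.no_grad():
    out = wrapper(torch.zeros(2, 3, 224, 224))
print(tuple(out.shape))
```

In a real export the stub would be replaced by the model loaded through the `eupe_vitb16` torch.hub entrypoint, and `export_onnx(wrapper, "model.onnx")` would be called on the wrapped model.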
+## Validation
+Validation report: [`export.validation.json`](./export.validation.json)
+The artifact was accepted with these gates:
+- CLS cosine `>= 0.995`
+- Patch cosine `>= 0.99`
+- CLS mean abs diff `<= 0.03`
+- Patch mean abs diff `<= 0.05`
+- CLS max abs diff `<= 0.5`
+- Patch max abs diff `<= 5.0`
+- Input-independence cosine `< 0.85`
+Observed export result:
+- `validation_passed = true`
+- Worst CLS cosine across 5 images: `0.998392`
+- Worst patch cosine across 5 images: `0.994251`
+- Worst CLS mean abs diff: `0.022487`
+- Worst patch mean abs diff: `0.030653`
+- Input-independence cosine: `0.744812`
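One way such gates could be evaluated is sketched below with NumPy. The thresholds are the ones listed above; everything else is an assumption for illustration: the function names are hypothetical, cosine is computed over flattened token blocks, and the input-independence cosine is taken to be a similarity between outputs for two different inputs. The exact definitions used to produce `export.validation.json` may differ.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two arrays, flattened to vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def parity_metrics(ref: np.ndarray, out: np.ndarray) -> dict:
    """Compare a reference and an exported output, each shaped [197, 768]."""
    metrics = {}
    for name, rows in (("cls", slice(0, 1)), ("patch", slice(1, None))):
        r, o = ref[rows], out[rows]
        diff = np.abs(r - o)
        metrics[f"{name}_cosine"] = cosine(r, o)
        metrics[f"{name}_mean_abs_diff"] = float(diff.mean())
        metrics[f"{name}_max_abs_diff"] = float(diff.max())
    return metrics


def passes_gates(m: dict, independence_cosine: float) -> bool:
    """Apply the acceptance thresholds listed in the model card."""
    return (
        m["cls_cosine"] >= 0.995
        and m["patch_cosine"] >= 0.99
        and m["cls_mean_abs_diff"] <= 0.03
        and m["patch_mean_abs_diff"] <= 0.05
        and m["cls_max_abs_diff"] <= 0.5
        and m["patch_max_abs_diff"] <= 5.0
        and independence_cosine < 0.85
    )


# Synthetic data standing in for real PyTorch and ONNX outputs.
rng = np.random.default_rng(0)
ref = rng.standard_normal((197, 768)).astype(np.float32)
out = ref + (1e-3 * rng.standard_normal((197, 768))).astype(np.float32)
other = rng.standard_normal((197, 768)).astype(np.float32)  # output for a different input

metrics = parity_metrics(ref, out)
independence = cosine(out[0], other[0])
print(passes_gates(metrics, independence))
```

An export whose output barely changes with the input would yield an independence cosine close to 1.0 and fail the last gate even if the other metrics happened to look acceptable.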
+## Files
+| File | Description |
+|------|-------------|
+| `model.onnx` | ONNX graph |
+| `model.onnx_data` | External tensor data |
+| `export.validation.json` | PyTorch vs ONNX parity report for this export |
+## Usage
 ```python
 import onnxruntime as ort
 import numpy as np
 session = ort.InferenceSession("model.onnx")
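+# Placeholder input; replace with a preprocessed image batch of shape [1, 3, 224, 224].
+pixel_values = np.zeros((1, 3, 224, 224), dtype=np.float32)
+# Returns a float32 array of shape [1, 197, 768].
+last_hidden_state = session.run(["last_hidden_state"], {"pixel_values": pixel_values})[0]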
 ```
+Output layout: token 0 is the CLS embedding and tokens 1-196 are patch embeddings on a 14x14 grid.
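The layout described above can be unpacked with plain NumPy. This is a small sketch under one assumption: patch tokens are stored in row-major order, so patch index `i` maps to row `i // 14` and column `i % 14`. `split_tokens` is an illustrative helper name, and a synthetic array stands in for a real model output.

```python
import numpy as np


def split_tokens(last_hidden_state: np.ndarray):
    """Split [B, 197, D] into a CLS embedding [B, D] and a patch grid [B, 14, 14, D]."""
    cls = last_hidden_state[:, 0, :]       # token 0
    patches = last_hidden_state[:, 1:, :]  # tokens 1-196
    batch, n_patches, dim = patches.shape
    side = int(round(n_patches ** 0.5))    # 14 for 196 patches
    grid = patches.reshape(batch, side, side, dim)  # assumes row-major patch order
    return cls, grid


# Synthetic stand-in for the ONNX output.
hidden = np.arange(1 * 197 * 768, dtype=np.float32).reshape(1, 197, 768)
cls, grid = split_tokens(hidden)
print(cls.shape, grid.shape)
```

The grid form is convenient for per-region analysis, for example upsampling a feature map back to the 224 x 224 input resolution.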
 ## Citation
 ```bibtex
+@misc{zhu2026eupe,
   title={Efficient Universal Perception Encoder},
+  author={Zhu, Chenchen and Suri, Saksham and Jose, Cijo and Oquab, Maxime and Szafraniec, Marc and Wen, Wei and Xiong, Yunyang and Labatut, Patrick and Bojanowski, Piotr and Krishnamoorthi, Raghuraman and Chandra, Vikas},
+  year={2026},
+  eprint={2603.22387},
+  archivePrefix={arXiv},
+  primaryClass={cs.CV},
+  url={https://arxiv.org/abs/2603.22387},
 }
 ```