---
license: other
license_name: fair-noncommercial-research
license_link: https://huggingface.co/facebook/fair-noncommercial-research-license/
base_model: facebook/EUPE-ViT-B
tags:
  - vision
  - self-supervised
  - feature-extraction
  - onnx
  - vit
  - proxy-distillation
pipeline_tag: image-feature-extraction
library_name: onnxruntime
---

# EUPE ViT-B/16 ONNX Export

Corrected ONNX export of [facebook/EUPE-ViT-B](https://huggingface.co/facebook/EUPE-ViT-B) for use with ONNX Runtime and [latent-inspector](https://github.com/AbdelStark/latent-inspector).

This bundle supersedes the earlier broken export. The current artifact is validated against the upstream PyTorch model on 5 sample images and also passes an input-independence gate.

## Model

EUPE (Efficient Universal Perception Encoder) is a ViT-B/16 vision encoder trained with a proxy-distillation pipeline: a compact 86M-parameter student is distilled from a large proxy teacher that aggregates multiple expert perception models.

| Property | Value |
|----------|-------|
| Architecture | ViT-B/16 |
| Parameters | 86M |
| Embedding dimension | 768 |
| Layers / Heads | 12 / 12 |
| Patch size | 16 px |
| Input size | 224 x 224 |
| Output tokens | 197 (1 CLS + 196 patches) |
| Base checkpoint | [facebook/EUPE-ViT-B](https://huggingface.co/facebook/EUPE-ViT-B) |
| Paper | [Zhu et al. 2026](https://arxiv.org/abs/2603.22387) |
| Upstream code | [facebookresearch/eupe](https://github.com/facebookresearch/eupe) |
| License | [FAIR Noncommercial Research License](https://huggingface.co/facebook/fair-noncommercial-research-license/) |

## Export method

The corrected export path is:

1. Download `EUPE-ViT-B.pt` from the upstream Hugging Face repo.
2. Load the model through the official `facebookresearch/eupe` torch.hub entrypoint `eupe_vitb16`.
3. Call `forward_features()` and concatenate:
   - `x_norm_clstoken -> [B, 1, 768]`
   - `x_norm_patchtokens -> [B, 196, 768]`
   - final output `last_hidden_state -> [B, 197, 768]`
4. Export with the legacy TorchScript ONNX path (`dynamo=False`).
5. Save as `model.onnx` + `model.onnx_data`.

The newer `torch.export` / `dynamo=True` ONNX exporter currently fails on EUPE during decomposition, so this artifact intentionally uses the legacy exporter until the upstream exporter bug is fixed.
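
Steps 3 and 4 above can be sketched roughly as follows. This is a minimal illustration, not the script that produced this artifact: `TokenConcatWrapper` and `export_onnx` are hypothetical names, the `backbone` argument is assumed to be the module returned by the `eupe_vitb16` entrypoint, and the `unsqueeze` assumes the CLS token may come back as `[B, 768]` before being shaped to `[B, 1, 768]`. Step 5 (splitting weights into `model.onnx_data`) is not shown.

```python
import torch
from torch import nn


class TokenConcatWrapper(nn.Module):
    """Hypothetical wrapper that returns CLS + patch tokens as one tensor."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        feats = self.backbone.forward_features(pixel_values)
        cls = feats["x_norm_clstoken"]
        if cls.dim() == 2:
            # Assumption: reshape [B, 768] to [B, 1, 768] before concatenating.
            cls = cls.unsqueeze(1)
        patches = feats["x_norm_patchtokens"]  # [B, 196, 768]
        return torch.cat([cls, patches], dim=1)  # [B, 197, 768]


def export_onnx(backbone: nn.Module, path: str = "model.onnx") -> None:
    """Export through the legacy TorchScript-based path (dynamo=False)."""
    wrapper = TokenConcatWrapper(backbone).eval()
    dummy = torch.zeros(1, 3, 224, 224)
    torch.onnx.export(
        wrapper,
        (dummy,),
        path,
        input_names=["pixel_values"],
        output_names=["last_hidden_state"],
        dynamo=False,  # the dynamo=True exporter currently fails on EUPE
    )
```

The input and output names match the ones used in the Usage section, so a graph exported this way would be called the same way as the published `model.onnx`.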

## Validation

Validation report: [`export.validation.json`](./export.validation.json)

The artifact was accepted with these gates:

- CLS cosine `>= 0.995`
- Patch cosine `>= 0.99`
- CLS mean abs diff `<= 0.03`
- Patch mean abs diff `<= 0.05`
- CLS max abs diff `<= 0.5`
- Patch max abs diff `<= 5.0`
- Input-independence cosine `< 0.85`

Observed export result:

- `validation_passed = true`
- Worst CLS cosine across 5 images: `0.998392`
- Worst patch cosine across 5 images: `0.994251`
- Worst CLS mean abs diff: `0.022487`
- Worst patch mean abs diff: `0.030653`
- Input-independence cosine: `0.744812`
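
For readers who want to re-run a similar check, the gates above could be evaluated with a helper along these lines. This is a sketch under assumptions rather than the code behind `export.validation.json`: the function names are hypothetical, cosine similarity is computed over the flattened CLS or patch tensor (the report may aggregate per token instead), and the input-independence gate is assumed to mean that the outputs for two different inputs must have cosine similarity below `0.85`.

```python
import numpy as np

# Thresholds copied from the gate list above.
GATES = {
    "cls_cosine_min": 0.995,
    "patch_cosine_min": 0.99,
    "cls_mean_abs_max": 0.03,
    "patch_mean_abs_max": 0.05,
    "cls_max_abs_max": 0.5,
    "patch_max_abs_max": 5.0,
    "input_independence_max": 0.85,
}


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two arrays, flattened to vectors."""
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def parity_metrics(ref: np.ndarray, onnx_out: np.ndarray) -> dict:
    """Compare two [1, 197, 768] outputs; token 0 is CLS, the rest are patches."""
    metrics = {}
    for name, sl in (("cls", slice(0, 1)), ("patch", slice(1, None))):
        r, o = ref[:, sl], onnx_out[:, sl]
        diff = np.abs(r - o)
        metrics[f"{name}_cosine"] = cosine(r, o)
        metrics[f"{name}_mean_abs"] = float(diff.mean())
        metrics[f"{name}_max_abs"] = float(diff.max())
    return metrics


def passes_gates(m: dict) -> bool:
    """True when every parity metric is inside its threshold."""
    return (
        m["cls_cosine"] >= GATES["cls_cosine_min"]
        and m["patch_cosine"] >= GATES["patch_cosine_min"]
        and m["cls_mean_abs"] <= GATES["cls_mean_abs_max"]
        and m["patch_mean_abs"] <= GATES["patch_mean_abs_max"]
        and m["cls_max_abs"] <= GATES["cls_max_abs_max"]
        and m["patch_max_abs"] <= GATES["patch_max_abs_max"]
    )


def is_input_dependent(out_a: np.ndarray, out_b: np.ndarray) -> bool:
    """Assumed reading of the gate: outputs for two different inputs must differ."""
    return cosine(out_a, out_b) < GATES["input_independence_max"]
```

Here `ref` would be the PyTorch output for an image and `onnx_out` the ONNX Runtime output for the same image, with the worst value over all sample images compared against each threshold.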

## Files

| File | Description |
|------|-------------|
| `model.onnx` | ONNX graph |
| `model.onnx_data` | External tensor data |
| `export.validation.json` | PyTorch vs ONNX parity report for this export |

## Usage

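```python
import onnxruntime as ort
import numpy as np

# model.onnx_data must sit next to model.onnx so the external weights resolve.
session = ort.InferenceSession("model.onnx")

# Placeholder input with the expected shape [batch, channels, height, width];
# replace it with a preprocessed 224 x 224 image.
pixel_values = np.zeros((1, 3, 224, 224), dtype=np.float32)

# Output shape is [1, 197, 768].
last_hidden_state = session.run(["last_hidden_state"], {"pixel_values": pixel_values})[0]
```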

Output layout: token 0 is the CLS embedding and tokens 1-196 are patch embeddings on a 14x14 grid.
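
The layout described above can be unpacked with a few lines of NumPy. This is a small sketch: `split_tokens` is a hypothetical helper name, and it assumes the patch tokens are stored in row-major order (patch index = row * 14 + column), which is the usual ViT convention but is not stated explicitly here.

```python
import numpy as np


def split_tokens(last_hidden_state: np.ndarray):
    """Split [B, 197, 768] into CLS [B, 768] and a patch grid [B, 14, 14, 768]."""
    batch, tokens, dim = last_hidden_state.shape
    grid = int(round((tokens - 1) ** 0.5))
    if grid * grid != tokens - 1:
        raise ValueError(f"{tokens - 1} patch tokens do not form a square grid")
    cls_embedding = last_hidden_state[:, 0]
    # Assumption: row-major patch order, so a plain reshape recovers the grid.
    patch_grid = last_hidden_state[:, 1:].reshape(batch, grid, grid, dim)
    return cls_embedding, patch_grid
```

With the 224 x 224 input and 16 px patches, each side of the grid has 224 / 16 = 14 patches, which gives the 196 patch tokens listed in the Model table.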

## Citation

```bibtex
@misc{zhu2026eupe,
  title={Efficient Universal Perception Encoder},
  author={Zhu, Chenchen and Suri, Saksham and Jose, Cijo and Oquab, Maxime and Szafraniec, Marc and Wen, Wei and Xiong, Yunyang and Labatut, Patrick and Bojanowski, Piotr and Krishnamoorthi, Raghuraman and Chandra, Vikas},
  year={2026},
  eprint={2603.22387},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.22387},
}
```