abdelstark/vjepa2-vitl-img16-256-onnx
Image Feature Extraction
Parity-validated ONNX exports of self-supervised vision encoders (V-JEPA 2, EUPE) for the latent-inspector Rust CLI.
Note: V-JEPA 2 ViT-L/16, image-native I/O. Takes a [1,3,256,256] input and handles the 16-frame tubelet internally. The cleanest drop-in for cross-encoder comparison against DINOv2, I-JEPA, and EUPE.
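As a rough illustration of the [1,3,256,256] input layout, the sketch below converts an already-resized 256x256 RGB image into a batched float32 tensor. The helper name `to_model_input` is hypothetical, and the ImageNet mean/std normalization is an assumption; the export may expect different preprocessing, so check the model card before relying on it.

```python
import numpy as np

# Assumption: ImageNet mean/std normalization. The actual export may differ.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def to_model_input(image_hwc_uint8: np.ndarray) -> np.ndarray:
    """Convert a 256x256 RGB uint8 image (H, W, C) into a float32 [1, 3, 256, 256] tensor."""
    if image_hwc_uint8.shape != (256, 256, 3):
        raise ValueError(f"expected (256, 256, 3), got {image_hwc_uint8.shape}")
    # Scale to [0, 1], then normalize each channel.
    scaled = image_hwc_uint8.astype(np.float32) / 255.0
    normalized = (scaled - IMAGENET_MEAN) / IMAGENET_STD
    # Reorder H, W, C -> C, H, W and add the leading batch dimension.
    chw = np.transpose(normalized, (2, 0, 1))
    return np.ascontiguousarray(chw[np.newaxis, ...])


# A black dummy image stands in for real pixel data.
dummy = np.zeros((256, 256, 3), dtype=np.uint8)
batch = to_model_input(dummy)
print(batch.shape, batch.dtype)
```

The resulting array can then be passed to whichever ONNX runtime you use, under the input name the exported graph declares.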
Note: V-JEPA 2 ViT-L/16, minimal 2-frame variant. Use this when you already have video tensors. 304M parameters, 1024-dimensional embeddings, 256 patch tokens.
Note: EUPE ViT-B/16, the Efficient Universal Perception Encoder (86M parameters, 768-dimensional embeddings, 197 tokens including CLS). This is a corrected export produced with the legacy TorchScript exporter after the torch.export path failed on an upstream decomposition.
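The token counts quoted in these notes follow from standard ViT patch arithmetic, sketched below. The function `vit_token_count` is a hypothetical helper. Two inputs are assumptions not stated in the listing: a tubelet size of 2 for the 2-frame V-JEPA 2 variant, and a 224x224 input for EUPE (the size at which a patch-16 ViT yields 196 patches plus CLS).

```python
def vit_token_count(image_size: int, patch_size: int, frames: int = 1,
                    tubelet_size: int = 1, cls_token: bool = False) -> int:
    """Count the tokens a ViT produces for a square image or a short clip."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    if frames % tubelet_size != 0:
        raise ValueError("frames must be divisible by tubelet_size")
    patches_per_side = image_size // patch_size
    spatial_tokens = patches_per_side * patches_per_side
    temporal_slices = frames // tubelet_size
    return spatial_tokens * temporal_slices + (1 if cls_token else 0)


# V-JEPA 2 ViT-L/16, 256x256, 2 frames; tubelet size 2 is an assumption.
vjepa2_tokens = vit_token_count(256, 16, frames=2, tubelet_size=2)

# EUPE ViT-B/16 with a CLS token; the 224x224 input size is an assumption.
eupe_tokens = vit_token_count(224, 16, cls_token=True)

print(vjepa2_tokens, eupe_tokens)  # 256 197
```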