---
license: mit
pipeline_tag: image-to-3d
tags:
- 3d
- onnx
- ios
- triposr
- 3d-reconstruction
- mobile
- coreml
- on-device
---
# TripoSR iOS (ONNX)
**Single image → 3D mesh, on your iPhone.**
The ONNX-converted encoder from [TripoSR](https://github.com/VAST-AI-Research/TripoSR) by **Stability AI** × **Tripo AI** — optimized for on-device inference.

| Parameters | Model Size | Inference (A100) | Format | License |
|:--|:--|:--|:--|:--|
| 419M | 1.6 GB | < 0.5s | ONNX | MIT |
---
## Demo
Figure: input photo and the resulting 3D output.

---
## Benchmarks
Evaluated on [GSO](https://goo.gl/datasets/GoogleScannedObjects) and [OmniObject3D](https://omniobject3d.github.io/) datasets. Results from the [TripoSR paper](https://arxiv.org/abs/2403.02151).

Figures: F-Score @ 0.1 (higher is better), Chamfer Distance (lower is better), F-Score across thresholds, and cross-dataset comparison.

### Full Results Table
**GSO Dataset**
| Method | CD ↓ | FS@0.1 ↑ | FS@0.2 ↑ | FS@0.5 ↑ |
|:--|:--|:--|:--|:--|
| One-2-3-45 | 0.227 | 0.382 | 0.630 | 0.878 |
| OpenLRM | 0.180 | 0.430 | 0.698 | 0.938 |
| ZeroShape | 0.160 | 0.489 | 0.757 | 0.952 |
| TGS | 0.122 | 0.637 | 0.846 | 0.968 |
| **TripoSR** | **0.111** | **0.651** | **0.871** | **0.980** |
**OmniObject3D Dataset**
| Method | CD ↓ | FS@0.1 ↑ | FS@0.2 ↑ | FS@0.5 ↑ |
|:--|:--|:--|:--|:--|
| One-2-3-45 | 0.197 | 0.445 | 0.698 | 0.907 |
| ZeroShape | 0.144 | 0.507 | 0.786 | 0.968 |
| OpenLRM | 0.155 | 0.486 | 0.759 | 0.959 |
| TGS | 0.142 | 0.602 | 0.818 | 0.949 |
| **TripoSR** | **0.102** | **0.677** | **0.890** | **0.986** |
---
## Architecture
One forward pass — no diffusion, no iterative denoising.
```mermaid
graph LR
    A["Input Image<br/>(512x512)"] --> B["DINO ViT-B/16<br/>Image Tokenizer"]
    B --> C["Transformer Decoder<br/>+ Cross-Attention"]
    C --> D["Post Processor<br/>Triplane Features"]
    D --> E["Marching Cubes<br/>3D Mesh"]
style A fill:#4a9eff,stroke:#30363d,color:#fff
style B fill:#7c3aed,stroke:#30363d,color:#fff
style C fill:#7c3aed,stroke:#30363d,color:#fff
style D fill:#7c3aed,stroke:#30363d,color:#fff
style E fill:#3fb950,stroke:#30363d,color:#fff
```
| Component | Parameters | Role |
|:--|:--|:--|
| **DINO ViT-B/16** | ~86M | Pretrained image encoder |
| **Transformer Decoder** | ~268M | Cross-attention to image tokens |
| **Triplane Post-Processor** | ~65M | Tokens → triplane features `(3x40x64x64)` |
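
As a quick sanity check, the component counts in the table add up to the 419M headline figure. The size estimate below assumes the weights are stored as 32-bit floats, which this card does not state explicitly, so treat it as a rough cross-check against the 1.6 GB file size rather than a specification.

```python
# Approximate parameter counts from the table above, in millions.
components = {
    "DINO ViT-B/16": 86,
    "Transformer Decoder": 268,
    "Triplane Post-Processor": 65,
}

total_millions = sum(components.values())

# Assumption: float32 weights, i.e. 4 bytes per parameter.
BYTES_PER_PARAM = 4
size_gb = total_millions * 1e6 * BYTES_PER_PARAM / 1e9

print(f"total parameters: ~{total_millions}M")
print(f"estimated weight size: ~{size_gb:.2f} GB")
```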
---
## PyTorch vs. This Model
| | Original | This Conversion |
|:--|:--|:--|
| **Format** | PyTorch | ONNX |
| **Size** | ~3 GB+ | 1.6 GB |
| **Runs on** | GPU server | iPhone / iPad / Mac |
| **Dependencies** | torch, einops, transformers | onnxruntime |
| **Connectivity** | Cloud API | Fully offline |
---
## What I Learned Getting This to Work Well
Getting TripoSR to produce clean 3D meshes on a phone took more work than converting the model to ONNX. The model expects a very specific kind of input: a single object, centered, on a neutral background. Feed it an unprocessed photo and the mesh quality drops noticeably.
The biggest improvement came from **stripping the background** before inference. I'm using Apple's **Vision framework** (`VNGenerateForegroundInstanceMaskRequest` on iOS 17+) to automatically detect and isolate the main subject. This is the same API that powers the "lift subject from background" feature in Photos — it's fast, runs on-device, and handles edges surprisingly well. The isolated subject gets composited onto a **flat gray background** (RGB 0.5, 0.5, 0.5), which matches what TripoSR was trained on.
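
The Vision request itself is iOS-only, but the compositing step that follows it is easy to illustrate. The sketch below assumes a foreground mask is already available as an array (however it was produced) and blends the photo over flat gray. `composite_on_gray` is a hypothetical helper name, not part of the app's code.

```python
import numpy as np

def composite_on_gray(rgb, alpha, gray=0.5):
    """Blend an RGB image over a flat gray background using a foreground mask.

    rgb:   float array, shape (H, W, 3), values in [0, 1]
    alpha: float array, shape (H, W), 1.0 = foreground, 0.0 = background
    """
    alpha = alpha[..., np.newaxis]          # (H, W, 1) so it broadcasts over RGB
    background = np.full_like(rgb, gray)
    return rgb * alpha + background * (1.0 - alpha)

# Tiny 2x2 example: left column is foreground, right column is background.
rgb = np.ones((2, 2, 3), dtype=np.float32)  # pure white "photo"
alpha = np.array([[1.0, 0.0], [1.0, 0.0]], dtype=np.float32)
out = composite_on_gray(rgb, alpha)
```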
The second big win was **smart cropping and centering**. After removing the background, I analyze the remaining foreground pixels to find the bounding box, then scale and center the subject so it fills roughly **85-95% of the frame**. Too small and the model loses detail; too large and geometry gets clipped. The fill ratio adapts based on the object's shape — tall/narrow objects get a bit more breathing room, compact objects fill more of the frame. A small amount of padding (2-6%) prevents edge artifacts.
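
A minimal NumPy sketch of the crop-and-center idea, under a few assumptions: the mask is a boolean array, the fill ratio is a fixed parameter (the adaptive rule described above is not reproduced here), and resizing uses nearest-neighbour lookup only to keep the example dependency-free. `center_on_canvas` is a hypothetical name.

```python
import numpy as np

def center_on_canvas(rgb, mask, fill=0.9, size=512, gray=0.5):
    """Crop to the mask's bounding box and paste it centered on a gray canvas.

    rgb:  float array (H, W, 3) in [0, 1]; mask: bool array (H, W), True = subject.
    fill: fraction of the canvas the subject's longer side should occupy.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask contains no foreground pixels")
    top, bottom = int(ys.min()), int(ys.max()) + 1
    left, right = int(xs.min()), int(xs.max()) + 1
    crop = rgb[top:bottom, left:right]

    # Scale so the longer side covers `fill` of the canvas.
    h, w = crop.shape[:2]
    scale = fill * size / max(h, w)
    new_h = max(1, int(round(h * scale)))
    new_w = max(1, int(round(w * scale)))

    # Nearest-neighbour resize via index lookup (illustrative only).
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = crop[rows][:, cols]

    canvas = np.full((size, size, 3), gray, dtype=rgb.dtype)
    y0, x0 = (size - new_h) // 2, (size - new_w) // 2
    canvas[y0:y0 + new_h, x0:x0 + new_w] = resized
    return canvas

# Example: a red 40x100 rectangle inside a 100x200 image.
rgb = np.zeros((100, 200, 3), dtype=np.float32)
rgb[20:60, 50:150] = (1.0, 0.0, 0.0)
mask = np.zeros((100, 200), dtype=bool)
mask[20:60, 50:150] = True
canvas = center_on_canvas(rgb, mask)
```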
I also added a lightweight **image enhancement pipeline** before inference: noise reduction, luminance sharpening, and edge smoothing after the resize. Lanczos resampling (instead of bilinear) for the 512x512 resize made a noticeable difference in preserving fine detail. All of this runs through Core Image with Metal acceleration, so it adds minimal overhead.
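
The app does this with Core Image, which is not shown here. As a rough desktop analogue, the sketch below uses Pillow filters as stand-ins: a median filter for noise reduction, an unsharp mask for sharpening (applied to all channels, not luminance only), Lanczos for the resize, and a mild smoothing pass afterwards. The filter choices and settings are illustrative assumptions, not the app's actual parameters.

```python
from PIL import Image, ImageFilter

def enhance_and_resize(image, size=512):
    """Denoise, sharpen, resize with Lanczos, then lightly smooth a PIL image."""
    image = image.convert("RGB")
    # Stand-in for noise reduction: a 3x3 median filter.
    image = image.filter(ImageFilter.MedianFilter(size=3))
    # Stand-in for sharpening: unsharp mask with mild, illustrative settings.
    image = image.filter(ImageFilter.UnsharpMask(radius=1, percent=60, threshold=2))
    # Lanczos resampling instead of bilinear for the final resize.
    image = image.resize((size, size), Image.LANCZOS)
    # Stand-in for edge smoothing after the resize.
    return image.filter(ImageFilter.SMOOTH)

# Example on a synthetic square image, so the snippet runs without an input file.
demo = Image.new("RGB", (1024, 1024), (128, 128, 128))
result = enhance_and_resize(demo)
```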
The full pipeline — background removal, crop, center, enhance, infer — runs entirely on-device in [Haplo AI](https://apps.apple.com/us/app/haplo-ai-offline-private-ai/id6746702574). No server, no internet required.
---
## Quick Start
### Python
```python
import numpy as np
import onnxruntime as ort
from PIL import Image

# Load the model graph; the weights file must sit next to it.
session = ort.InferenceSession(
    "triposr_encoder.onnx",
    providers=["CPUExecutionProvider"],  # or "CoreMLExecutionProvider"
)

# Resize to the 512x512 input, scale to [0, 1], and reorder HWC -> NCHW.
image = Image.open("photo.png").convert("RGB").resize((512, 512))
input_array = np.array(image).astype(np.float32) / 255.0
input_array = input_array.transpose(2, 0, 1)[np.newaxis, ...]

# Run the encoder; the first output holds the triplane scene codes.
scene_codes = session.run(None, {"input_image": input_array})[0]
# scene_codes.shape == (1, 3, 40, 64, 64)
```
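
The `(1, 3, 40, 64, 64)` output is three 40-channel feature planes, not a mesh. A common way to read triplane features is to project a 3D point onto each plane, look up the feature there, and combine the three results. The sketch below assumes a plane order of (x, y), (x, z), (y, z), coordinates normalized to [-1, 1], concatenation as the combine step, and nearest-neighbour lookup for brevity. None of these details are confirmed by this card, so check them against the original TripoSR code before relying on them.

```python
import numpy as np

def sample_triplane(scene_codes, point):
    """Look up a feature vector for one 3D point (nearest-neighbour sampling).

    scene_codes: array of shape (1, 3, C, H, W), as returned by the encoder.
    point:       (x, y, z), each coordinate assumed to lie in [-1, 1].
    """
    planes = scene_codes[0]                     # (3, C, H, W)
    _, _, height, width = planes.shape
    x, y, z = point

    def to_index(coord, length):
        # Map [-1, 1] to a valid pixel index in [0, length - 1].
        idx = int(round((coord + 1.0) / 2.0 * (length - 1)))
        return min(max(idx, 0), length - 1)

    # Assumed plane order: (x, y), (x, z), (y, z).
    pairs = [(x, y), (x, z), (y, z)]
    features = [
        plane[:, to_index(v, height), to_index(u, width)]
        for plane, (u, v) in zip(planes, pairs)
    ]
    return np.concatenate(features)             # shape (3 * C,)

# Random stand-in for real encoder output, just to show the shapes.
scene_codes = np.random.rand(1, 3, 40, 64, 64).astype(np.float32)
feature = sample_triplane(scene_codes, (0.0, 0.25, -0.5))
```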
### Swift (iOS)
```swift
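import Foundation
import OnnxRuntimeBindings

// `modelPath` is the on-disk path to triposr_encoder.onnx. `imageData` is an
// NSMutableData holding 1x3x512x512 float32 pixel values in [0, 1], CHW order.
let env = try ORTEnv(loggingLevel: .warning)
let session = try ORTSession(env: env, modelPath: modelPath, sessionOptions: nil)

let inputTensor = try ORTValue(
    tensorData: imageData,
    elementType: .float,
    shape: [1, 3, 512, 512]
)

// Returns a dictionary keyed by output name.
let outputs = try session.run(
    withInputs: ["input_image": inputTensor],
    outputNames: ["scene_codes"],
    runOptions: nil
)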
```
---
## Files
| File | Size | Description |
|:--|:--|:--|
| `triposr_encoder.onnx` | 2.6 MB | Model graph |
| `triposr_encoder.onnx.data` | 1.6 GB | Weights |
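
The graph file is small because the weights live in the separate `.data` file. As is usual for ONNX models with external data, the weights are resolved relative to the graph file, so both files should be kept in the same directory. The helper below is a hypothetical pre-flight check, shown here against a temporary directory rather than a real download.

```python
from pathlib import Path
import tempfile

def missing_model_files(model_dir):
    """Return the names of expected model files that are absent from model_dir."""
    expected = ["triposr_encoder.onnx", "triposr_encoder.onnx.data"]
    return [name for name in expected if not (Path(model_dir) / name).is_file()]

# Demonstration: a directory that contains the graph file but not the weights.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "triposr_encoder.onnx").write_bytes(b"")
    missing = missing_model_files(tmp)

print(missing)
```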
---
## Citation
```bibtex
@article{TripoSR2024,
title={TripoSR: Fast 3D Object Reconstruction from a Single Image},
author={Tochilkin, Dmitry and Pankratz, David and Liu, Zexiang and Huang, Zixuan
and Letts, Adam and Li, Yangguang and Liang, Ding and Laforte, Christian
and Jampani, Varun and Cao, Yan-Pei},
journal={arXiv preprint arXiv:2403.02151},
year={2024}
}
```
MIT License • Based on TripoSR by Stability AI × Tripo AI