---
license: mit
pipeline_tag: image-to-3d
tags:
- 3d
- onnx
- ios
- triposr
- 3d-reconstruction
- mobile
- coreml
- on-device
---

# TripoSR iOS (ONNX)

**Single image → 3D mesh, on your iPhone.**

The ONNX-converted encoder from [TripoSR](https://github.com/VAST-AI-Research/TripoSR) by **Stability AI** × **Tripo AI**, optimized for on-device inference.

| Parameters | Model Size | Inference (A100) | Format | License |
|:--|:--|:--|:--|:--|
| 419M | 1.6 GB | < 0.5s | ONNX | MIT |

---

## Demo

_Figure: input photo and the resulting 3D output._

---

## Benchmarks

Evaluated on the [GSO](https://goo.gl/datasets/GoogleScannedObjects) and [OmniObject3D](https://omniobject3d.github.io/) datasets. Results are from the [TripoSR paper](https://arxiv.org/abs/2403.02151).

### F-Score @ 0.1 (higher is better)

![F-Score Comparison](assets/chart_fscore.png)

### Chamfer Distance (lower is better)

![Chamfer Distance Comparison](assets/chart_chamfer.png)

### F-Score Across Thresholds

![F-Score Line Chart](assets/chart_line.png)

### Cross-Dataset Comparison

![Grouped Comparison](assets/chart_grouped.png)

### Full Results Table

**GSO Dataset**

| Method | CD ↓ | FS@0.1 ↑ | FS@0.2 ↑ | FS@0.5 ↑ |
|:--|:--|:--|:--|:--|
| One-2-3-45 | 0.227 | 0.382 | 0.630 | 0.878 |
| OpenLRM | 0.180 | 0.430 | 0.698 | 0.938 |
| ZeroShape | 0.160 | 0.489 | 0.757 | 0.952 |
| TGS | 0.122 | 0.637 | 0.846 | 0.968 |
| **TripoSR** | **0.111** | **0.651** | **0.871** | **0.980** |

**OmniObject3D Dataset**

| Method | CD ↓ | FS@0.1 ↑ | FS@0.2 ↑ | FS@0.5 ↑ |
|:--|:--|:--|:--|:--|
| One-2-3-45 | 0.197 | 0.445 | 0.698 | 0.907 |
| ZeroShape | 0.144 | 0.507 | 0.786 | 0.968 |
| OpenLRM | 0.155 | 0.486 | 0.759 | 0.959 |
| TGS | 0.142 | 0.602 | 0.818 | 0.949 |
| **TripoSR** | **0.102** | **0.677** | **0.890** | **0.986** |
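For readers unfamiliar with the two metrics in these tables, here is a minimal sketch of how Chamfer Distance (CD) and F-Score at a threshold (FS@t) are commonly computed between two point sets. It is an illustration only: the point sampling, alignment, and normalization used in the TripoSR paper are not described in this README, and the function names below are hypothetical.

```python
import numpy as np


def nearest_distances(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """For each point in a (N, 3), distance to its closest point in b (M, 3)."""
    diff = a[:, None, :] - b[None, :, :]          # (N, M, 3) pairwise offsets
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)


def chamfer_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric mean nearest-neighbor distance (one common CD definition)."""
    return 0.5 * float(nearest_distances(pred, gt).mean()
                       + nearest_distances(gt, pred).mean())


def f_score(pred: np.ndarray, gt: np.ndarray, threshold: float) -> float:
    """Harmonic mean of precision and recall at a distance threshold."""
    precision = float((nearest_distances(pred, gt) < threshold).mean())
    recall = float((nearest_distances(gt, pred) < threshold).mean())
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

Lower CD means the predicted surface sits closer to the ground truth on average. A higher FS@t means more points on each surface have a counterpart within distance `t` on the other. The brute-force pairwise computation is fine for a few thousand points; larger sets would need a spatial index.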

---

## Architecture

One forward pass: no diffusion, no iterative denoising.

```mermaid
graph LR
    A["Input Image<br/>(512x512)"] --> B["DINO ViT-B/16<br/>Image Tokenizer"]
    B --> C["Transformer Decoder<br/>+ Cross-Attention"]
    C --> D["Post Processor<br/>Triplane Features"]
    D --> E["Marching Cubes<br/>3D Mesh"]
    style A fill:#4a9eff,stroke:#30363d,color:#fff
    style B fill:#7c3aed,stroke:#30363d,color:#fff
    style C fill:#7c3aed,stroke:#30363d,color:#fff
    style D fill:#7c3aed,stroke:#30363d,color:#fff
    style E fill:#3fb950,stroke:#30363d,color:#fff
```

| Component | Parameters | Role |
|:--|:--|:--|
| **DINO ViT-B/16** | ~86M | Pretrained image encoder |
| **Transformer Decoder** | ~268M | Cross-attention to image tokens |
| **Triplane Post-Processor** | ~65M | Tokens → triplane features `(3x40x64x64)` |

---

## PyTorch vs. This Model

| | Original | This Conversion |
|:--|:--|:--|
| **Format** | PyTorch | ONNX |
| **Size** | ~3 GB+ | 1.6 GB |
| **Runs on** | GPU server | iPhone / iPad / Mac |
| **Dependencies** | torch, einops, transformers | onnxruntime |
| **Connectivity** | Cloud API | Fully offline |

---

## What I Learned Getting This to Work Well

Getting TripoSR to produce clean 3D meshes on a phone took more work than just converting the model to ONNX. The raw model expects a very specific kind of input: a single object, centered, on a neutral background. If you just feed it a raw photo, the results are pretty rough.

The biggest improvement came from **stripping the background** before inference. I'm using Apple's **Vision framework** (`VNGenerateForegroundInstanceMaskRequest` on iOS 17+) to automatically detect and isolate the main subject. This is the same API that powers the "lift subject from background" feature in Photos; it's fast, runs on-device, and handles edges surprisingly well. The isolated subject gets composited onto a **flat gray background** (RGB 0.5, 0.5, 0.5), which matches what TripoSR was trained on.

The second big win was **smart cropping and centering**. After removing the background, I analyze the remaining foreground pixels to find the bounding box, then scale and center the subject so it fills roughly **85-95% of the frame**. Too small and the model loses detail; too large and geometry gets clipped.
The fill ratio adapts based on the object's shape: tall/narrow objects get a bit more breathing room, compact objects fill more of the frame. A small amount of padding (2-6%) prevents edge artifacts.

I also added a lightweight **image enhancement pipeline** before inference: noise reduction, luminance sharpening, and edge smoothing after the resize. Lanczos resampling (instead of bilinear) for the 512x512 resize made a noticeable difference in preserving fine detail. All of this runs through Core Image with Metal acceleration, so it adds minimal overhead.

The full pipeline (background removal, crop, center, enhance, infer) runs entirely on-device in [Haplo AI](https://apps.apple.com/us/app/haplo-ai-offline-private-ai/id6746702574). No server, no internet required.

---

## Quick Start

### Python

```python
import onnxruntime as ort
import numpy as np
from PIL import Image

# Load the graph; the external weights file (.onnx.data) must sit next to it.
session = ort.InferenceSession(
    "triposr_encoder.onnx",
    providers=["CPUExecutionProvider"],  # or "CoreMLExecutionProvider"
)

# Resize to the 512x512 input and scale pixel values to [0, 1].
image = Image.open("photo.png").convert("RGB").resize((512, 512))
input_array = np.array(image).astype(np.float32) / 255.0

# HWC -> CHW, then add a batch dimension: (1, 3, 512, 512).
input_array = input_array.transpose(2, 0, 1)[np.newaxis, ...]

scene_codes = session.run(None, {"input_image": input_array})[0]
# scene_codes.shape == (1, 3, 40, 64, 64)
```
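The snippet above feeds a plain resized photo. The preprocessing described in "What I Learned Getting This to Work Well" (gray background, bounding-box crop, centering, Lanczos resize) could be approximated in Python roughly as follows. This is a sketch under assumptions, not the app's Core Image pipeline: it expects an RGBA image whose alpha channel already holds the foreground mask, the function names are hypothetical, and the `adaptive_fill` mapping from aspect ratio to fill ratio is invented for illustration, since the exact rule is not given here.

```python
import numpy as np
from PIL import Image

GRAY = (128, 128, 128)  # closest 8-bit value to RGB (0.5, 0.5, 0.5)


def adaptive_fill(width: int, height: int) -> float:
    """Hypothetical rule: compact shapes fill 95% of the frame, elongated ones 85%."""
    aspect = max(width, height) / max(1, min(width, height))
    t = min(max(aspect - 1.0, 0.0), 1.0)  # 0 for a square, 1 once aspect >= 2
    return 0.95 - 0.10 * t


def preprocess(rgba: Image.Image, size: int = 512) -> np.ndarray:
    """Return a (1, 3, size, size) float32 array with the subject centered on gray."""
    rgba = rgba.convert("RGBA")
    alpha = np.asarray(rgba)[..., 3]

    # Bounding box of the foreground pixels (alpha above 50%).
    ys, xs = np.nonzero(alpha > 127)
    if xs.size == 0:
        raise ValueError("mask contains no foreground pixels")
    box = (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)
    subject = rgba.crop(box)

    # Scale so the longer side covers the chosen fraction of the frame.
    fill = adaptive_fill(subject.width, subject.height)
    scale = fill * size / max(subject.width, subject.height)
    new_size = (max(1, round(subject.width * scale)),
                max(1, round(subject.height * scale)))
    subject = subject.resize(new_size, Image.LANCZOS)

    # Paste centered on flat gray, using the alpha channel as the blend mask.
    canvas = Image.new("RGB", (size, size), GRAY)
    offset = ((size - new_size[0]) // 2, (size - new_size[1]) // 2)
    canvas.paste(subject, offset, mask=subject.getchannel("A"))

    # Same layout as the quick start: float32 in [0, 1], NCHW.
    arr = np.asarray(canvas, dtype=np.float32) / 255.0
    return arr.transpose(2, 0, 1)[np.newaxis, ...]
```

The returned array can be passed to `session.run` in place of `input_array`. The enhancement steps (noise reduction, sharpening, edge smoothing) are left out of this sketch.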

### Swift (iOS)

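```swift
import OnnxRuntimeBindings

// `modelPath` is the path to triposr_encoder.onnx; `imageData` is an
// NSMutableData holding the float32 pixels in CHW order.
let env = try ORTEnv(loggingLevel: .warning)
let session = try ORTSession(env: env, modelPath: modelPath, sessionOptions: nil)

let inputTensor = try ORTValue(
    tensorData: imageData,
    elementType: .float,
    shape: [1, 3, 512, 512]
)

let outputs = try session.run(
    withInputs: ["input_image": inputTensor],
    outputNames: ["scene_codes"],
    runOptions: nil
)
```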

---

## Files

| File | Size | Description |
|:--|:--|:--|
| `triposr_encoder.onnx` | 2.6 MB | Model graph |
| `triposr_encoder.onnx.data` | 1.6 GB | Weights |

---

## Citation

```bibtex
@article{TripoSR2024,
  title={TripoSR: Fast 3D Object Reconstruction from a Single Image},
  author={Tochilkin, Dmitry and Pankratz, David and Liu, Zexiang and Huang, Zixuan and Letts, Adam and Li, Yangguang and Liang, Ding and Laforte, Christian and Jampani, Varun and Cao, Yan-Pei},
  journal={arXiv preprint arXiv:2403.02151},
  year={2024}
}
```

_MIT License • Based on TripoSR by Stability AI × Tripo AI_