---
title: Asset Harvester
emoji: "\U0001F697"
colorFrom: green
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
short_description: Image-to-3D for autonomous-vehicle simulation assets
---

# Asset Harvester

[**Paper**](https://arxiv.org/abs/2604.18468) | [**Project Page**](https://research.nvidia.com/labs/sil/projects/asset-harvester/) | [**Code**](https://github.com/NVIDIA/asset-harvester) | [**Model**](https://huggingface.co/nvidia/asset-harvester) | [**Data**](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles-NCore)

Upload one image of a single object (vehicle, pedestrian, cyclist, or other road object) and get back a complete 3D Gaussian splat asset ready for simulation.

## Pipeline

```
upload ─▶ image guard (optional) ─▶ object segmentation ─▶ recenter + pad
                                                              │
                                                              ▼
              3D Gaussian splat ◀── TokenGS lifting ◀── multiview diffusion ◀── camera estimation
```

1. **Object segmentation** (`AH_object_seg_jit.pt`) — Mask2Former JIT produces a binary mask of the foreground object at the uploaded image's native resolution.
2. **Camera estimation** (`AH_camera_estimator.safetensors`) — predicts camera pose, distance, FOV, and object dimensions (LWH). Shares the C-RADIO backbone with multiview diffusion to avoid loading it twice.
3. **Multiview diffusion** (`AH_multiview_diffusion.safetensors`) — SparseViewDiT generates 16 novel orbit views conditioned on the input image.
4. **TokenGS lifting** (`AH_tokengs_lifting.safetensors`) — feed-forward 3D Gaussian reconstructor lifts the 16 views to a full 3DGS asset.
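The diagram also shows a "recenter + pad" step between segmentation and camera estimation. The sketch below is one plausible reading of that step, not the repository's implementation: it assumes the step takes a square window centered on the mask's bounding box, widened by a fractional margin. The names `mask_bbox` and `square_crop_window` and the `margin` default are hypothetical.

```python
from typing import List, Tuple

Mask = List[List[int]]  # rows of 0/1 values; stand-in for the segmentation output


def mask_bbox(mask: Mask) -> Tuple[int, int, int, int]:
    """Return (x0, y0, x1, y1), the inclusive bounding box of nonzero pixels."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        raise ValueError("mask is empty: no foreground object found")
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    return cols[0], rows[0], cols[-1], rows[-1]


def square_crop_window(mask: Mask, margin: float = 0.1) -> Tuple[int, int, int]:
    """Return (left, top, side) of a square window centered on the object.

    The window may extend past the image borders; the caller is expected to
    pad those regions (for example with a constant background color).
    The margin is an assumed parameter, expressed as a fraction of the
    longer bounding-box side added on each side.
    """
    x0, y0, x1, y1 = mask_bbox(mask)
    width, height = x1 - x0 + 1, y1 - y0 + 1
    side = int(round(max(width, height) * (1.0 + 2.0 * margin)))
    # Center of the bounding box in pixel-edge coordinates.
    cx, cy = (x0 + x1 + 1) / 2.0, (y0 + y1 + 1) / 2.0
    left = int(round(cx - side / 2.0))
    top = int(round(cy - side / 2.0))
    return left, top, side
```

A negative `left` or `top`, or a window that runs past the image size, marks the region that needs padding rather than cropping.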

## Outputs

- Multiview MP4 (16-frame orbit at 5 fps).
- 3D Gaussian orbit render (MP4).
- Gaussian splat (PLY) ready for simulation engines.
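
To sanity-check the PLY output before loading it into a simulator, the header can be inspected with the standard library alone. This is a minimal sketch of a generic PLY header reader. It assumes the asset follows the common 3DGS convention of one Gaussian per `vertex` element, which this README does not confirm. The function name `read_ply_header` is hypothetical.

```python
from typing import BinaryIO, Dict


def read_ply_header(stream: BinaryIO) -> Dict[str, object]:
    """Parse a PLY header and return its format, element counts, and property names."""
    if stream.readline().strip() != b"ply":
        raise ValueError("not a PLY file")
    info = {"format": None, "elements": {}, "properties": {}}
    current = None  # name of the element whose properties are being read
    while True:
        raw = stream.readline()
        if not raw:
            raise ValueError("unexpected end of file before end_header")
        parts = raw.decode("ascii").split()
        if not parts or parts[0] == "comment":
            continue
        if parts[0] == "end_header":
            break
        if parts[0] == "format":
            info["format"] = parts[1]
        elif parts[0] == "element":
            current = parts[1]
            info["elements"][current] = int(parts[2])
            info["properties"][current] = []
        elif parts[0] == "property" and current is not None:
            # The property name is always the last token on the line.
            info["properties"][current].append(parts[-1])
    return info
```

Opening the file with `open(path, "rb")` and reading `info["elements"].get("vertex")` would then report the number of Gaussians under the assumption above.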

## Hardware

Single NVIDIA GPU with compute capability ≥ 8.0 and ≥ 30 GB VRAM. Typical end-to-end runtime: **1-2 minutes** per image on A100/H100.
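
A quick pre-flight check can confirm that a machine meets these requirements before building the image. The sketch below is not part of the repository. It treats "30 GB" as 30 GiB, which is an assumption, and the names `meets_requirements` and `check_local_gpu` are hypothetical. The GPU query uses PyTorch's `torch.cuda.get_device_properties`.

```python
from typing import Tuple

MIN_CAPABILITY = (8, 0)  # compute capability floor stated above
MIN_VRAM_GB = 30         # VRAM floor stated above; interpreted as GiB (assumption)


def meets_requirements(capability: Tuple[int, int], total_memory_bytes: int) -> bool:
    """Return True if the given capability and memory satisfy both floors."""
    vram_gb = total_memory_bytes / 1024**3
    return capability >= MIN_CAPABILITY and vram_gb >= MIN_VRAM_GB


def check_local_gpu(index: int = 0) -> bool:
    """Query the local GPU through PyTorch and apply the check above."""
    import torch  # imported lazily so meets_requirements works without PyTorch

    if not torch.cuda.is_available():
        return False
    props = torch.cuda.get_device_properties(index)
    return meets_requirements((props.major, props.minor), props.total_memory)
```

Drivers report slightly less total memory than a card's nominal size, so a GPU sitting right at the threshold may fail this check.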

## Limitations

- Single-object only: when an image contains multiple distinct subjects, the pipeline keeps the largest mask and discards the rest.
- Heavily occluded objects or out-of-distribution subjects (e.g., objects not seen in driving logs) may produce hallucinated geometry.
- Image guard uses `meta-llama/Llama-Guard-3-11B-Vision` — enabling it adds ~20-30 s per run.
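
The largest-mask rule in the first limitation can be illustrated with a short sketch. It assumes "largest" means the greatest foreground pixel count, which this README does not specify. The helper names are hypothetical, and masks are plain lists of 0/1 rows rather than the tensors the pipeline would use.

```python
from typing import List, Sequence

Mask = List[List[int]]  # rows of 0/1 values


def mask_area(mask: Mask) -> int:
    """Count the foreground (nonzero) pixels in a mask."""
    return sum(1 for row in mask for value in row if value)


def select_largest_mask(masks: Sequence[Mask]) -> Mask:
    """Return the candidate mask with the most foreground pixels."""
    if not masks:
        raise ValueError("no candidate masks")
    return max(masks, key=mask_area)
```

Under this reading, a small foreground object can lose to a larger background one, so cropping the upload tightly around the intended subject is the safer choice.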

## Local deployment

```bash
docker build --build-arg HF_TOKEN=$HF_TOKEN -t asset-harvester .
docker run --gpus all -e HF_TOKEN=$HF_TOKEN -p 7860:7860 asset-harvester
```

Checkpoints are downloaded from [`nvidia/asset-harvester`](https://huggingface.co/nvidia/asset-harvester) on first run. `HF_TOKEN` must have access to that repo.
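
To verify that a token can reach the checkpoint repo before starting the container, a small script can fetch it with `huggingface_hub.snapshot_download`. This is a sketch under assumptions: the helper names are hypothetical, and whether the container reuses files fetched this way is not stated in this README, so treat it as an access check only.

```python
import os
from typing import Mapping, Optional

REPO_ID = "nvidia/asset-harvester"  # checkpoint repo named above


def resolve_token(env: Mapping[str, str] = os.environ) -> str:
    """Return the HF_TOKEN value, or raise if it is missing or blank."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before building or running")
    return token


def prefetch_checkpoints(local_dir: Optional[str] = None) -> str:
    """Download the checkpoint repo and return the local snapshot path."""
    from huggingface_hub import snapshot_download  # lazy import; needs huggingface_hub

    return snapshot_download(repo_id=REPO_ID, token=resolve_token(), local_dir=local_dir)
```

If `prefetch_checkpoints()` fails with an authorization error, the token likely lacks access to the repo.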

## Governing terms

Use of this system is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).