---
license: cc-by-nc-4.0
library_name: pytorch
tags:
- 3d-gaussian-splatting
- neural-rendering
- novel-view-synthesis
- transformer
- pytorch_model_hub_mixin
- model_hub_mixin
pipeline_tag: image-to-image
datasets:
- ShapeSplats/Objaverse_Splats
base_model:
- microsoft/renderformer-v1.1-swin-large
---

# GaussianFormer-V10b

GaussianFormer is a proof-of-concept neural renderer that takes a 3D Gaussian
Splatting scene as input and synthesizes novel views without per-scene
optimization at inference time. It adapts
[RenderFormer](https://huggingface.co/microsoft/renderformer-v1.1-swin-large)
(SIGGRAPH 2025), a transformer-based neural renderer designed for triangle
meshes, by replacing its mesh input encoder with a Gaussian-native module that
maps each Gaussian's 14-dimensional parameter vector to one scene token. The
two-stage transformer (view-independent scene encoder + view-dependent ray
decoder) is otherwise unchanged.
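
As a rough illustration of the Gaussian-native input module, the sketch below
projects each 14-d parameter vector to a token with a single linear layer. The
class name `GaussianTokenEncoder`, the token width of 768, and the zeroing of
masked slots are assumptions made for this example; they are not taken from the
released code or checkpoint.

```python
import torch
from torch import nn


class GaussianTokenEncoder(nn.Module):
    """Hypothetical stand-in for the Gaussian input encoder (not the repo's class)."""

    def __init__(self, token_dim: int = 768):  # token_dim is an assumed width
        super().__init__()
        # 14 = pos(3) + scale(3) + quat(4) + rgb(3) + opacity(1)
        self.proj = nn.Linear(14, token_dim)

    def forward(self, gaussians: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # gaussians: [B, N, 14]; mask: [B, N] with True for valid Gaussians
        tokens = self.proj(gaussians)
        # Zero out invalid slots so they contribute nothing downstream.
        return tokens * mask.unsqueeze(-1).to(tokens.dtype)


encoder = GaussianTokenEncoder()
g = torch.randn(2, 1000, 14)
mask = torch.ones(2, 1000, dtype=torch.bool)
mask[1, 500:] = False          # pretend the second scene has only 500 Gaussians
tokens = encoder(g, mask)      # one token per Gaussian: [2, 1000, 768]
```

One token per Gaussian is what ties the token budget *N* directly to the number
of input Gaussians.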

This checkpoint is **V10b epoch 26**: an LPIPS-VGG perceptual fine-tune (loss
weight 0.2) of V9 epoch 60.

**Code & docs:** [github.com/SVLwoof/gaussianformer](https://github.com/SVLwoof/gaussianformer)

## Lineage

- Based on the RenderFormer architecture (`microsoft/renderformer-v1.1-swin-large`),
  with the mesh input encoder replaced by a learned linear projection over
  14-d Gaussian parameters: `[pos(3), scale(3), quat(4), rgb(3), opacity(1)]`.
- **V9 (pretrain).** Single-object 3DGS scans from
  [Objaverse_Splats](https://huggingface.co/datasets/ShapeSplats/Objaverse_Splats)
  (2,667 train / 183 val), pure log-HDR L1 loss, 60 epochs on 3 GPUs.
- **V10b (this checkpoint).** Fine-tuned from V9 ep60 with a combined log-HDR
  L1 + LPIPS-VGG loss on tonemapped LDR output, 26 epochs on 4 GPUs (cosine LR
  5e-5 → 5e-7).
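
The cosine learning-rate decay used for V10b can be sketched as below. The
endpoints (5e-5 and 5e-7) and the 26-epoch length come from the description
above; stepping once per epoch and the absence of warmup are assumptions, and
`cosine_lr` is a name chosen for this example.

```python
import math


def cosine_lr(epoch: int, total_epochs: int = 26,
              lr_max: float = 5e-5, lr_min: float = 5e-7) -> float:
    """Cosine-annealed learning rate at `epoch` (0 = start, total_epochs = end)."""
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint, and end of the fine-tune.
for epoch in (0, 13, 26):
    print(epoch, f"{cosine_lr(epoch):.3e}")
```

Under these assumptions the rate is 5e-5 at epoch 0, about 2.5e-5 at the
midpoint, and 5e-7 at epoch 26.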

## How to use

```python
import h5py
import numpy as np
import torch
from gaussianformer.pipelines.rendering_pipeline import GaussianFormerRenderingPipeline

device = torch.device("cuda")
pipeline = GaussianFormerRenderingPipeline.from_pretrained("shahafvl/gaussianformer-v10b")
pipeline.to(device)


def read_f32(f, key):
    """Read one HDF5 dataset as a float32 torch tensor (on CPU)."""
    return torch.from_numpy(np.array(f[key]).astype(np.float32))


# Load an HDF5 scene with fields: means [N,3], scales [N,3], rotations [N,4]
# (w,x,y,z), colors [N,3], opacities [N,1], c2w [V,4,4], fov [V].
with h5py.File("scene.h5", "r") as f:
    # Concatenate the per-Gaussian parameters into one [1, N, 14] tensor.
    g = torch.cat([
        read_f32(f, "means"),
        read_f32(f, "scales"),
        read_f32(f, "rotations"),
        read_f32(f, "colors"),
        read_f32(f, "opacities").reshape(-1, 1),
    ], dim=-1).unsqueeze(0).to(device)
    c2w = read_f32(f, "c2w").unsqueeze(0).to(device)             # [1, V, 4, 4]
    fov = read_f32(f, "fov")[..., None].unsqueeze(0).to(device)  # [1, V, 1]

# All-True mask over the N Gaussians: [1, N].
mask = torch.ones(g.shape[:2], dtype=torch.bool, device=device)

imgs = pipeline(gaussians=g, mask=mask, c2w=c2w, fov=fov,
                resolution=512, torch_dtype=torch.float16)
# imgs: [1, V, H, W, 3], linear HDR
```

For a complete CLI inference script (with tone mapping + EXR/PNG output) see
[`infer_gaussian.py`](https://github.com/SVLwoof/gaussianformer/blob/main/infer_gaussian.py)
in the project repo.
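
If you only need a quick preview of the linear HDR output, a minimal tone-mapping
sketch is shown below. It clips to [0, 1] and applies the standard sRGB transfer
curve; this is an assumption for illustration, and `infer_gaussian.py` may use a
different operator. `tonemap_to_uint8` is a name chosen for this example.

```python
import numpy as np


def tonemap_to_uint8(hdr: np.ndarray) -> np.ndarray:
    """Clip linear HDR to [0, 1], apply the sRGB transfer curve, quantize to 8 bits."""
    x = np.clip(hdr, 0.0, 1.0)
    srgb = np.where(x <= 0.0031308,
                    12.92 * x,
                    1.055 * np.power(x, 1.0 / 2.4) - 0.055)
    return np.round(srgb * 255.0).astype(np.uint8)


# One pixel with a black, a mid-gray, and an over-range linear value.
hdr = np.array([[[0.0, 0.18, 4.0]]], dtype=np.float32)
ldr = tonemap_to_uint8(hdr)  # shape [1, 1, 3], dtype uint8
```

To apply it to the pipeline output above, convert one view first, for example
`tonemap_to_uint8(imgs[0, 0].float().cpu().numpy())`.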

## Headline results

PSNR in dB (higher is better) against the full-gsplat ground truth on a
real-world object scan (*Tomatoes*), for several token budgets *N* (the number
of input Gaussians):

| Model            | N=5k  | N=10k | N=20k | N=30k |
|------------------|-------|-------|-------|-------|
| V6 (multi-obj)   | 21.78 | 21.95 | 22.54 | 23.00 |
| V9 ep60 (L1)     | 25.90 | 26.94 | 27.74 | 28.24 |
| **V10b ep26**    | 25.80 | 26.70 | 27.41 | 27.81 |

V10b is the **best perceptual** model (LPIPS-trained); V9 ep60 is the best raw
PSNR model. The slight PSNR regression vs V9 (0.10 to 0.43 dB, growing with *N*)
is expected: LPIPS gradients optimize for feature-space similarity, not pixel
fidelity, and the visual sharpness gain is the goal.
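
For reference, PSNR follows the standard definition
`10 * log10(max_val**2 / MSE)`. The sketch below assumes images scaled to
[0, 1]; whether the table above was computed on tonemapped LDR or on HDR values,
and how views were averaged, is not stated here, so treat this as a generic
helper and not the project's evaluation script.

```python
import numpy as np


def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))


target = np.zeros((4, 4, 3), dtype=np.float64)
pred = np.full((4, 4, 3), 0.1, dtype=np.float64)
score = psnr(pred, target)  # MSE = 0.01, so 20 dB
```

On this scale, the 0.43 dB gap between V9 and V10b at N=30k corresponds to
roughly a 10% higher mean squared error.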

## Limitations

- The transformer's self-attention is O(N²), so input is capped at N ≤ 30k
  Gaussians per scene via a LightGaussian-style importance score. At N = 30k a
  dense attention matrix already has 9 × 10⁸ entries per head. Higher fidelity
  needs an attention scheme that scales beyond this.
- Trained on **isolated single objects** (Objaverse_Splats). Multi-object scenes,
  large-scale captures, and unbounded backgrounds degrade significantly.
- The model is a proof of concept. It still trails standard rasterized 3DGS in
  output quality on the Tomatoes evaluation.
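
To make the N ≤ 30k cap concrete, the sketch below prunes an `[N, 14]` Gaussian
array to a fixed budget. It is a simplified proxy, not the LightGaussian score:
it ranks by opacity times ellipsoid volume and ignores per-view visibility. It
also assumes scales are stored as positive linear values and opacity lies in
[0, 1]. `prune_gaussians` is a name chosen for this example.

```python
import numpy as np


def prune_gaussians(gaussians: np.ndarray, budget: int = 30_000) -> np.ndarray:
    """Keep the `budget` highest-scoring rows of an [N, 14] Gaussian array."""
    if gaussians.shape[0] <= budget:
        return gaussians
    scales = gaussians[:, 3:6]   # layout: pos(3), scale(3), quat(4), rgb(3), opacity(1)
    opacity = gaussians[:, 13]
    # Proxy importance: opacity times volume (assumes linear, positive scales).
    score = opacity * np.prod(scales, axis=1)
    keep = np.argsort(-score)[:budget]
    return gaussians[np.sort(keep)]  # preserve the original ordering


# Toy scene: 100 unit-scale Gaussians with increasing opacity.
scene = np.ones((100, 14), dtype=np.float32)
scene[:, 13] = np.linspace(0.01, 1.0, 100)
pruned = prune_gaussians(scene, budget=10)  # keeps the 10 most opaque rows
```

The pruned array can be concatenated and batched exactly like the full scene in
the usage example.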

## Training data license

Trained on the [Objaverse_Splats](https://huggingface.co/datasets/ShapeSplats/Objaverse_Splats)
subset of [Objaverse](https://objaverse.allenai.org/), whose terms restrict
commercial use. The CC-BY-NC-4.0 license on this checkpoint inherits that
restriction, so the checkpoint is for research and other non-commercial use only.

## Citation

GaussianFormer is built on RenderFormer; if you use this checkpoint in academic
work, please cite the original RenderFormer paper:

```bibtex
@inproceedings{zeng2025renderformer,
  title     = {RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination},
  author    = {Chong Zeng and Yue Dong and Pieter Peers and Hongzhi Wu and Xin Tong},
  booktitle = {ACM SIGGRAPH 2025 Conference Papers},
  year      = {2025}
}
```

The GaussianFormer adaptation, training pipeline, and this checkpoint are
documented at [github.com/SVLwoof/gaussianformer](https://github.com/SVLwoof/gaussianformer).