---
license: cc-by-nc-4.0
library_name: pytorch
tags:
- 3d-gaussian-splatting
- neural-rendering
- novel-view-synthesis
- transformer
- pytorch_model_hub_mixin
- model_hub_mixin
pipeline_tag: image-to-image
datasets:
- ShapeSplats/Objaverse_Splats
base_model:
- microsoft/renderformer-v1.1-swin-large
---

# GaussianFormer-V10b

GaussianFormer is a proof-of-concept neural renderer that takes a 3D Gaussian Splatting scene as input and synthesizes novel views without per-scene optimization at inference time. It adapts [RenderFormer](https://huggingface.co/microsoft/renderformer-v1.1-swin-large) (SIGGRAPH 2025), a transformer-based neural renderer designed for triangle meshes, by replacing its mesh input encoder with a Gaussian-native module that maps each Gaussian's 14-dimensional parameter vector to a scene token. The two-stage transformer (view-independent scene encoder + view-dependent ray decoder) is otherwise unchanged.

This checkpoint is **V10b epoch 26**: an LPIPS-VGG perceptual fine-tune (loss weight 0.2) of V9 epoch 60.

**Code & docs:** [github.com/SVLwoof/gaussianformer](https://github.com/SVLwoof/gaussianformer)

## Lineage

- Built on the RenderFormer architecture (`microsoft/renderformer-v1.1-swin-large`), with the mesh input encoder replaced by a learned linear projection over 14-d Gaussian parameters: `[pos(3), scale(3), quat(4), rgb(3), opacity(1)]`.
- **V9 (pretrain).** Single-object 3DGS scans from [Objaverse_Splats](https://huggingface.co/datasets/ShapeSplats/Objaverse_Splats) (2,667 train / 183 val), pure log-HDR L1 loss, 60 epochs on 3 GPUs.
- **V10b (this checkpoint).** Fine-tuned from V9 ep60 with a combined log-HDR L1 + LPIPS-VGG loss on tonemapped LDR output, 26 epochs on 4 GPUs (cosine LR 5e-5 → 5e-7).
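
To make the input encoding above concrete, here is a minimal sketch of a linear projection from 14-d Gaussian parameter vectors to scene tokens. The class name `GaussianTokenEmbed`, the token width `d_model=768`, and the zeroing of padded slots are assumptions made for illustration; they are not taken from the repository, whose actual module may differ.

```python
import torch
import torch.nn as nn


class GaussianTokenEmbed(nn.Module):
    """Hypothetical stand-in for the Gaussian-native input encoder."""

    def __init__(self, d_model: int = 768):  # d_model is an assumed width
        super().__init__()
        # 14 = pos(3) + scale(3) + quat(4) + rgb(3) + opacity(1)
        self.proj = nn.Linear(14, d_model)

    def forward(self, gaussians: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # gaussians: [B, N, 14]; mask: [B, N] bool, True for valid Gaussians
        tokens = self.proj(gaussians)
        # Zero out padded slots so they carry no signal (an assumed convention)
        return tokens * mask.unsqueeze(-1).to(tokens.dtype)


embed = GaussianTokenEmbed(d_model=768)
g = torch.randn(2, 1000, 14)                  # two scenes, 1000 Gaussians each
mask = torch.ones(2, 1000, dtype=torch.bool)
mask[1, 500:] = False                         # second scene is padded past 500
tokens = embed(g, mask)
print(tokens.shape)  # torch.Size([2, 1000, 768])
```

Each Gaussian becomes one token, so the sequence length seen by the scene encoder equals the number of Gaussians kept per scene.
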
## How to use

```python
import h5py
import numpy as np
import torch

from gaussianformer.pipelines.rendering_pipeline import GaussianFormerRenderingPipeline

device = torch.device("cuda")
pipeline = GaussianFormerRenderingPipeline.from_pretrained("shahafvl/gaussianformer-v10b")
pipeline.to(device)

# Load an HDF5 scene with fields: means [N,3], scales [N,3], rotations [N,4]
# (w,x,y,z), colors [N,3], opacities [N,1], c2w [V,4,4], fov [V].
with h5py.File("scene.h5", "r") as f:
    # Concatenate per-Gaussian parameters into one [1, N, 14] tensor.
    g = torch.cat([
        torch.from_numpy(np.array(f["means"]).astype(np.float32)),
        torch.from_numpy(np.array(f["scales"]).astype(np.float32)),
        torch.from_numpy(np.array(f["rotations"]).astype(np.float32)),
        torch.from_numpy(np.array(f["colors"]).astype(np.float32)),
        torch.from_numpy(np.array(f["opacities"]).astype(np.float32)).reshape(-1, 1),
    ], dim=-1).unsqueeze(0).to(device)
    # All Gaussians are valid, so the mask is all True: [1, N].
    mask = torch.ones(g.shape[:2], dtype=torch.bool, device=device)
    # Camera-to-world matrices [1, V, 4, 4] and fields of view [1, V, 1].
    c2w = torch.from_numpy(np.array(f["c2w"]).astype(np.float32)).unsqueeze(0).to(device)
    fov = torch.from_numpy(np.array(f["fov"]).astype(np.float32))[..., None].unsqueeze(0).to(device)

imgs = pipeline(gaussians=g, mask=mask, c2w=c2w, fov=fov, resolution=512, torch_dtype=torch.float16)
# imgs: [1, V, H, W, 3], linear HDR
```

For a complete CLI inference script (with tone mapping + EXR/PNG output) see [`infer_gaussian.py`](https://github.com/SVLwoof/gaussianformer/blob/main/infer_gaussian.py) in the project repo.

## Headline results

PSNR (dB, vs full-gsplat ground truth) on a real-world object scan (*Tomatoes*) across token budgets *N*:

| Model            | N=5k  | N=10k | N=20k | N=30k |
|------------------|-------|-------|-------|-------|
| V6 (multi-obj)   | 21.78 | 21.95 | 22.54 | 23.00 |
| V9 ep60 (L1)     | 25.90 | 26.94 | 27.74 | 28.24 |
| **V10b ep26**    | 25.80 | 26.70 | 27.41 | 27.81 |

V10b is the **best perceptual** model (LPIPS-trained); V9 ep60 is the best raw PSNR model.
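
For readers comparing against these numbers, the following is a minimal sketch of the standard PSNR definition. It assumes both images are already scaled to `[0, 1]`; this card does not state whether the reported values were computed on tonemapped LDR or linear HDR output, so treat the choice of `max_val` as an assumption rather than the evaluation protocol used here.

```python
import numpy as np


def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """PSNR in dB between two images with values in [0, max_val]."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))


# Toy check: a uniform error of 0.1 gives MSE = 0.01, i.e. 20 dB.
target = np.zeros((4, 4, 3))
pred = np.full((4, 4, 3), 0.1)
print(round(psnr(pred, target), 2))  # 20.0
```

Because PSNR is logarithmic in the mean squared error, the 0.1 to 0.4 dB gaps between V9 and V10b in the table correspond to small differences in pixel-level error.
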
The slight PSNR regression vs V9 is expected: LPIPS gradients optimize for feature-space similarity rather than pixel fidelity, and the goal of this fine-tune is sharper-looking output.

## Limitations

- The transformer's self-attention is O(N²), so input is capped to N≤30k Gaussians per scene via a LightGaussian-style importance score. Higher fidelity needs an attention scheme that scales beyond this.
- Trained on **isolated single objects** (Objaverse_Splats). Multi-object scenes, large-scale captures, and unbounded backgrounds degrade significantly.
- The model is a proof of concept. It still trails standard rasterized 3DGS in output quality on the Tomatoes evaluation.

## Training data license

Trained on the [Objaverse_Splats](https://huggingface.co/datasets/ShapeSplats/Objaverse_Splats) subset of [Objaverse](https://objaverse.allenai.org/), whose terms restrict commercial use. The CC-BY-NC-4.0 license on this checkpoint inherits that restriction. This checkpoint is for research and non-commercial use only.

## Citation

GaussianFormer is built on RenderFormer; if you use this checkpoint in academic work, please cite the original RenderFormer paper:

```bibtex
@inproceedings{zeng2025renderformer,
  title     = {RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination},
  author    = {Chong Zeng and Yue Dong and Pieter Peers and Hongzhi Wu and Xin Tong},
  booktitle = {ACM SIGGRAPH 2025 Conference Papers},
  year      = {2025}
}
```

The GaussianFormer adaptation, training pipeline, and this checkpoint are documented at [github.com/SVLwoof/gaussianformer](https://github.com/SVLwoof/gaussianformer).