---
license: mit
tags:
- 3d-localization
- multi-view
- segmentation
- language-grounding
- pose-free
pipeline_tag: image-segmentation
---

# TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization

- **Paper:** [arXiv:2603.08096](https://arxiv.org/abs/2603.08096)
- **Project Page:** [cwru-aism.github.io/triangulang](https://cwru-aism.github.io/triangulang/)
- **Code:** [github.com/bryceag11/triangulang](https://github.com/bryceag11/triangulang)
- **Training Data & Caches:** [huggingface.co/datasets/bag100/triangulang-scannetpp-cache](https://huggingface.co/datasets/bag100/triangulang-scannetpp-cache)

*Bryce Grant, Aryeh Rothenberg, Atri Banerjee, Peng Wang*
*Case Western Reserve University*

## Overview

TrianguLang is a feed-forward, pose-free method for language-guided 3D localization from multi-view images. Given a set of unposed images and a text query, it predicts a segmentation mask for each view together with camera-relative 3D locations, running at ~18 FPS for 5 classes.

## Checkpoints

| Checkpoint | Description |
|---|---|
| `v10/best.pt` | Single-object (text + spatial), 230 scenes, 100 epochs |
| `mo_v11/best.pt` | Multi-object (text + spatial), 230 scenes, 100 epochs |

## Architecture

- **Frozen:** SAM3 (841M) + DA3-NESTED-GIANT-LARGE (1.69B) = ~2.5B params
- **Trainable:** GASA Decoder (~13.5M params)
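
To put the split above in perspective, the short sketch below adds up the listed parameter counts and reports what share of the full model is actually trained. The counts are taken from the list above and are approximate; the dictionary layout and the `parameter_budget` helper are illustrative names, not part of the TrianguLang codebase.

```python
# Approximate parameter counts as listed in this model card.
FROZEN_PARAMS = {
    "SAM3": 841e6,
    "DA3-NESTED-GIANT-LARGE": 1.69e9,
}
TRAINABLE_PARAMS = {
    "GASA Decoder": 13.5e6,
}


def parameter_budget(frozen, trainable):
    """Summarize frozen vs. trainable parameter totals (hypothetical helper)."""
    frozen_total = sum(frozen.values())
    trainable_total = sum(trainable.values())
    total = frozen_total + trainable_total
    return {
        "frozen": frozen_total,
        "trainable": trainable_total,
        "total": total,
        "trainable_fraction": trainable_total / total,
    }


budget = parameter_budget(FROZEN_PARAMS, TRAINABLE_PARAMS)
print(f"frozen:    {budget['frozen'] / 1e9:.2f}B")
print(f"trainable: {budget['trainable'] / 1e6:.1f}M")
print(f"trainable share of all parameters: {budget['trainable_fraction']:.2%}")
```

With these counts, the trainable decoder accounts for roughly half a percent of all parameters, so training only has to store gradients and optimizer state for that small part.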

## Results

### Single-Object (text-only)

| Benchmark | Setting | mIoU | mAcc / Loc. Acc. |
|---|---|---|---|
| **ScanNet++** | In-domain | **62.4%** | 77.4% mAcc |
| **uCO3D** | In-domain | **94.6%** | 98.3% mAcc |
| **uCO3D** | Cross-domain (ScanNet++ &rarr; uCO3D) | **75.7%** | 79.6% mAcc |
| **LERF-OVS** | Zero-shot (no LERF training) | **59.2%** | 89.1% Loc. Acc. |
| **NVOS** | Zero-shot | **93.5%** | — |
| **SPIn-NeRF** | Zero-shot | **91.4%** | — |
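
For readers unfamiliar with the metric, the sketch below shows the usual way an IoU score is computed for a pair of binary masks and then averaged. It assumes the standard intersection-over-union definition; how the paper averages scores (per query, per class, or per scene) is not stated in this card, and the function names are illustrative only.

```python
def mask_iou(pred, target):
    """IoU between two binary masks given as equally sized 2D lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def mean_iou(pairs):
    """Average IoU over an iterable of (prediction, ground truth) mask pairs."""
    scores = [mask_iou(pred, target) for pred, target in pairs]
    if not scores:
        raise ValueError("mean_iou needs at least one mask pair")
    return sum(scores) / len(scores)


# Toy example with two 2x3 masks (made-up data, not from any benchmark).
pred_a = [[1, 1, 0], [0, 1, 0]]
gt_a = [[1, 0, 0], [0, 1, 1]]
pred_b = [[0, 1, 1], [0, 0, 0]]
gt_b = [[0, 1, 1], [0, 0, 0]]

print(mask_iou(pred_a, gt_a))                        # 2 shared pixels / 4 in the union = 0.5
print(mean_iou([(pred_a, gt_a), (pred_b, gt_b)]))    # (0.5 + 1.0) / 2 = 0.75
```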

### Multi-Object (text-only, ScanNet++)

| Setting | mIoU | mAcc |
|---|---|---|
| Text-only (multi-object) | **65.2%** | **79.1%** |

### LERF-OVS Per-Scene (zero-shot)

| Method | Ramen | Teatime | Kitchen | Figurines | Overall mIoU | Overall Loc. Acc. |
|---|---|---|---|---|---|---|
| LERF | 28.2 | 45.0 | 37.9 | 38.6 | 37.4 | 73.6 |
| LangSplat | 51.2 | 65.1 | 44.5 | 44.7 | 51.4 | 84.3 |
| LangSplat-V2 | **51.8** | **72.2** | 59.1 | 56.4 | **59.9** | 84.1 |
| **TrianguLang** | 51.1 | 58.9 | **62.4** | **62.1** | 59.2 | **89.1** |

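*Note: The Ramen, Teatime, Kitchen, and Figurines columns report mIoU; all values are percentages. Per-scene methods (LERF, LangSplat) require calibrated poses and 10-45 min of optimization per scene, whereas TrianguLang runs feed-forward in ~58 ms.*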
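
The timing comparison in the note can be made concrete with a little arithmetic. The sketch below converts a latency into frames per second and estimates how many feed-forward passes fit into a per-scene optimization budget. It assumes the ~58 ms figure is the latency of one forward pass; the function names and the `passes` parameter are hypothetical and only serve the illustration.

```python
def fps_from_latency_ms(latency_ms):
    """Convert a per-pass latency in milliseconds into passes per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


def speedup(per_scene_minutes, feed_forward_ms, passes=1):
    """Ratio of per-scene optimization time to the time for `passes` forward passes."""
    if passes < 1:
        raise ValueError("passes must be at least 1")
    optimization_ms = per_scene_minutes * 60_000.0
    return optimization_ms / (feed_forward_ms * passes)


LATENCY_MS = 58.0  # figure quoted in the note above

print(f"{fps_from_latency_ms(LATENCY_MS):.1f} passes per second")
for minutes in (10, 45):
    ratio = speedup(minutes, LATENCY_MS)
    print(f"{minutes} min of optimization is about {ratio:,.0f} forward passes")
```

A 58 ms pass corresponds to about 17 passes per second, close to the ~18 FPS quoted in the Overview. The ratio compares optimization time with single forward passes, so a scene that needs many passes should set `passes` accordingly.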

## Citation

```bibtex
@article{grant2026triangulang,
  title={TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization},
  author={Grant, Bryce and Rothenberg, Aryeh and Banerjee, Atri and Wang, Peng},
  journal={arXiv preprint arXiv:2603.08096},
  year={2026}
}
```