---
license: mit
tags:
- 3d-localization
- multi-view
- segmentation
- language-grounding
- pose-free
pipeline_tag: image-segmentation
---

# TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization

**Paper:** [arXiv:2603.08096](https://arxiv.org/abs/2603.08096)

**Project Page:** [cwru-aism.github.io/triangulang](https://cwru-aism.github.io/triangulang/)

**Code:** [github.com/bryceag11/triangulang](https://github.com/bryceag11/triangulang)

**Training Data & Caches:** [huggingface.co/datasets/bag100/triangulang-scannetpp-cache](https://huggingface.co/datasets/bag100/triangulang-scannetpp-cache)

*Bryce Grant, Aryeh Rothenberg, Atri Banerjee, Peng Wang*

*Case Western Reserve University*

## Overview

TrianguLang is a feed-forward, pose-free method for language-guided 3D localization from multi-view images. Given unposed images and a text query, it produces per-view segmentation masks and camera-relative 3D locations at ~18 FPS for 5 classes.

## Checkpoints

| Checkpoint | Description |
|---|---|
| `v10/best.pt` | Single-object (text + spatial), 230 scenes, 100 epochs |
| `mo_v11/best.pt` | Multi-object (text + spatial), 230 scenes, 100 epochs |

## Architecture

- **Frozen:** SAM3 (841M) + DA3-NESTED-GIANT-LARGE (1.69B) = ~2.5B params
- **Trainable:** GASA Decoder (~13.5M params)

## Results

### Single-Object (text-only)

| Benchmark | Setting | mIoU | mAcc / Loc. Acc. |
|---|---|---|---|
| **ScanNet++** | In-domain | **62.4%** | 77.4% mAcc |
| **uCO3D** | In-domain | **94.6%** | 98.3% mAcc |
| **uCO3D** | Cross-domain (ScanNet++ → uCO3D) | **75.7%** | 79.6% mAcc |
| **LERF-OVS** | Zero-shot (no LERF training) | **59.2%** | 89.1% Loc. Acc. |
| **NVOS** | Zero-shot | **93.5%** | — |
| **SPIn-NeRF** | Zero-shot | **91.4%** | — |

### Multi-Object (text-only, ScanNet++)

| Setting | mIoU | mAcc |
|---|---|---|
| Text-only (multi-object) | **65.2%** | **79.1%** |

### LERF-OVS Per-Scene (zero-shot)

| Method | Ramen | Teatime | Kitchen | Figurines | Overall mIoU | Overall Loc. Acc. |
|---|---|---|---|---|---|---|
| LERF | 28.2 | 45.0 | 37.9 | 38.6 | 37.4 | 73.6 |
| LangSplat | 51.2 | 65.1 | 44.5 | 44.7 | 51.4 | 84.3 |
| LangSplat-V2 | **51.8** | **72.2** | 59.1 | 56.4 | **59.9** | 84.1 |
| **TrianguLang** | 51.1 | 58.9 | **62.4** | **62.1** | 59.2 | **89.1** |

*Note: Per-scene methods (LERF, LangSplat) require calibrated poses and 10-45 min per-scene optimization. TrianguLang runs feed-forward in ~58ms.*

## Citation

```bibtex
@article{grant2026triangulang,
  title={TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization},
  author={Grant, Bryce and Rothenberg, Aryeh and Banerjee, Atri and Wang, Peng},
  journal={arXiv preprint arXiv:2603.08096},
  year={2026}
}
```
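
For readers comparing their own predictions against the mIoU figures in the Results tables, the following is a minimal sketch of the standard mask IoU and its mean over a set of prediction/ground-truth pairs. It is not taken from the TrianguLang code base. The function names `mask_iou` and `mean_iou`, the nested-list mask representation, and the convention that two empty masks score 1.0 are all assumptions made for illustration. The benchmark's actual averaging protocol (for example per query versus per class) is not specified in this card and may differ.

```python
from typing import Iterable, Sequence, Tuple

# Hypothetical representation: a binary mask as rows of 0/1 values.
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two same-shaped binary masks."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: if both masks are empty, treat them as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Unweighted mean of per-pair IoU values (one possible mIoU definition)."""
    ious = [mask_iou(pred, target) for pred, target in pairs]
    if not ious:
        raise ValueError("mean_iou needs at least one (pred, target) pair")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    pairs = [
        ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # 1 overlapping pixel, 2 in union
        ([[1, 1], [1, 1]], [[1, 1], [1, 1]]),  # identical masks
    ]
    print(f"mIoU = {mean_iou(pairs):.2%}")
```

Multiplying the returned fraction by 100 gives a percentage in the same units as the tables above.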