---
license: cc-by-nc-4.0
base_model:
- Qwen/Qwen3-VL-235B-A22B-Thinking
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
- action
- pytorch
- computer use
- gui agents
---

# **Holo2: Foundational Models for Navigation and Computer Use Agents**

[![GitHub](https://img.shields.io/badge/Holo2_Cookbook-100000?style=for-the-badge&logo=github&logoColor=white)](https://github.com/hcompai/hai-cookbook/tree/main/holo2)

## **Model Description**

**Holo2** represents the next major step in developing large-scale Vision-Language Models (VLMs) for **multi-domain GUI Agents**. These agents can operate real digital environments (web, desktop, and mobile) by interpreting interfaces, reasoning over content, and executing actions.

Our **Holo2** family emphasizes **navigation and task execution** across diverse real and simulated environments, extending beyond static perception to **multi-step, goal-directed behavior**. It builds upon the strengths of **Holo1.5** in UI localization and screen content understanding, with major improvements in **policy learning**, **action grounding**, and **cross-environment generalization**.

The **Holo2** series comes in four model sizes:

- **Holo2-4B:** fully open under Apache 2.0
- **Holo2-8B:** fully open under Apache 2.0
- **Holo2-30B-A3B:** research-only license (non-commercial). For commercial use, please contact us.
- **Holo2-235B-A22B:** research-only license (non-commercial). For commercial use, please contact us.

These models are designed to provide reliable, accurate, and efficient foundations for next-generation computer-use (CU) agents, such as Surfer-H.

- **Developed by:** [**H Company**](https://www.hcompany.ai/)
- **Model type:** Vision-Language Model for Navigation and Computer Use Agents
- **Fine-tuned from model:** Qwen/Qwen3-VL-235B-A22B-Thinking
- **Blog Post:** https://hcompany.ai/holo2-235b-a22b-preview
- **License:** CC BY-NC 4.0 (research-only, non-commercial). For commercial use, please contact us.

## Get Started with the Model

Please have a look at the [cookbook](https://github.com/hcompai/hai-cookbook/tree/main/holo2) in our repo, where we provide examples for both self-hosting and API use.

## **Training Strategy**

Our models are trained using high-quality proprietary data for UI understanding and action prediction, following a multi-stage training pipeline. The training dataset is a carefully curated mix of open-source datasets, large-scale synthetic data, and human-annotated samples.

Training proceeds in two stages: large-scale supervised fine-tuning, followed by online reinforcement learning (GRPO), yielding SOTA performance in interpreting UIs and performing actions on large, complex screens.

## **Agentic Localization**

High-resolution 4K interfaces are challenging for localization models because small UI elements are difficult to pinpoint on a large display. With agentic localization, Holo2 can iteratively refine its predictions, improving accuracy with each step and unlocking roughly 10-20% relative gains on ScreenSpot-Pro across all Holo2 model sizes.

Holo2-235B-A22B reaches 70.6% accuracy on ScreenSpot-Pro in a single step. Within 3 steps, it achieves 78.5%, setting a new state-of-the-art on the most challenging GUI grounding benchmark.

## **Results**

### **Holo2: SOTA UI Localization**

UI localization measures how precisely an agent can locate the on-screen elements (buttons, inputs, links) needed for accurate interaction. Holo2 continues to set new standards for localization accuracy across web, OS, and mobile benchmarks.

| Model                       | ScreenSpot-Pro | OSWorld-G | Showdown  | Ground-UI-1K | WebClick-v1 | ScreenSpot-v2 | Average   |
|-----------------------------|----------------|-----------|-----------|--------------|-------------|---------------|-----------|
| Holo2-235B-A22B (Agentic)   | **78.5%**      | -         | -         | -            | -           | -             | -         |
| Holo2-235B-A22B             | 70.6%          | **79.0%** | **80.4%** | **85.5%**    | **94.3%**   | **95.9%**     | **84.28** |
| Holo2-30B-A3B (Agentic)     | 75.2%          | -         | -         | -            | -           | -             | -         |
| Holo2-30B-A3B               | 66.1%          | 76.1%     | 77.6%     | **85.5%**    | 91.3%       | 94.9%         | 81.90     |
| Holo2-8B (Agentic)          | 71.4%          | -         | -         | -            | -           | -             | -         |
| Holo2-8B                    | 58.9%          | 70.1%     | 72.5%     | 83.8%        | 89.5%       | 93.2%         | 78.00     |
| Holo2-4B (Agentic)          | 68.6%          | -         | -         | -            | -           | -             | -         |
| Holo2-4B                    | 57.2%          | 69.4%     | 74.7%     | 83.3%        | 88.8%       | 93.2%         | 77.77     |
| Qwen3-VL-235B-A22B-Thinking | 61.8%          | 68.3%     | 78.4%     | 85.2%        | 92.1%       | 95.4%         | 80.20     |
| Qwen3-VL-30B-A3B-Thinking   | 49.9%          | 65.8%     | 71.2%     | 84.2%        | 89.5%       | 91.8%         | 75.40     |
| Qwen3-VL-8B-Thinking        | 38.5%          | 56.0%     | 64.2%     | 83.6%        | 85.9%       | 91.5%         | 69.95     |
| Qwen3-VL-4B-Thinking        | 41.4%          | 56.4%     | 66.6%     | 84.1%        | 85.8%       | 90.0%         | 70.72     |
| MAI-UI-32B (+Zoom-In)       | 73.5%          | 70.9%     | -         | -            | -           | -             | -         |
| MAI-UI-32B                  | 67.9%          | 67.6%     | -         | -            | -           | -             | -         |
| MAI-UI-8B (+Zoom-In)        | 70.9%          | 64.2%     | -         | -            | -           | -             | -         |
| MAI-UI-8B                   | 65.8%          | 60.1%     | -         | -            | -           | -             | -         |
| MAI-UI-2B (+Zoom-In)        | 62.8%          | 55.9%     | -         | -            | -           | -             | -         |
| MAI-UI-2B                   | 57.4%          | 52.0%     | -         | -            | -           | -             | -         |

Table 1: Localization benchmark scores for leading models.
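
The relative gain from agentic localization can be checked directly against the ScreenSpot-Pro column of Table 1. The short Python sketch below does this arithmetic. The scores are copied from the table, while the names `SCREENSPOT_PRO`, `relative_gain`, and `gains_table` are illustrative helpers and not part of any Holo2 tooling.

```python
# Single-step vs. agentic ScreenSpot-Pro accuracy (%), copied from Table 1.
# Each entry is (single_step, agentic).
SCREENSPOT_PRO = {
    "Holo2-235B-A22B": (70.6, 78.5),
    "Holo2-30B-A3B": (66.1, 75.2),
    "Holo2-8B": (58.9, 71.4),
    "Holo2-4B": (57.2, 68.6),
}


def relative_gain(single_step: float, agentic: float) -> float:
    """Relative improvement of agentic over single-step accuracy, in percent."""
    return (agentic - single_step) / single_step * 100.0


def gains_table(scores: dict) -> dict:
    """Map each model name to its relative gain, rounded to one decimal place."""
    return {
        name: round(relative_gain(single, agentic), 1)
        for name, (single, agentic) in scores.items()
    }


if __name__ == "__main__":
    for name, gain in gains_table(SCREENSPOT_PRO).items():
        single, agentic = SCREENSPOT_PRO[name]
        print(f"{name}: {single:.1f}% -> {agentic:.1f}% (+{gain:.1f}% relative)")
```

With these inputs the relative gains come out at about 11.2% (235B-A22B), 13.8% (30B-A3B), 21.2% (8B), and 19.9% (4B). The smaller models benefit the most from the extra refinement steps.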

Figure: Performance of Holo2 models on the ScreenSpot-Pro benchmark.
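
This card does not specify how agentic localization refines a prediction between steps. One plausible reading is a zoom-in loop: predict a click point, crop a smaller region around it, and predict again on the crop. The Python sketch below illustrates only that assumed loop. `refine_click`, the `Predictor` callable, and the `zoom` factor are hypothetical names introduced here, not Holo2 APIs; in practice the predictor would render the crop and query the model as shown in the cookbook.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in screen pixels
Point = Tuple[float, float]

# Hypothetical interface: given a crop region, return a click point normalized
# to that crop (both coordinates in 0..1). A real predictor would crop the
# screenshot and call the model; that part is deliberately left out here.
Predictor = Callable[[Box], Point]


def refine_click(
    predict: Predictor,
    screen_size: Tuple[int, int],
    steps: int = 3,
    zoom: float = 0.5,
) -> Point:
    """Assumed zoom-in refinement: re-predict on a shrinking crop around the last guess."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    width, height = float(screen_size[0]), float(screen_size[1])
    region: Box = (0.0, 0.0, width, height)
    point: Point = (width / 2, height / 2)

    for _ in range(steps):
        left, top, right, bottom = region
        nx, ny = predict(region)
        # Map the crop-relative prediction back to absolute screen pixels.
        x = left + nx * (right - left)
        y = top + ny * (bottom - top)
        point = (x, y)

        # Next crop: `zoom` times the current size, centered on the new guess
        # and clamped so that it stays inside the screen.
        new_w = (right - left) * zoom
        new_h = (bottom - top) * zoom
        new_left = min(max(x - new_w / 2, 0.0), width - new_w)
        new_top = min(max(y - new_h / 2, 0.0), height - new_h)
        region = (new_left, new_top, new_left + new_w, new_top + new_h)

    return point


if __name__ == "__main__":
    target = (3000.0, 500.0)

    def fake_predict(region: Box) -> Point:
        # Stand-in for the model: finds the target with a fixed error that is
        # proportional to the crop size, so zooming in shrinks the pixel error.
        left, top, right, bottom = region
        return (
            (target[0] - left) / (right - left) + 0.02,
            (target[1] - top) / (bottom - top) + 0.02,
        )

    for n in (1, 2, 3):
        print(n, refine_click(fake_predict, (3840, 2160), steps=n))
```

The stand-in predictor makes the point of the design visible: if localization error scales with the size of the image the model sees, each zoom step reduces the error in absolute pixels. Whether Holo2 follows this scheme or another one is not stated here.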

---

## Citation

```bibtex
@misc{hai2025holo2modelfamily,
  title={Holo2 - Open Foundation Models for Navigation and Computer Use Agents},
  author={H Company},
  year={2025},
  url={https://huggingface.co/collections/Hcompany/holo2},
}
```