---
license: cc-by-nc-4.0
base_model:
- Qwen/Qwen3-VL-235B-A22B-Thinking
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
- action
- pytorch
- computer use
- gui agents
---

# **Holo2: Foundational Models for Navigation and Computer Use Agents**


[![GitHub](https://img.shields.io/badge/Holo2_Cookbook-100000?style=for-the-badge&logo=github&logoColor=white)](https://github.com/hcompai/hai-cookbook/tree/main/holo2)

## **Model Description**

**Holo2** represents the next major step in developing large-scale Vision-Language Models (VLMs) for **multi-domain GUI Agents**.
These agents can operate real digital environments specifically web, desktop, and mobile by interpreting interfaces, reasoning over content, and executing actions.

Our **Holo2** family emphasizes **navigation and task execution** across diverse real and simulated environments, extending beyond static perception to **multi-step, goal-directed behavior**.  

It builds upon the strengths of **Holo1.5** in UI localization and screen content understanding, with major improvements in **policy learning**, **action grounding**, and **cross-environment generalization**.

The **Holo2** series comes in three model sizes:

- **Holo2-4B:** fully open under Apache 2.0
- **Holo2-8B:** fully open under Apache 2.0
- **Holo2-30B-A3B:** research-only license (non-commercial). For commercial use, please contact us.
- **Holo2-235B-A22B:** research-only license (non-commercial). For commercial use, please contact us.

These models are designed to provide reliable, accurate, and efficient foundations for next-generation CU agents, like Surfer-H.

- **Developed by:** [**H Company**](https://www.hcompany.ai/)
- **Model type:** Vision-Language Model for Navigation and Computer Use Agents
- **Fine-tuned from model:** Qwen/Qwen3-VL-235B-A22B-Thinking
- **Blog Post:** https://hcompany.ai/holo2-235b-a22b-preview
- **License:** Apache 2.0 License


## Get Started with the Model

Please have a look at the [cookbook](https://github.com/hcompai/hai-cookbook/tree/main/holo2) in our repo where we provide examples for both self-hosting and API use!


## **Training Strategy**

Our models are trained using high-quality proprietary data for UI understanding and action prediction, following a multi-stage training pipeline. The training dataset is a carefully curated mix of open-source datasets, large-scale synthetic data, and human-annotated samples. Training proceeds in two stages: large-scale supervised fine-tuning, followed by online reinforcement learning (GRPO) yielding SOTA performance in interpreting UIs and performing actions on large, complex screens.


## **Agentic Localization**

High-resolution 4K interfaces are challenging for localization models. Small UI elements can be difficult to pinpoint on a large display. With agentic localization, Holo2 can iteratively refine its predictions, improving accuracy with each step and unlocking 10-20% relative gains across all Holo2 model sizes.

Holo2-235B-A22B reaches 70.6% accuracy on ScreenSpot-Pro in a single step. Within 3 steps, it achieves 78.5%, setting a new state-of-the-art on the most challenging GUI grounding benchmark.

## **Results**

### **Holo2: SOTA UI Localization**

UI Localization measures how precisely an agent can locate on-screen elements—buttons, inputs, links—necessary for accurate interaction.  
Holo2 continues to set new standards for localization accuracy across web, OS, and mobile benchmarks.

<div align="center">

|                           | ScreenSpot-Pro | OSWorld-G | Showdown | Ground-UI-1K | WebClick-v1 | ScreenSpot-v2 | Average  |
|---------------------------|----------------|-----------|----------|--------------|-------------|---------------|----------|
| Holo2-235B-A22B (Agentic) | **78.5%**      | -         | -        | -            | -           | -             | -        |
| Holo2-235B-A22B           | 70.6%          | **79.0%** | **80.4%**| **85.5%**    | **94.3%**   | **95.9%**     | **84.28**|
| Holo2-30B-A3B (Agentic)   | 75.2%          | -         | -        | -            | -           | -             | -        |
| Holo2-30B-A3B             | 66.1%          | 76.1%     | 77.6%    | **85.5%**    | 91.3%       | 94.9%         | 81.90    |
| Holo2-8B (Agentic)        | 71.4%          | -         | -        | -            | -           | -             | -        |
| Holo2-8B                  | 58.9%          | 70.1%     | 72.5%    | 83.8%        | 89.5%       | 93.2%         | 78.00    |
| Holo2-4B (Agentic)        | 68.6%          | -         | -        | -            | -           | -             | -        |
| Holo2-4B                  | 57.2%          | 69.4%     | 74.7%    | 83.3%        | 88.8%       | 93.2%         | 77.77    |
| Qwen3-VL-235B-A22B-Thinking | 61.8%        | 68.3%     | 78.4%    | 85.2%        | 92.1%       | 95.4%         | 80.20    |
| Qwen3-VL-30B-A3B-Thinking | 49.9%          | 65.8%     | 71.2%    | 84.2%        | 89.5%       | 91.8%         | 75.40    |
| Qwen3-VL-8B-Thinking      | 38.5%          | 56.0%     | 64.2%    | 83.6%        | 85.9%       | 91.5%         | 69.95    |
| Qwen3-VL-4B-Thinking      | 41.4%          | 56.4%     | 66.6%    | 84.1%        | 85.8%       | 90.0%         | 70.72    |
| MAI-UI-32B (+Zoom-In)     | 73.5%          | 70.9%     | -        | -            | -           | -             | -        |
| MAI-UI-32B                | 67.9%          | 67.6%     | -        | -            | -           | -             | -        |
| MAI-UI-8B (+Zoom-In)      | 70.9%          | 64.2%     | -        | -            | -           | -             | -        |
| MAI-UI-8B                 | 65.8%          | 60.1%     | -        | -            | -           | -             | -        |
| MAI-UI-2B (+Zoom-In)      | 62.8%          | 55.9%     | -        | -            | -           | -             | -        |
| MAI-UI-2B                 | 57.4%          | 52.0%     | -        | -            | -           | -             | -        |

Table 1: Localization benchmark scores for leading models.
</div>

<p align="center"><img width=1000 src="https://cdn-uploads.huggingface.co/production/uploads/6808a8cf6b8c599b583d0fe9/8jR7MZmNePY3J_wX4NWrZ.png"/><em>Holo2 models performance on the ScreenSpot-Pro benchmark.</em></p>

---


## Citation

```bibtex
@misc{hai2025holo2modelfamily,
      title={Holo2 - Open Foundation Models for Navigation and Computer Use Agents}, 
      author={H Company},
      year={2025},
      url=https://huggingface.co/collections/Hcompany/holo2, 
}