---
license: cc-by-nc-4.0
base_model:
- Qwen/Qwen3-VL-235B-A22B-Thinking
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
- action
- pytorch
- computer use
- gui agents
---

# **Holo2: Foundational Models for Navigation and Computer Use Agents**

## **Model Description**

**Holo2** represents the next major step in developing large-scale Vision-Language Models (VLMs) for **multi-domain GUI Agents**. These agents can operate in real digital environments (web, desktop, and mobile) by interpreting interfaces, reasoning over content, and executing actions.

Our **Holo2** family emphasizes **navigation and task execution** across diverse real and simulated environments, extending beyond static perception to **multi-step, goal-directed behavior**. It builds upon the strengths of **Holo1.5** in UI localization and screen content understanding, with major improvements in **policy learning**, **action grounding**, and **cross-environment generalization**.

The **Holo2** series comes in four model sizes:

- **Holo2-4B:** fully open under Apache 2.0
- **Holo2-8B:** fully open under Apache 2.0
- **Holo2-30B-A3B:** research-only license (non-commercial). For commercial use, please contact us.
- **Holo2-235B-A22B:** research-only license (non-commercial). For commercial use, please contact us.

These models are designed to provide reliable, accurate, and efficient foundations for next-generation computer-use (CU) agents, such as Surfer-H.

- **Developed by:** [**H Company**](https://www.hcompany.ai/)
- **Model type:** Vision-Language Model for Navigation and Computer Use Agents
- **Fine-tuned from model:** Qwen/Qwen3-VL-235B-A22B-Thinking
- **Blog Post:** https://hcompany.ai/holo2-235b-a22b-preview
- **License:** CC BY-NC 4.0 (research-only, non-commercial). For commercial use, please contact us.

## Get Started with the Model

Please have a look at the [cookbook](https://github.com/hcompai/hai-cookbook/tree/main/holo2) in our repo, where we provide examples for both self-hosting and API use.

## **Training Strategy**

Our models are trained on high-quality proprietary data for UI understanding and action prediction, following a multi-stage training pipeline. The training dataset is a carefully curated mix of open-source datasets, large-scale synthetic data, and human-annotated samples.

Training proceeds in two stages: large-scale supervised fine-tuning, followed by online reinforcement learning (GRPO). This yields state-of-the-art (SOTA) performance in interpreting UIs and performing actions on large, complex screens.

## **Agentic Localization**

High-resolution 4K interfaces are challenging for localization models: small UI elements can be difficult to pinpoint on a large display. With agentic localization, Holo2 can iteratively refine its predictions, improving accuracy with each step and unlocking 10-20% relative gains across all Holo2 model sizes.

Holo2-235B-A22B reaches 70.6% accuracy on ScreenSpot-Pro in a single step. Within three steps, it achieves 78.5%, setting a new state of the art on the most challenging GUI grounding benchmark.

## **Results**

### **Holo2: SOTA UI Localization**

UI localization measures how precisely an agent can locate the on-screen elements (buttons, inputs, links) needed for accurate interaction. Holo2 continues to set new standards for localization accuracy across web, OS, and mobile benchmarks.

Performance of Holo2 models on the ScreenSpot-Pro benchmark.
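
The Agentic Localization section says that predictions are refined iteratively but does not describe the mechanism. One plausible reading, assumed here and not confirmed by this card, is a zoom-in loop: predict a point, crop a smaller window around it, and predict again inside the crop. The sketch below illustrates that idea only. The names `agentic_localize`, `zoom_box`, and `make_toy_localizer`, the `Localizer` interface, and the 0.5 zoom factor are all hypothetical, and the toy localizer stands in for a real model call.

```python
from typing import Callable, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), full-screen pixels
Point = Tuple[int, int]          # (x, y)

# Hypothetical interface: takes the crop region and the instruction, returns a
# click point in crop-relative pixels. A real version would crop the screenshot
# to `box` and query the model with that image.
Localizer = Callable[[Box, str], Point]


def zoom_box(center: Point, box: Box, factor: float) -> Box:
    """Return a window `factor` times the size of `box`, centered on `center`
    where possible and shifted so that it stays inside `box`."""
    left, top, right, bottom = box
    width = max(1, int((right - left) * factor))
    height = max(1, int((bottom - top) * factor))
    new_left = min(max(center[0] - width // 2, left), right - width)
    new_top = min(max(center[1] - height // 2, top), bottom - height)
    return (new_left, new_top, new_left + width, new_top + height)


def agentic_localize(
    predict: Localizer,
    screen_size: Tuple[int, int],
    instruction: str,
    steps: int = 3,
    factor: float = 0.5,  # assumed zoom factor, not taken from the card
) -> Point:
    """Run `steps` rounds of predict-then-zoom and return the last prediction
    in full-screen coordinates."""
    box: Box = (0, 0, screen_size[0], screen_size[1])
    point: Point = (screen_size[0] // 2, screen_size[1] // 2)
    for _ in range(steps):
        local_x, local_y = predict(box, instruction)
        # Map the crop-relative prediction back to full-screen coordinates.
        point = (box[0] + local_x, box[1] + local_y)
        box = zoom_box(point, box, factor)
    return point


def make_toy_localizer(target: Point, grid: int = 64) -> Localizer:
    """Stand-in for a model: it knows the target but can only answer on a
    grid x grid lattice over whatever crop it is shown."""
    def predict(box: Box, instruction: str) -> Point:
        left, top, right, bottom = box
        cell_x = max(1, (right - left) // grid)
        cell_y = max(1, (bottom - top) // grid)
        local_x = ((target[0] - left) // cell_x) * cell_x
        local_y = ((target[1] - top) // cell_y) * cell_y
        return (local_x, local_y)
    return predict


if __name__ == "__main__":
    toy = make_toy_localizer(target=(3037, 1511))
    for n in (1, 3):
        print(n, agentic_localize(toy, (3840, 2160), "click the target", steps=n))
```

Because the toy localizer's resolution scales with the crop it is shown, each zoom step shrinks its quantization error, which mirrors why refinement helps with small elements on a 4K screen. For the reported ScreenSpot-Pro numbers, going from 70.6% to 78.5% is a relative gain of (78.5 - 70.6) / 70.6 ≈ 11.2%, which falls within the 10-20% range stated above.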