Keywords: Visual Reasoning, Visual Search, GUI Agent, GUI Grounding
TL;DR: We propose a visual search method that improves GUI grounding in professional high-resolution computer use, almost tripling the performance of the previous SOTA model.
Abstract: Multi-modal large language models (MLLMs) are rapidly advancing in visual understanding and reasoning, enhancing GUI agents for tasks such as web browsing and mobile interaction. However, while these agents rely on reasoning skills for action planning, they depend solely on the model's inherent capability for UI grounding (localizing the target element), and these grounding models struggle with high-resolution displays, small targets, and complex environments. In this work, we introduce a novel method that improves MLLMs’ grounding performance in high-resolution, complex UI environments through a visual search approach based on visual reasoning. Additionally, we create a new benchmark, dubbed ScreenSpot-Pro, designed to comprehensively evaluate model capabilities in professional high-resolution settings. This benchmark consists of real-world high-resolution images and expert-annotated tasks from diverse professional domains. Our experiments show that existing GUI grounding models perform poorly on this dataset, with the best achieving only 18.9\%, whereas our visual-reasoning strategy significantly improves performance, reaching 48.1\% without any additional training.
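The abstract does not spell out the search procedure, but a coarse-to-fine loop is one common way to realize a visual search for grounding: repeatedly ask the model to narrow the screen region containing the target until a click point can be read off. The sketch below is purely illustrative of that idea, not the paper's method; `mllm_locate`, `visual_search`, and all parameters are hypothetical names.

```python
# Hypothetical sketch of coarse-to-fine visual search for GUI grounding.
# `mllm_locate` stands in for an arbitrary MLLM grounding call; it is NOT
# the paper's actual API, just an illustration of the general technique.
from dataclasses import dataclass


@dataclass
class Box:
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


def mllm_locate(screenshot, region: Box, instruction: str) -> Box:
    """Placeholder: ask an MLLM for a rough target box inside `region`.

    A real implementation would crop `screenshot` to `region`, send the
    crop plus `instruction` to the model, and map the answer back to
    full-screen coordinates. Here it returns the region unchanged.
    """
    return region


def visual_search(screenshot, instruction: str,
                  full: Box, min_size: int = 64) -> tuple[int, int]:
    """Narrow the region iteratively until it is small enough to click."""
    region = full
    while region.width > min_size or region.height > min_size:
        candidate = mllm_locate(screenshot, region, instruction)
        # Stop if the model makes no progress, to avoid an infinite loop.
        if candidate.width >= region.width and candidate.height >= region.height:
            break
        region = candidate
    # Return the center of the final region as the click point.
    return (region.left + region.width // 2,
            region.top + region.height // 2)
```

Because each step zooms into a smaller crop, the model always reasons over an image near its native input resolution, which is one plausible explanation for why such a strategy helps on high-resolution professional screens with small targets.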
Submission Number: 100