GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: GUI Agents, coordinate-free visual grounding, VLM, VLA
TL;DR: We propose GUI-Actor, a VLM-based, coordinate-free GUI grounding method with an attention-based action head and verifier, achieving state-of-the-art results and strong generalization.
Abstract: One of the principal challenges in building VLM-powered GUI agents is visual grounding—localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment due to the lack of explicit spatial supervision; inability to handle ambiguous supervision targets, since single-point predictions penalize valid variations; and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models like Vision Transformers. In this paper, we propose **GUI-Actor**, a VLM-based method for coordinate-free GUI grounding. At its core, **GUI-Actor** introduces an attention-based action head that learns to align a dedicated `<ACTOR>` token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. Complementing this, we further design a grounding verifier to evaluate the proposed candidates and select the most plausible action region for execution. Extensive experiments show that **GUI-Actor** outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. Notably, **GUI-Actor-7B** achieves scores of **40.7** with Qwen2-VL and **44.6** with Qwen2.5-VL as backbones, outperforming **UI-TARS-72B (38.1)** on ScreenSpot-Pro, with significantly fewer parameters and less training data. Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head (~100M parameters for a 7B model) while keeping the VLM backbone frozen is sufficient to achieve performance comparable to previous state-of-the-art models, highlighting that **GUI-Actor** can endow the underlying VLM with effective grounding capabilities without compromising its general-purpose strengths. Project page: [https://aka.ms/GUI-Actor](https://aka.ms/GUI-Actor)
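
To make the abstract's core mechanism concrete, below is a minimal PyTorch sketch of an attention-based action head: the hidden state of a dedicated `<ACTOR>` token attends over the visual patch token states, producing a probability map over patches rather than text coordinates. This is not the authors' implementation; the module name, the projection dimension, and the 3584-dim hidden size (that of Qwen2-VL-7B) are illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of an attention-based action head.
import torch
import torch.nn as nn

class AttentionActionHead(nn.Module):
    def __init__(self, hidden_dim: int = 3584, proj_dim: int = 1024):
        super().__init__()
        self.q_proj = nn.Linear(hidden_dim, proj_dim)  # projects the <ACTOR> hidden state
        self.k_proj = nn.Linear(hidden_dim, proj_dim)  # projects visual patch hidden states
        self.scale = proj_dim ** -0.5

    def forward(self, actor_state: torch.Tensor, patch_states: torch.Tensor) -> torch.Tensor:
        # actor_state:  (B, D)    hidden state of the <ACTOR> token
        # patch_states: (B, N, D) hidden states of N visual patch tokens
        q = self.q_proj(actor_state).unsqueeze(1)                  # (B, 1, P)
        k = self.k_proj(patch_states)                              # (B, N, P)
        attn = (q @ k.transpose(-1, -2)).squeeze(1) * self.scale   # (B, N)
        # Probability over patches; high-mass patches mark candidate action regions.
        return attn.softmax(dim=-1)

# Toy usage: one screenshot tiled into 14 x 14 = 196 patches.
head = AttentionActionHead()
probs = head(torch.randn(1, 3584), torch.randn(1, 196, 3584))
candidates = probs.topk(k=3, dim=-1).indices  # several candidate regions in one forward pass
# A separate grounding verifier (per the abstract) would then score these
# candidates and select the most plausible region for action execution.
```

Because the head attends over patch tokens directly, supervision can spread probability mass across an entire target region instead of a single point, which is how this formulation sidesteps the coordinate-granularity mismatch described above.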
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 17312