GOLD: Global Overview to Local Detail in Efficient Visual Grounding for GUI Agents

Published: 16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · License: CC BY 4.0
Keywords: AI Agent, GUI Agent, GUI Grounding, VLM
TL;DR: Efficient GUI Grounding Framework
Abstract: Graphical User Interface (GUI) agents powered by Vision-Language Models (VLMs) have recently emerged as a promising direction for multimodal automation. However, VLM-based GUI grounding incurs substantial computational overhead, making deployment on edge devices infeasible and cloud serving prohibitively expensive. Prior attempts to reduce background or history vision tokens partially alleviate this issue, but they either rely on sparsity among foreground elements or require extensive fine-tuning. In this work, we present GOLD (Global Overview to Local Detail), an efficient GUI grounding framework that is tuning-free and robust across varying interface densities. GOLD operates in three stages. In the Global Pruning Stage, we downsample GUI screenshots and feed them into the VLM to identify relevant regions, achieving efficient context reduction. In the Local Refinement Stage, only crops of the detected regions are passed to the VLM at high resolution. To retain broader context, the Global-Local Context Fusion Stage aggregates the outputs of both stages, integrating global and local information. Experimental results show that GOLD reduces TFLOPs by 78% while improving accuracy by 0.7 percentage points when integrated into the state-of-the-art GUI grounding method on the ScreenSpot-V2 benchmark. These findings highlight the efficiency of our global-to-local grounding framework.
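The global-to-local pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `Box` type, and the coordinate-mapping details are assumptions, and the VLM calls are stubbed out as callables so the coordinate arithmetic between the downsampled, cropped, and full-resolution views is the only thing shown.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned region in pixel coordinates (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float


def global_stage(full_size, scale, coarse_predict):
    """Global Pruning: run the VLM on a downsampled screenshot and lift
    the predicted region back into full-resolution coordinates."""
    w, h = full_size
    small = (w // scale, h // scale)          # downsampled screenshot size
    b = coarse_predict(small)                 # Box in downsampled coordinates
    return Box(b.x0 * scale, b.y0 * scale, b.x1 * scale, b.y1 * scale)


def local_stage(region, fine_predict):
    """Local Refinement: run the VLM on the high-resolution crop and map
    the crop-local click point back to full-resolution coordinates."""
    crop_size = (region.x1 - region.x0, region.y1 - region.y0)
    cx, cy = fine_predict(crop_size)          # point in crop-local coordinates
    return (region.x0 + cx, region.y0 + cy)


def ground(full_size, scale, coarse_predict, fine_predict):
    """End-to-end grounding: coarse region proposal, then fine localization.
    (The paper's fusion stage additionally combines both stages' outputs;
    that aggregation is omitted here.)"""
    region = global_stage(full_size, scale, coarse_predict)
    return local_stage(region, fine_predict)
```

With stub predictors, a region proposed at (10, 10)-(20, 20) on a 4x-downsampled 1920x1080 screenshot lifts to (40, 40)-(80, 80) at full resolution, and a click at the crop center lands at (60, 60).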
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6702