GOLD: Global Overview to Local Detail in Efficient Visual Grounding for GUI Agents

Published: 16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · License: CC BY 4.0
Keywords: AI Agent, GUI Agent, GUI Grounding, VLM
TL;DR: Efficient GUI Grounding Framework
Abstract: Graphical User Interface (GUI) agents powered by Vision-Language Models (VLMs) have recently emerged as a promising direction for multimodal automation. However, VLM-based GUI grounding incurs substantial computational overhead, making deployment on edge devices infeasible and cloud serving prohibitively expensive. Prior attempts to reduce background or history vision tokens partially alleviate this issue, but they either rely on sparsity among foreground elements or require extensive fine-tuning. In this work, we present GOLD (Global Overview to Local Detail), an efficient GUI grounding framework that is tuning-free and robust across varying interface densities. GOLD operates in three stages. In the Global Pruning Stage, we downsample GUI screenshots and feed them into the VLM to identify relevant regions, achieving efficient context reduction. In the Local Refinement Stage, only crops of the detected regions are passed to the VLM at high resolution. To retain broader context, the Global-Local Context Fusion Stage aggregates the outputs of both stages, integrating global and local information. Experimental results show that GOLD reduces TFLOPs by 78% while improving accuracy by 0.7 percentage points when integrated into the state-of-the-art GUI grounding method on the ScreenSpot-V2 benchmark. These findings highlight the efficiency of our global-to-local grounding framework.
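The global-to-local pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `Box` type, and the coordinate-mapping details are assumptions, and the VLM calls are stubbed out as callables so the coordinate arithmetic between the downsampled, cropped, and full-resolution views is the only thing shown.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned region in pixel coordinates (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float


def global_stage(full_size, scale, coarse_predict):
    """Global Pruning: run the VLM on a downsampled screenshot and lift
    the predicted region back into full-resolution coordinates."""
    w, h = full_size
    small = (w // scale, h // scale)          # downsampled screenshot size
    b = coarse_predict(small)                 # Box in downsampled coordinates
    return Box(b.x0 * scale, b.y0 * scale, b.x1 * scale, b.y1 * scale)


def local_stage(region, fine_predict):
    """Local Refinement: run the VLM on the high-resolution crop and map
    the crop-local click point back to full-resolution coordinates."""
    crop_size = (region.x1 - region.x0, region.y1 - region.y0)
    cx, cy = fine_predict(crop_size)          # point in crop-local coordinates
    return (region.x0 + cx, region.y0 + cy)


def ground(full_size, scale, coarse_predict, fine_predict):
    """End-to-end grounding: coarse region proposal, then fine localization.
    (The paper's fusion stage additionally combines both stages' outputs;
    that aggregation is omitted here.)"""
    region = global_stage(full_size, scale, coarse_predict)
    return local_stage(region, fine_predict)
```

With stub predictors, a region proposed at (10, 10)-(20, 20) on a 4x-downsampled 1920x1080 screenshot lifts to (40, 40)-(80, 80) at full resolution, and a click at the crop center lands at (60, 60).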
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6702