Keywords: Vision-Language Models, Visual Grounding, Reinforcement Learning, GUI Agents
TL;DR: A model trained for image-grounded reasoning that dynamically invokes multiple specialized tools to iteratively narrow its focus to the relevant region of the screen, thereby substantially improving visual grounding accuracy.
Abstract: Multimodal large language models (MLLMs) have markedly expanded the competence of graphical user‑interface (GUI) systems, propelling them beyond controlled simulations into complex, real‑world environments across diverse platforms.
Yet their practical usefulness is still constrained by the reliability of visual grounding—the ability to map textual references to precise on‑screen elements. This limitation prevents the system from accurately performing pointer‑level actions such as clicking or dragging.
To address this limitation, we introduce GUI-Spotlight, a model trained for image-grounded reasoning that dynamically invokes multiple specialized tools to iteratively narrow its focus to the relevant region of the screen, thereby substantially improving visual grounding accuracy.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9584