Keywords: Vision-Language Models, Visual Grounding, Reinforcement Learning, GUI Agents
TL;DR: A model trained for image-grounded reasoning that dynamically invokes multiple specialized tools to iteratively narrow its focus to the relevant region of the screen, thereby substantially improving visual grounding accuracy.
Abstract: Multimodal large language models (MLLMs) have markedly expanded the competence of graphical user‑interface (GUI) systems, propelling them beyond controlled simulations into complex, real‑world environments across diverse platforms.
Yet their practical usefulness is still constrained by the reliability of visual grounding—the ability to map textual references to precise on‑screen elements. This limitation prevents the system from accurately performing pointer‑level actions such as clicking or dragging.
To address this limitation, we introduce GUI-Spotlight, a model trained for image-grounded reasoning that dynamically invokes multiple specialized tools to iteratively narrow its focus to the relevant region of the screen, thereby substantially improving visual grounding accuracy.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9584