Keywords: GUI grounding, vision-language models, multimodal agents, screen understanding
Abstract: Reliable execution by graphical user interface (GUI) agents requires accurately grounding each action to its target interface element.
Existing methods commonly formulate GUI grounding as a screenshot-to-click localization task. However, the correct target is often determined by action semantics, element attributes, local structural context, and inter-element relations that extend beyond visual or textual similarity.
We propose SAGE-GUI, a training-free and plug-and-play module for structure-aware GUI grounding.
The proposed method constructs an action-oriented UI representation, resolves action targets over structured interface objects, and employs conservative visual arbitration for verification and recovery.
In addition, we introduce StructSpot, a multi-level diagnostic benchmark that evaluates grounding performance at increasing levels of structural complexity, from text matching to relational target resolution.
Experiments on StructSpot and ScreenSpot demonstrate that SAGE-GUI consistently improves the performance of existing grounding backends without requiring model retraining, thereby highlighting the importance of structure-aware action-target resolution for GUI grounding.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, cross-modal application
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
EMNLP 2026 AI Reviewing Experiment: yes
Submission Number: 15756