SAGE-GUI: Training-Free Structure-Aware Grounding Enhancement for GUI Agents

ACL ARR 2026 May Submission15756 Authors

26 May 2026 (modified: 02 Jun 2026) | ACL ARR 2026 May Submission | Readers: Everyone | License: CC BY 4.0
Keywords: GUI grounding, vision-language models, multimodal agents, screen understanding
Abstract: Reliable execution of graphical user interface (GUI) agents requires accurate grounding of each action to the corresponding interface target. Existing methods commonly formulate GUI grounding as a screenshot-to-click localization task. However, the correct target is often determined by action semantics, element attributes, local structural context, and inter-element relations that extend beyond visual or textual similarity. We propose SAGE-GUI, a training-free, plug-and-play module for structure-aware GUI grounding. The method constructs an action-oriented UI representation, resolves action targets over structured interface objects, and employs conservative visual arbitration for verification and recovery. In addition, we introduce StructSpot, a multi-level diagnostic benchmark designed to evaluate grounding performance at varying levels of structural complexity, ranging from text matching to relational target resolution. Experiments on StructSpot and ScreenSpot demonstrate that SAGE-GUI consistently improves the performance of existing grounding backends without requiring model retraining, highlighting the importance of structure-aware action-target resolution for GUI grounding.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, cross-modal application
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
EMNLP 2026 AI Reviewing Experiment: yes
Submission Number: 15756