Keywords: GUI grounding, visual grounding, MLLMs
TL;DR: We propose GUI-AIMA, an attention-based GUI visual grounding method supervised on anchored attention with query-adaptive multi-head weighting.
Abstract: Graphical user interface (GUI) grounding is a key function of computer-use agents, mapping natural-language instructions to actionable screen regions. Existing approaches based on Multimodal Large Language Models (MLLMs) typically formulate it as a text-based coordinate generation task, yet directly generating precise coordinates from visual inputs remains challenging and computationally intensive. An intuitive way to implement GUI grounding is to $\textit{first select visual patches relevant to the instructions and then determine the precise click location within those patches}$. Based on the observation that general MLLMs possess a native grounding capability that correlates strongly with query-to-visual attention, we propose GUI-AIMA, an attention-only and coordinate-free supervised fine-tuning framework for efficient GUI grounding. This framework aligns the intrinsic multimodal attention of MLLMs with patch-wise grounding signals.
Specifically, we convert coordinate-based grounding boxes into soft patch-wise labels that account for patch overlap and the center-click manner (see the first sketch below). For attention aggregation, we simplify the merging of attention predictions across all query tokens into a single anchored attention vector with a learnable $\texttt{<ANCHOR>}$ token.
More importantly, GUI-AIMA includes a query-adaptive multi-head weighting mechanism that aggregates multi-head attention by prioritizing text-vision affinity heads identified via visual-sink query tokens (see the second sketch below).
GUI-AIMA-3B, trained on a small set of only 85k screenshots, achieves state-of-the-art performance among 3B models ($\textit{i.e.}$, $\textbf{49.8\%}$ average on ScreenSpot-Pro, $\textbf{58.3\%}$ average on OSWorld-G, and $\textbf{91.5\%}$ average on ScreenSpot-v2).
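A minimal sketch, not the authors' released code, of how a coordinate-based grounding box could be turned into soft patch-wise labels: each patch is weighted by its overlap area with the box, with an extra Gaussian emphasis toward the box center to reflect the center-click manner. The patch size, the Gaussian form, and the function name are illustrative assumptions.

```python
import torch

def bbox_to_soft_patch_labels(bbox, img_w, img_h, patch=28, center_sigma=0.5):
    """Hypothetical helper: bbox = (x1, y1, x2, y2) in pixels;
    returns a normalized (h_p * w_p,) soft label vector over patches."""
    x1, y1, x2, y2 = bbox
    w_p, h_p = img_w // patch, img_h // patch  # patch-grid resolution

    # Per-axis overlap length between the box and every patch cell.
    xs = torch.arange(w_p) * patch
    ys = torch.arange(h_p) * patch
    ox = (torch.minimum(xs + patch, torch.tensor(float(x2)))
          - torch.maximum(xs, torch.tensor(float(x1)))).clamp(min=0)
    oy = (torch.minimum(ys + patch, torch.tensor(float(y2)))
          - torch.maximum(ys, torch.tensor(float(y1)))).clamp(min=0)
    overlap = oy[:, None] * ox[None, :]  # (h_p, w_p) overlap areas

    # Assumed Gaussian emphasis on patches near the box center (center-click manner).
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    gx = torch.exp(-((xs + patch / 2 - cx) / (center_sigma * (x2 - x1) + 1e-6)) ** 2)
    gy = torch.exp(-((ys + patch / 2 - cy) / (center_sigma * (y2 - y1) + 1e-6)) ** 2)
    soft = overlap * gy[:, None] * gx[None, :]

    # Normalize so the soft labels form a distribution over patches.
    return (soft / soft.sum().clamp(min=1e-8)).flatten()
```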
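A second sketch, again under stated assumptions rather than the paper's exact formulation, of query-adaptive multi-head weighting: each head is scored by how much attention mass its query tokens route onto visual tokens (a proxy for text-vision affinity under visual-sink queries), and heads are aggregated with softmax weights over those scores. The temperature and signature are assumptions.

```python
import torch
import torch.nn.functional as F

def aggregate_heads(attn, visual_mask, tau=0.1):
    """Hypothetical helper: attn is (n_heads, n_query, n_keys) attention from
    the anchored/query tokens; visual_mask is a (n_keys,) bool marking visual tokens."""
    visual_attn = attn[..., visual_mask]          # (n_heads, n_query, n_vis)
    # Per-head affinity: mean attention mass landing on visual tokens.
    affinity = visual_attn.sum(-1).mean(-1)       # (n_heads,)
    # Query-adaptive head weights favoring text-vision affinity heads.
    weights = F.softmax(affinity / tau, dim=0)
    # Weighted aggregation of the per-head visual attention maps.
    return (weights[:, None, None] * visual_attn).sum(0)  # (n_query, n_vis)
```

The resulting (n_query, n_vis) map can then be matched against the soft patch labels from the first sketch, e.g., with a cross-entropy-style objective over patches.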
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23638