Keywords: GUI grounding, visual grounding, MLLMs
TL;DR: We propose GUI-AIMA, an attention-based GUI visual grounding method supervised on anchored attention with query-adaptive multi-head weighting.
Abstract: Graphical user interface (GUI) grounding is a key function of computer-use agents, mapping natural-language instructions to actionable screen regions. Existing approaches based on Multimodal Large Language Models (MLLMs) typically formulate it as a text-based coordinate generation task, yet directly generating precise coordinates from visual inputs remains challenging and computationally intensive. An intuitive way to implement GUI grounding is to $\textit{first select visual patches relevant to the instructions and then determine the precise click location within those patches}$. Based on the observation that general MLLMs possess a native grounding capability that correlates strongly with query-to-visual attention, we propose GUI-AIMA, an attention-only and coordinate-free supervised fine-tuning framework for efficient GUI grounding. This framework aligns the intrinsic multimodal attention of MLLMs with patch-wise grounding signals.
Specifically, we convert coordinate-based grounding boxes into soft patch-wise labels that account for patch overlap and the center-click manner (see the first sketch below). For attention aggregation, we simplify the merging of attention predictions across all query tokens into a single anchored attention vector with a learnable $\texttt{<ANCHOR>}$ token.
More importantly, GUI-AIMA includes a query-adaptive multi-head weighting mechanism that aggregates multi-head attention by prioritizing text-vision affinity heads identified via visual-sink query tokens (see the second sketch below).
GUI-AIMA-3B, trained on a small set of only 85k screenshots, achieves state-of-the-art performance among 3B models ($\textit{i.e.}$, $\textbf{49.8\%}$ average on ScreenSpot-Pro, $\textbf{58.3\%}$ average on OSWorld-G, and $\textbf{91.5\%}$ average on ScreenSpot-v2).
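A minimal sketch, not the authors' released code, of how a coordinate-based grounding box could be turned into soft patch-wise labels: each patch is weighted by its overlap area with the box, with an extra Gaussian emphasis toward the box center to reflect the center-click manner. The patch size, the Gaussian form, and the function name are illustrative assumptions.

```python
import torch

def bbox_to_soft_patch_labels(bbox, img_w, img_h, patch=28, center_sigma=0.5):
    """Hypothetical helper: bbox = (x1, y1, x2, y2) in pixels;
    returns a normalized (h_p * w_p,) soft label vector over patches."""
    x1, y1, x2, y2 = bbox
    w_p, h_p = img_w // patch, img_h // patch  # patch-grid resolution

    # Per-axis overlap length between the box and every patch cell.
    xs = torch.arange(w_p) * patch
    ys = torch.arange(h_p) * patch
    ox = (torch.minimum(xs + patch, torch.tensor(float(x2)))
          - torch.maximum(xs, torch.tensor(float(x1)))).clamp(min=0)
    oy = (torch.minimum(ys + patch, torch.tensor(float(y2)))
          - torch.maximum(ys, torch.tensor(float(y1)))).clamp(min=0)
    overlap = oy[:, None] * ox[None, :]  # (h_p, w_p) overlap areas

    # Assumed Gaussian emphasis on patches near the box center (center-click manner).
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    gx = torch.exp(-((xs + patch / 2 - cx) / (center_sigma * (x2 - x1) + 1e-6)) ** 2)
    gy = torch.exp(-((ys + patch / 2 - cy) / (center_sigma * (y2 - y1) + 1e-6)) ** 2)
    soft = overlap * gy[:, None] * gx[None, :]

    # Normalize so the soft labels form a distribution over patches.
    return (soft / soft.sum().clamp(min=1e-8)).flatten()
```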
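A second sketch, again under stated assumptions rather than the paper's exact formulation, of query-adaptive multi-head weighting: each head is scored by how much attention mass its query tokens route onto visual tokens (a proxy for text-vision affinity under visual-sink queries), and heads are aggregated with softmax weights over those scores. The temperature and signature are assumptions.

```python
import torch
import torch.nn.functional as F

def aggregate_heads(attn, visual_mask, tau=0.1):
    """Hypothetical helper: attn is (n_heads, n_query, n_keys) attention from
    the anchored/query tokens; visual_mask is a (n_keys,) bool marking visual tokens."""
    visual_attn = attn[..., visual_mask]          # (n_heads, n_query, n_vis)
    # Per-head affinity: mean attention mass landing on visual tokens.
    affinity = visual_attn.sum(-1).mean(-1)       # (n_heads,)
    # Query-adaptive head weights favoring text-vision affinity heads.
    weights = F.softmax(affinity / tau, dim=0)
    # Weighted aggregation of the per-head visual attention maps.
    return (weights[:, None, None] * visual_attn).sum(0)  # (n_query, n_vis)
```

The resulting (n_query, n_vis) map can then be matched against the soft patch labels from the first sketch, e.g., with a cross-entropy-style objective over patches.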
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23638