Keywords: GUI Grounding, GUI Parsing
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have substantially improved GUI grounding tasks. However, prior methods still face two challenges: (1) They predict coordinates as discrete tokens in an autoregressive text generation paradigm, which constrains grounding accuracy and leads to sub-optimal inference efficiency; (2) Their predictions are restricted to predefined element sets, and they lack the ability to comprehensively parse the entire interface, thereby impeding the versatility and generalizability required for downstream applications. To address these challenges, we introduce Grounding GUI Anything (GGA), an efficient end-to-end framework that enables semantically aware and fine-grained interface parsing with continuous coordinate decoding. By bridging the MLLM with a dedicated regression-based decoder, GGA jointly leverages the enhanced visual and textual representations to regress target coordinates within a continuous spatial domain. This design overcomes the quantization and sequential limitations of traditional discrete token modeling, thus enhancing both localization accuracy and inference speed. Furthermore, to improve robustness and mitigate hallucination, we incorporate a rejection mechanism that enables the model to identify non-existent elements. To facilitate systematic evaluation, we introduce ScreenParse, a comprehensive benchmark designed to assess the structural perception capabilities of GUI grounding models across diverse real-world scenarios. Extensive experiments on the ScreenSpot, ScreenSpot-v2, CAGUI-Grounding, and ScreenParse benchmarks demonstrate that GGA consistently achieves superior performance compared to existing state-of-the-art methods. All resources will be made publicly available for future research.
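To make the core idea concrete, the sketch below illustrates the general pattern the abstract describes: a regression-based decoder head that maps fused MLLM representations to continuous, normalized coordinates, plus a rejection logit for non-existent elements. This is a minimal illustration of the technique, not the authors' implementation; all names, dimensions, and the use of a single query representation are assumptions made for the example.

```python
# Minimal sketch (not the paper's implementation): a regression head that maps an
# MLLM's fused visual-textual representation to continuous (x, y) coordinates in
# [0, 1], plus a rejection logit for elements that do not exist on the screen.
# Names and dimensions (CoordRegressionHead, hidden_dim=4096) are illustrative.
import torch
import torch.nn as nn


class CoordRegressionHead(nn.Module):
    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.GELU(),
            nn.Linear(hidden_dim // 4, 3),  # outputs: x, y, rejection logit
        )

    def forward(self, query_state: torch.Tensor):
        out = self.mlp(query_state)
        xy = torch.sigmoid(out[..., :2])   # continuous coordinates, no token quantization
        reject_logit = out[..., 2]         # high value => predict "element not present"
        return xy, reject_logit


# Usage: `query_state` stands in for the MLLM's representation of the grounding
# query (random data here, purely for illustration).
head = CoordRegressionHead(hidden_dim=4096)
xy, reject_logit = head(torch.randn(1, 4096))
print(xy, torch.sigmoid(reject_logit))
```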
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8589