Keywords: GUI Grounding, Test-Time Scaling, GUI Agent
Abstract: GUI grounding is a critical capability for enabling GUI agents to execute tasks such as clicking and dragging.
However, in complex scenarios such as those in the ScreenSpot-Pro benchmark, existing models often perform poorly.
Using our proposed Masked Prediction Distribution (MPD) attribution method, we identify two primary sources of error:
high image resolution (leading to precision bias) and intricate interface elements (resulting in ambiguity bias).
To address these challenges, we introduce the Manipulation-based Chain of GUI Grounding (ManiCoG), which applies two key manipulations, coarse-to-fine focus and candidate selection, to mitigate these biases.
Our extensive experimental results demonstrate that ManiCoG significantly enhances the accuracy of various GUI grounding models in a training-free setting.
For instance, applying our method to the TianXi-Action-7B model boosts its accuracy on the ScreenSpot-Pro benchmark from 51.9\% to 57.8\%.
Furthermore, ablation studies confirm that ManiCoG remains robust and effective across diverse parameter configurations.
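
To make the two manipulations concrete, the following is a minimal illustrative sketch of a training-free grounding loop in the spirit of coarse-to-fine focus and candidate selection. It is not the authors' implementation: the `ground_point` stub, the `crop_frac` window size, and the candidate list (e.g. from an accessibility tree or a detector) are all assumptions made for illustration.

```python
# Illustrative sketch only (assumed interfaces, not the ManiCoG codebase).
from PIL import Image


def ground_point(image: Image.Image, instruction: str) -> tuple[int, int]:
    """Placeholder for any GUI grounding model call: (screenshot, instruction) -> pixel coordinate."""
    w, h = image.size
    return w // 2, h // 2  # dummy prediction


def coarse_to_fine_focus(image: Image.Image, instruction: str, crop_frac: float = 0.4) -> tuple[int, int]:
    """Coarse pass on the full high-resolution screen, then re-ground on a
    crop around the coarse prediction to reduce precision bias."""
    w, h = image.size
    cx, cy = ground_point(image, instruction)            # coarse pass
    cw, ch = int(w * crop_frac), int(h * crop_frac)      # focus window size (assumed)
    left = min(max(cx - cw // 2, 0), w - cw)
    top = min(max(cy - ch // 2, 0), h - ch)
    crop = image.crop((left, top, left + cw, top + ch))
    fx, fy = ground_point(crop, instruction)             # fine pass on the crop
    return left + fx, top + fy                           # map back to full-image coordinates


def candidate_selection(image: Image.Image, instruction: str,
                        candidates: list[tuple[int, int]]) -> tuple[int, int]:
    """Resolve ambiguity among similar elements by snapping the model's
    prediction to the nearest candidate element center (candidate source assumed)."""
    px, py = ground_point(image, instruction)
    return min(candidates, key=lambda c: (c[0] - px) ** 2 + (c[1] - py) ** 2)


if __name__ == "__main__":
    screen = Image.new("RGB", (3840, 2160))  # stand-in for a 4K professional-app screenshot
    print(coarse_to_fine_focus(screen, "click the Save button"))
    print(candidate_selection(screen, "click the Save button",
                              candidates=[(120, 80), (1900, 1040), (3700, 2100)]))
```

Both manipulations wrap an unmodified grounding model, which is what makes the approach training-free in this sketch.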
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2126