SpiritSight Agent: Advanced GUI Agent with One Look

Zhiyuan Huang; Harry Ziming Cheng; Junting Pan; Mingjie Zhan

27 Sept 2024 (modified: 14 Nov 2024), ICLR 2025 Conference Withdrawn Submission, CC BY 4.0

Keywords: GUI Agent, VLLM, decision-making

Abstract: Graphical User Interface (GUI) agents show remarkable abilities in assisting human-computer interaction by automating users' navigation on digital devices. An ideal GUI agent is expected to achieve high accuracy, low latency, and generality across various GUI platforms. Recent visual-based approaches show promise, taking advantage of advanced Vision Language Models (VLMs). Although they generally meet the requirements of generality and low latency, these visual-based GUI agents often fall short in localization accuracy. To address this issue, we propose $\textbf{SpiritSight}$, a visual-based, generalist, end-to-end GUI agent with outstanding grounding abilities. First, we create a multi-level, large-scale, high-quality GUI training dataset with scalable methods and train SpiritSight using curriculum learning, giving it robust GUI understanding and localization capabilities. Second, we introduce the $\textbf{Universal Block Parsing (UBP)}$ method, which frames the localization task as a multi-image QA problem, further enhancing SpiritSight's ability to ground GUI objects. With these efforts, SpiritSight consistently outperforms previous SOTA methods across numerous major automated GUI navigation benchmarks. Notably, SpiritSight-8B achieves a 46.1% step Success Rate (SR) on the Mind2Web benchmark without any candidate element input, $\textbf{more than doubling}$ the performance of SeeClick (20.9%) at a comparable model scale. SpiritSight also outperforms other visual-language-based methods on various GUI platforms, demonstrating its superior capability and compatibility in GUI agent tasks. The models and the code will be made available upon publication.
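
The abstract does not spell out how UBP works, so the following is only an illustrative sketch of the general idea behind treating one screenshot as several images: tile the screenshot into blocks and express a target point as a block index plus an offset inside that block. The function names, the row-major block order, and the 448-pixel block size are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Tuple

# (left, top, right, bottom) in full-screenshot pixels
Block = Tuple[int, int, int, int]


def split_into_blocks(width: int, height: int, block: int) -> List[Block]:
    """Tile a width x height screenshot into row-major blocks of at most block x block pixels."""
    blocks = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            blocks.append((left, top, min(left + block, width), min(top + block, height)))
    return blocks


def to_block_coords(x: int, y: int, width: int, block: int) -> Tuple[int, int, int]:
    """Map a global pixel (x, y) to (block index, local x, local y). Hypothetical helper."""
    cols = -(-width // block)  # ceiling division: number of blocks per row
    index = (y // block) * cols + (x // block)
    return index, x % block, y % block


def to_global_coords(index: int, local_x: int, local_y: int, width: int, block: int) -> Tuple[int, int]:
    """Inverse of to_block_coords: recover the global pixel from a block-local answer."""
    cols = -(-width // block)
    row, col = divmod(index, cols)
    return col * block + local_x, row * block + local_y


if __name__ == "__main__":
    blocks = split_into_blocks(1000, 600, 448)
    idx, lx, ly = to_block_coords(900, 500, 1000, 448)
    print(len(blocks), idx, lx, ly, to_global_coords(idx, lx, ly, 1000, 448))
```

In a scheme like this, a model would only need to predict which block contains the element and a position within that block, and the inverse mapping would restore screen coordinates for the click action. Whether SpiritSight's UBP uses this exact decomposition is not stated in the abstract.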

Primary Area: applications to robotics, autonomy, planning

Submission Number: 10526
