UItron: Foundational GUI Agent with Advanced Perception and Planning

16 Sept 2025 (modified: 24 Nov 2025) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0
Keywords: applications, GUI agent, language agent, GUI grounding, executable language grounding
TL;DR: Using systematic data engineering and a training framework that combines supervised finetuning with curriculum reinforcement learning, UItron makes significant advances in both Chinese and English GUI agent interaction scenarios.
Abstract: GUI agents aim to automate operations on mobile and PC devices, an important task within the broader goal of achieving artificial general intelligence. The rapid advancement of visual language models has accelerated the development of GUI agents, owing to their strong capabilities in visual understanding and task planning. However, building a GUI agent remains challenging due to the scarcity of operation trajectories, the lack of interactive infrastructure, and the limited initial capabilities of foundation models. In this work, we introduce UItron, an open-source foundational model for automatic GUI agents, featuring advanced GUI perception, grounding, and planning capabilities. UItron highlights the necessity of systematic data engineering and interactive infrastructure as foundational components for advancing GUI agent development. It systematically studies a series of data engineering strategies to improve training effectiveness, and also establishes an interactive environment connecting both mobile and PC devices. In training, UItron first applies supervised finetuning over perception and planning tasks in various GUI scenarios, and then develops a curriculum reinforcement learning framework to enable complex reasoning and exploration in online environments. As a result, UItron achieves superior performance on benchmarks of GUI perception, grounding, and planning. In particular, UItron emphasizes proficiency in interacting with top-tier Chinese mobile apps: we manually collect over one million steps of operation trajectories across the top 100 most popular apps, and build offline and online agent evaluation environments. Experimental results demonstrate that UItron achieves significant progress in Chinese app scenarios, bringing GUI agents one step closer to real-world applications.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7429