All in One: Unified Pretraining of GUI Agents via Masked Trajectory Prediction

ICLR 2026 Conference Submission6738 Authors

16 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: GUI Agent
Abstract: Graphical User Interface (GUI) agents are intelligent systems that interact with software applications by perceiving visual elements and taking appropriate actions. Existing studies typically explore a wide range of pretraining strategies over heterogeneous corpora and directly unify these tasks through mixture training to enhance the generalization of GUI agents. However, directly unifying existing pretraining strategies leads to inconsistent training objectives and data heterogeneity, preventing the full potential of each pretraining task from being realized. In this paper, we present a unified framework, \textbf{M}asked \textbf{T}rajectory \textbf{P}rediction (MTP), which consolidates diverse pretraining strategies into a single consistent training objective through component masking. Specifically, we collect open-source GUI corpora that vary widely in logical and semantic coherence, including randomly generated action–screenshot pairs, GUI tutorial data, and human-annotated datasets. MTP then models each multi-step GUI interaction as a trajectory and defines pretraining objectives by masking trajectory components and predicting them. Furthermore, to handle the heterogeneity across open-source corpora, we design a role-aware adapter learning module that dynamically routes each token to an appropriate optimization path. Extensive experiments on four representative GUI navigation benchmarks (AndroidControl, GUI-Odyssey, AITZ, and Mind2Web) demonstrate the effectiveness and generalization ability of our framework. By unifying existing pretraining objectives, MTP significantly outperforms prior methods and achieves state-of-the-art results. The code and dataset will be publicly released.
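The masking-and-prediction objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the trajectory representation, component names, and mask token below are all assumptions chosen for illustration.

```python
import random

MASK = "<MASK>"  # hypothetical mask token

def make_mtp_example(trajectory, mask_index=None, rng=random):
    """Form one pretraining example by masking a single trajectory
    component (a screenshot or an action) and returning it as the
    prediction target, in the spirit of Masked Trajectory Prediction."""
    if mask_index is None:
        mask_index = rng.randrange(len(trajectory))
    target = trajectory[mask_index]
    masked = list(trajectory)
    masked[mask_index] = MASK
    return masked, target

# Toy trajectory of interleaved screenshots and actions (illustrative).
traj = ["screen_0", "CLICK(settings)", "screen_1", "SCROLL(down)", "screen_2"]
inp, tgt = make_mtp_example(traj, mask_index=3)
# inp -> ["screen_0", "CLICK(settings)", "screen_1", "<MASK>", "screen_2"]
# tgt -> "SCROLL(down)"
```

Masking an action yields an action-prediction task, while masking a screenshot yields a state-prediction task, which is how a single masking objective can subsume multiple pretraining strategies.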
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6738