ScaleTrack: Scaling and Backtracking Automated GUI Agents


ACL ARR 2026 January Submission2787 Authors

03 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, Readers: Everyone, License: CC BY 4.0

Keywords: Applications

Abstract: Automated GUI agents aim to streamline user interaction by automatically executing complex tasks across digital environments such as web, mobile, and desktop platforms. Given a textual task instruction and a GUI description, the agent generates a sequence of executable actions (e.g., clicks) and the corresponding operation steps. Training a GUI agent typically involves two stages: grounding, which locates the executable interface coordinates relevant to the given task, and planning, which predicts the next action from the history of previous actions. However, prior work suffers from insufficient training data for GUI grounding and overlooks the importance of backtracking historical behaviors in GUI planning. To address these challenges, we propose ScaleTrack, a training framework that integrates scalable grounding and backtracking planning for automated GUI agents. Specifically, we systematically collect GUI samples from a wide range of sources, each built with its own synthesis criteria, and unify them into a standardized template for training GUI grounding models. Moreover, ScaleTrack introduces a novel training strategy that predicts the next action from the current GUI image while simultaneously backtracking the historical actions that led to that image. This approach enables ScaleTrack to capture the correspondence between GUI states and actions and thereby model the dynamic evolution of the GUI environment. Extensive experiments on grounding tasks, as well as offline and online agent evaluations, demonstrate the effectiveness of ScaleTrack.
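To make the backtracking objective concrete, the sketch below turns one recorded trajectory into training samples whose input is the instruction plus the current screenshot and whose target is the next action followed by the earlier actions, most recent first. This is only an illustration under stated assumptions: the abstract does not specify ScaleTrack's data format, so the `Step` fields, the `build_backtrack_samples` helper, the "Next action:"/"Backtrack:" target layout, and the choice to leave history out of the input are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    # Hypothetical fields: an ID/path of the GUI image seen at this step,
    # and the serialized action the agent executed on that image.
    screenshot: str
    action: str


def build_backtrack_samples(instruction: str, trajectory: List[Step]) -> List[Dict[str, str]]:
    """Assumed sample format (not taken from the paper): for step t, the input is
    (instruction, screenshot_t) and the target is action_t followed by the
    backtracked history action_{t-1}, ..., action_0 (most recent first)."""
    samples = []
    for t, step in enumerate(trajectory):
        # Actions that led to the current GUI state, reversed so the most
        # recent one comes first ("backtracking" from the current screen).
        backtrack = [s.action for s in reversed(trajectory[:t])]
        history_text = " <- ".join(backtrack) if backtrack else "(none)"
        target = f"Next action: {step.action}\nBacktrack: {history_text}"
        samples.append({"instruction": instruction, "image": step.screenshot, "target": target})
    return samples


# Usage on a toy three-step trajectory.
trajectory = [
    Step("s0.png", "click(1, 2)"),
    Step("s1.png", "click(3, 4)"),
    Step("s2.png", "type('hi')"),
]
samples = build_backtrack_samples("Log in to the site", trajectory)
for sample in samples:
    print(sample["image"], "|", sample["target"].replace("\n", " | "))
```

Generating both parts in a single target is one plausible way to make the model relate a GUI state to the actions that produced it; the real framework may order, format, or weight these parts differently.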

Paper Type: Long

Research Area: AI/LLM Agents

Research Area Keywords: Dialogue and Interactive Systems

Contribution Types: Publicly available software and/or pre-trained models, Data analysis

Languages Studied: English

Submission Number: 2787
