Keywords: computer use, GUI agent, task efficiency
Abstract: Driven by advances in Vision-Language Models (VLMs), computer-using agents have recently demonstrated remarkable capabilities in complex reasoning, software control, and the automation of digital workflows.
However, the existing step-by-step paradigm requires extensive interaction with the model, and the resulting query latency emerges as a key bottleneck for real-world adoption.
To address this limitation, we propose that agents should be able to output a sequence of actions after each observation, enabling efficient execution without constant model queries.
In this work, we introduce OS-Catalyst, a method that transforms standard computer-using models into agents capable of predicting action sequences.
To enable this, we design a data collection pipeline tailored for compressed action trajectories in computer-using environments.
Building on this pipeline, we construct a large-scale dataset within the WorkArena benchmark and train computer-using agents for action sequence prediction.
Through extensive experiments, we show that OS-Catalyst enables up to 50% faster task completion on office-related benchmarks without sacrificing success rate.
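The efficiency argument above can be illustrated with a toy sketch. It is not the authors' implementation: the function `count_model_queries`, the example `plan`, and the fixed `actions_per_query` chunk size are all hypothetical, chosen only to show how emitting several actions per observation reduces the number of model queries relative to a step-by-step loop.

```python
from typing import Sequence


def count_model_queries(plan: Sequence[str], actions_per_query: int) -> int:
    """Count model queries needed to execute `plan`.

    Assumption (illustrative only): after each observation the model emits
    up to `actions_per_query` actions, all of which execute successfully.
    A real agent would choose the sequence length itself and re-observe
    whenever the screen changes unexpectedly.
    """
    if actions_per_query < 1:
        raise ValueError("actions_per_query must be at least 1")

    queries = 0
    executed = 0
    while executed < len(plan):
        # One observation and one model query yield the next chunk of actions.
        chunk = plan[executed:executed + actions_per_query]
        queries += 1
        executed += len(chunk)
    return queries


# Hypothetical six-action office task.
plan = [
    "click(search_box)",
    "type('laptop')",
    "press('Enter')",
    "click(first_result)",
    "click(order_button)",
    "click(submit)",
]

step_by_step = count_model_queries(plan, actions_per_query=1)
sequence_prediction = count_model_queries(plan, actions_per_query=3)
print(step_by_step, sequence_prediction)
```

In this toy setting the step-by-step loop issues one query per action, while the sequence-predicting loop issues one query per chunk. The wall-clock saving in practice depends on per-query latency and on how often a predicted sequence has to be interrupted, neither of which this sketch models.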
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3176