OS-Catalyst: Advancing Computer-Using Agents Efficiency through Adaptive Action Compression

08 Sept 2025 (modified: 06 Jan 2026), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0
Keywords: computer use, GUI agent, task efficiency
Abstract: Driven by advances in Vision-Language Models (VLMs), computer-using agents have recently demonstrated remarkable capabilities in complex reasoning, software control, and the automation of digital workflows. However, the existing step-by-step paradigm queries the model once per action, and the resulting query latency has emerged as a key bottleneck for real-world adoption. To address this limitation, we propose that agents should be able to output a sequence of actions after each observation, enabling efficient execution without constant model queries. In this work, we introduce OS-Catalyst, a method that transforms standard computer-using models into agents capable of action sequence prediction. To enable this, we design a data collection pipeline tailored for compressed action trajectories in computer-using environments. Building on this pipeline, we construct a large-scale dataset within the WorkArena benchmark and train computer-using agents for action sequence prediction. Through extensive experiments, we show that OS-Catalyst enables up to 50% faster task completion on office-related benchmarks without sacrificing success rate.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3176
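
Illustrative sketch (not from the paper): the abstract contrasts a step-by-step paradigm, where the model is queried once per action, with predicting a sequence of actions after each observation. The toy Python example below only counts simulated model queries under each scheme. The task, the action strings, both policy functions, and the fixed chunk size of 3 are hypothetical stand-ins; the paper describes the compression as adaptive, which this sketch does not attempt to model.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical toy task: the actions needed to fill in and submit a form.
TASK: List[str] = [
    "click(name_field)",
    "type('Alice')",
    "press('Tab')",
    "type('alice@example.com')",
    "click(submit)",
]

# A policy maps the current progress (a stand-in for an observation)
# to the next actions to execute.
Policy = Callable[[int], Sequence[str]]


def step_policy(progress: int) -> Sequence[str]:
    """Step-by-step baseline: exactly one action per model query."""
    return TASK[progress:progress + 1]


def sequence_policy(progress: int, chunk: int = 3) -> Sequence[str]:
    """Sequence prediction: up to `chunk` actions per model query.

    The fixed chunk size is an assumption made for illustration only.
    """
    return TASK[progress:progress + chunk]


def run(policy: Policy) -> Tuple[int, List[str]]:
    """Execute the task and return (number of model queries, executed actions)."""
    progress, queries = 0, 0
    executed: List[str] = []
    while progress < len(TASK):
        actions = policy(progress)  # one simulated model query
        queries += 1
        for action in actions:      # executed without re-querying the model
            executed.append(action)
            progress += 1
    return queries, executed


if __name__ == "__main__":
    print("step-by-step queries:", run(step_policy)[0])
    print("sequence queries:", run(sequence_policy)[0])
```

Both runs execute the same five actions; only the number of model queries differs, which is where a latency saving of the kind described in the abstract would come from.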