Keywords: Computer-use Agents, GUI Agents, Human Demonstration, Evaluation, Training Dataset
Abstract: Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equivalent to less than 20 hours of video. To address this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations averaging 497 words per step. In total, VideoCUA comprises approximately 55 hours and 6 million frames of expert video, more than 2.5 times the size of the largest existing open dataset. Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks. CUA-Suite further provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.5 million UI element annotations. Together, these resources provide dense, causal supervision in which every on-screen element is labeled and every action is logged. Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (a ~60% task failure rate). Beyond benchmarking, CUA-Suite's rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models. All data and models will be publicly released.
Submission Number: 169