CUA-Suite: Expert Trajectories and Pixel-Precise Grounding for Computer-use Agents

Published: 02 Mar 2026, Last Modified: 05 Mar 2026 · LLA 2026 Poster · CC BY 4.0
Keywords: Computer-use Agents, GUI Agents, Human Demonstration, Evaluation, Training Dataset
Abstract: Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet current models remain brittle on professional applications due to the scarcity of high-quality, richly annotated human demonstration data. We introduce CUA-Suite, a unified benchmark and training corpus that addresses the full stack of computer-use intelligence across 87 diverse desktop applications. The CUA-Suite ecosystem integrates two independent efforts: UI-Vision, a rigorous benchmark for evaluating element grounding, layout understanding, and action prediction; and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 5 million element annotations, used to train the GroundNext vision-language models. Together, these resources cover what agents should perceive and where they should act, but not how humans actually carry out multi-step workflows over time. To close this gap, we introduce ActCUA, the unifying component of the suite, which provides approximately 10,000 expert-demonstrated tasks with continuous video recordings, kinematic cursor traces, and multi-layered reasoning trajectory annotations averaging 497 words per step. Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks. By bridging static grounding, planning evaluation, and dynamic trajectory data within a single coherent resource, CUA-Suite provides dense, causal supervision in which every on-screen element is labeled and every action is logged. Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications.
Beyond benchmarking, CUA-Suite's rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, and visual world models. All data, benchmarks, and models are publicly released.
Submission Number: 169