OS-MAP: How Far Can Computer Use Agents Go in Breadth and Depth?

Published: 08 Jun 2025, Last Modified: 30 Jun 2025, WCUA 2025 Poster, CC BY 4.0
Submission Track: Paper Track (up to 8 pages)
Keywords: Benchmark; Computer Use Agents; Multimodal Agents; GUI Agents
TL;DR: OS-MAP is the first benchmark in realistic dynamic computer environments to comprehensively evaluate agents’ computer use capabilities across two orthogonal dimensions: performance and generality.
Abstract: Computer use agents have shown strong potential to boost human productivity and enable new application forms across platforms. While recent advances have led to usable applications, existing benchmarks fail to account for internal task heterogeneity and the corresponding agent capabilities, as well as their alignment with actual user demands, which hinders both targeted capability development and the reliable transition of research progress into practical deployment. To bridge this gap, we present OS-MAP, a benchmark for daily computer use automation consisting of 416 realistic tasks across 15 applications. To enable fine-grained analysis of required capabilities and alignment with real-world scenarios, OS-MAP evaluates agents along two dimensions: automation level across a five-level taxonomy, and generalization scope across a demand hierarchy. This design captures varying levels of required agent autonomy and generalization, forming a performance–generalization evaluation matrix for structured and comprehensive assessment. Experiments show that even the strongest agents struggle with higher-level tasks involving perception, reasoning, and coordination, highlighting the need for a deeper understanding of current strengths and limitations to drive future progress in computer use agent research and deployment. All code, environments, baselines, and data are publicly available at https://anonymous.4open.science/r/OSMap-C2F5/.
Submission Number: 17