Submission Track: Paper Track (up to 8 pages)
Keywords: Computer-use agents, LLM Benchmarks, ML Systems
TL;DR: We introduce a benchmark that measures the efficiency of computer-use agents on OSWorld.
Abstract: Generative AI is being leveraged to solve a variety of computer-use tasks involving desktop applications. State-of-the-art systems have focused solely on improving accuracy on leading benchmarks. However, these systems are practically unusable due to extremely high end-to-end latency (e.g., tens of minutes) for tasks that typically take humans just a few minutes to complete. To understand the causes of this latency and to guide future development of computer-use agents, we conduct the first study of the temporal performance of computer-use agents on OSWorld, the flagship benchmark in computer-use AI. We find that large model calls for planning and reflection account for most of the overall latency, and that as an agent uses more steps to complete a task, each successive step can take 3x longer than steps at the beginning of the task. We then construct **OSWorld-Human**, a manually annotated version of the original OSWorld dataset that contains a human-determined trajectory for each task. We evaluate 16 agents on their efficiency using **OSWorld-Human** and find that even the best agents take 1.4-2.7x more steps than necessary.
Camera Ready Modification Summary: Added open-source link, validated numbers in tables, and clarified notation for formulas
Submission Number: 45