Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: A scalable, efficient, and fast open-source framework for testing multi-modal AI agents on a PC within a full Windows OS environment
Abstract: Large language models (LLMs) show potential as computer agents, enhancing productivity and software accessibility in multi-modal tasks. However, measuring agent performance in sufficiently realistic and complex environments becomes increasingly challenging as: (i) most benchmarks are limited to specific modalities/domains (e.g., text-only, web navigation, Q&A) and (ii) full benchmark evaluations are slow (on the order of multiple hours or days) given the multi-step sequential nature of tasks. To address these challenges, we introduce Windows Agent Arena: a general environment focusing exclusively on the Windows operating system (OS) where agents can operate freely within a real OS, using the same applications and tools available to human users when performing tasks. We create 150+ diverse tasks across representative domains that require agentic abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized for a full benchmark evaluation in as little as $20$ minutes. Our work not only speeds up the development and evaluation cycle of multi-modal agents, but also highlights and analyzes existing shortfalls in the agentic abilities of several multi-modal LLMs as agents within the Windows computing environment---with the best achieving only a 19.5\% success rate compared to a human success rate of 74.5\%.
Lay Summary: Large multi-modal language models that understand text, images, and more are becoming capable digital assistants and agents, helping us accomplish complex computer tasks on their own. However, effectively measuring how well these AI agents perform realistic tasks is challenging because traditional benchmarks can be complex and slow, often taking hours or more to provide meaningful results. This slow evaluation significantly delays progress in AI development and agent improvements. Moreover, despite the widespread popularity and extensive use of the Windows operating system (OS), there is no agentic benchmark designed specifically for the Windows OS. To solve this critical bottleneck, we introduce the Windows Agent Arena, a highly efficient and scalable evaluation framework where state-of-the-art multi-modal AI agents perform tasks using the same software and tools humans rely on daily. Our approach dramatically accelerates evaluation, enabling rapid feedback and quicker improvements in AI capabilities. Even the best current AI agents fall far short of the 74.5% success rate achieved by humans, highlighting significant room for advancement.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/microsoft/WindowsAgentArena
Primary Area: Applications
Keywords: agents, benchmark, computer agents, AI agents, multimodal agents, large language models
Submission Number: 14807