MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents

Published: 23 May 2026, Last Modified: 23 May 2026, ICML 2026 AIWILD, CC BY 4.0
Keywords: LLM Agents, Computer-Use Agents, Personalized Agents
Abstract: Current benchmarks for computer-use agents evaluate models in impersonal environments. This leaves a gap between evaluation and deployment, because personal assistants are expected to work across a user’s whole digital life, including their context, historical data, and logged-in accounts. The gap is widest on web tasks, where live-web evaluations cannot exercise sites that require logging in or personal information, the kind of site a real personal assistant has to drive. We introduce MyPCBench, which tests computer-use agents as personal assistants on a Linux desktop populated with 17 simulated real-world web applications and a full desktop stack, all seeded for one canonical persona, Michael Scott from The Office. We define 184 tasks in this environment, each inspired by a real request drawn from the OpenClaw community, and benchmark six frontier and open-weight models under each provider’s native CUA agent. Claude Opus 4.6 reaches 55.4%, the only model above 50%. Agent failures concentrate on tasks that span many applications and long trajectories, where personalization stresses an assistant the most. We release the environment, the task set with rubrics, the agent harness, and the rubric-grading judge at [TBD].
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 129