PERCU: Benchmarking Multimodal Agents on Personalized Computer Use Tasks

ACL ARR 2026 January Submission 9667 Authors

06 Jan 2026 (modified: 20 Mar 2026)
License: CC BY 4.0
Keywords: Personalization, multimodal agents, computer use, benchmark
Abstract: Large language model (LLM) agents have demonstrated remarkable potential for automating digital workflows through multimodal planning and reasoning. However, providing truly personalized assistance in computer use remains a significant challenge: existing benchmarks predominantly treat agents as context-independent executors, operating under a homogeneity assumption that ignores users' diverse habits and procedural routines. To address this challenge, we introduce PERCU, a benchmark designed to evaluate the personalization capabilities of multimodal agents on computer use tasks. PERCU employs a dual-instruction paradigm in which agents must first ingest personalized knowledge from a semantically explicit instruction and subsequently draw on the resulting memory to resolve an ambiguous follow-up instruction. Extensive evaluation of several multimodal agents on PERCU reveals significant deficiencies in their ability to serve as personalized computer assistants. Further quantitative analysis on PERCU yields actionable insights for future research on personalized multimodal agents. Our code and data will be available at https://github.com/scm62519/PERCU.
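To make the dual-instruction paradigm concrete, the minimal Python sketch below separates the two phases the abstract describes: an explicit first instruction whose personalized knowledge is written to memory, and an ambiguous second instruction whose correct resolution depends on that memory. All names here (`ToyAgent`, `ingest`, `act`, the example preference) are hypothetical placeholders for illustration; they are not the PERCU API or data format.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    """A toy agent that stores user-specific preferences and reuses them later."""
    memory: dict = field(default_factory=dict)

    def ingest(self, topic: str, habit: str) -> None:
        # Phase 1: a semantically explicit instruction carries personalized
        # knowledge (e.g., "When saving spreadsheets, always export as CSV"),
        # which the agent is expected to persist.
        self.memory[topic] = habit

    def act(self, instruction: str) -> str:
        # Phase 2: an ambiguous instruction must be resolved against memory.
        # A real agent would plan multimodal actions; here a keyword match
        # stands in for that step.
        for topic, habit in self.memory.items():
            if topic in instruction.lower():
                return f"executing '{instruction}' with preference: {habit}"
        return f"executing '{instruction}' with defaults (personalization missed)"


agent = ToyAgent()
# First instruction (explicit): injects a user habit into memory.
agent.ingest("spreadsheet", "export as CSV")
# Second instruction (ambiguous): correct behavior depends on the stored habit.
print(agent.act("Save the spreadsheet from today's meeting"))
# -> executing 'Save the spreadsheet from today's meeting' with preference: export as CSV
```

A real evaluation harness would replace the keyword match with an actual multimodal agent and score whether the second-phase action honors the stored preference; the sketch only illustrates the two-phase contract.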
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: LLM agents, multi-modal agents, agent evaluation
Contribution Types: Data resources
Languages Studied: English
Submission Number: 9667