PPTArena: A Benchmark for Computer-Use Agents on PowerPoint Tasks

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: computer-use agents, benchmark, rubric, llm-as-a-judge, slide editing, powerpoint
TL;DR: We introduce PPTArena, a benchmark featuring 120 diverse PowerPoint tasks and robust grading scripts for evaluating computer-use agents, spanning both presentation editing and content creation scenarios.
Abstract: Creating and editing slides is a rich, multimodal activity that is ubiquitous in professional and educational settings, making it an ideal testbed for real-world computer-use agents. Microsoft PowerPoint is among the most widely adopted and feature-rich environments for presentation creation. We introduce PPTArena, a benchmark of 120 diverse PowerPoint tasks across 12 files that cover both content creation and presentation editing scenarios, organized by difficulty. A central challenge in this domain is evaluation: tasks are complex, multimodal, and often admit many valid solutions. Moreover, today’s agents frequently make only partial progress, which binary success metrics fail to capture. To address this, we design an evaluation framework that helps create task-specific rubrics for PowerPoint tasks, building on prior work on rubric-based evaluation. These rubrics award partial credit for intermediate steps, penalize unnecessary changes and poor aesthetics, and provide natural language feedback. Rubric scores agree closely with human judgments, achieving a Kendall's $\tau_b$ correlation of 0.77. We find that existing frontier agents struggle: leading proprietary models such as Claude-Sonnet-4 achieve only a 43\% success rate and an average partial score of 60\%. We release PPTArena together with this evaluation framework, along with an analysis of common failure modes and insights for rubric design.
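For readers unfamiliar with the agreement statistic quoted in the abstract, the following is a minimal sketch of how Kendall's $\tau_b$ between rubric scores and human ratings can be computed. It is an illustration only: the function name `kendall_tau_b` and the example scores are hypothetical, and the abstract does not say how the authors computed their value.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation between two equal-length score lists.

    tau_b = (C - D) / sqrt((C + D + Tx) * (C + D + Ty)), where C and D count
    concordant and discordant pairs, Tx counts pairs tied only in x, and Ty
    counts pairs tied only in y. Pairs tied in both lists are ignored.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes nothing
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1

    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    if denom == 0:
        return float("nan")  # undefined when one list is constant
    return (concordant - discordant) / denom


# Hypothetical data, NOT taken from the paper: rubric scores in [0, 1]
# and human ratings on an ordinal scale for four agent trajectories.
rubric_scores = [0.2, 0.5, 0.5, 0.9]
human_ratings = [1, 2, 3, 3]
tau = kendall_tau_b(rubric_scores, human_ratings)
```

The tau-b variant is a natural fit when scores contain ties, as partial-credit rubric scores and ordinal human ratings often do, because tied pairs are removed from the normalization rather than counted as disagreements.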
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 15519