WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks

Atsuyuki Miyai; Zaiying Zhao; Kazuki Egashira; Atsuki Sato; Tatsumi Sunada; Shota Onohara; Hiromasa Yamanishi; Mashiro Toyooka; Kunato Nishina; Ryoma Maeda; Kiyoharu Aizawa; Toshihiko Yamasaki

13 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0

Keywords: benchmark, web browsing agent

TL;DR: We propose WebChoreArena, a benchmark of 532 complex and tedious web tasks. State-of-the-art LLM agents show notable performance drops relative to WebArena, highlighting their limitations beyond general browsing.

Abstract: Powered by a large language model (LLM), a web browsing agent operates web browsers in a human-like manner and offers a highly transparent path toward automating a wide range of everyday tasks. As web agents become increasingly capable and demonstrate proficiency in general browsing tasks, a critical question emerges: $\textit{Can they go beyond general browsing to robustly handle tasks that are tedious and complex, or chores that humans often avoid doing themselves?}$ In this paper, we introduce $\textbf{WebChoreArena}$, a new fully reproducible benchmark comprising 532 tasks carefully curated over 300+ hours, designed to cover more labor-intensive and tedious tasks. WebChoreArena systematically integrates three key challenges: (i) $\textbf{Massive Memory}$ tasks requiring accurate retrieval of large amounts of information from the observations, (ii) $\textbf{Calculation}$ tasks demanding precise mathematical reasoning, and (iii) $\textbf{Long-Term Memory}$ tasks requiring information to be retained across multiple webpages. Built on top of the four fully reproducible and widely adopted WebArena environments, WebChoreArena ensures strict reproducibility and enables fair, direct comparisons with the established WebArena benchmark, offering key insights into agent progress. Our experimental results demonstrate that performance on WebChoreArena improves significantly as LLMs evolve, suggesting that WebChoreArena is well suited to measuring the advancement of state-of-the-art LLMs with greater clarity. Nevertheless, the results also indicate that even with GPT-5, there remains substantial room for improvement compared to WebArena, highlighting the increased challenges posed by WebChoreArena.

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 4726
