Submission Track: Paper Track (up to 8 pages)
Keywords: computer use, browser use, agents, evaluations
TL;DR: WebGames is a suite of 150 web-based, simple-to-run, ground-truth-verifiable tasks for evaluating computer-using agents. We benchmark leading VLMs on it.
Abstract: We introduce WebGames, a comprehensive benchmark suite designed to evaluate general-purpose web-browsing AI agents through a collection of 150 interactive challenges.
These challenges assess AI agents' ability to interact with the web as humans do, using simple systems and fundamental browser tasks to evaluate five core domains: Technical Fluency, Real-Time Responsiveness, Adversarial Resistance, Cognitive Abilities, and Visual Comprehension.
Our framework eliminates dependence on external systems and provides verifiable ground-truth solutions, ensuring reproducible evaluation.
We evaluate leading vision-language models, including GPT-4o, Claude, Gemini-2.5, and Qwen2.5-VL, against human performance. Results reveal a substantial capability gap: the best AI system achieves only a 48% success rate, compared to 95.7% for humans, highlighting fundamental limitations in current AI systems' ability to handle common web interaction patterns that humans find intuitive.
The benchmark is publicly available at https://webgames.convergence.ai.
Supplementary Material: zip
Submission Number: 40