Submission Track: Paper Track (up to 8 pages)
Keywords: computer use, browser use, agents, evaluations
TL;DR: WebGames is a suite of 150 web-based, simple-to-run, ground-truth-verifiable tasks for evaluating computer-using agents. We benchmark leading VLMs on it.
Abstract: We introduce WebGames, a comprehensive benchmark suite designed to evaluate general-purpose web-browsing AI agents through a collection of 150 interactive challenges.
These challenges assess AI agents' ability to interact with the web as humans do, using simple systems and fundamental browser tasks to evaluate five core domains: Technical Fluency, Real-Time Responsiveness, Adversarial Resistance, Cognitive Abilities, and Visual Comprehension.
Our framework eliminates dependence on external systems and provides verifiable ground-truth solutions, ensuring reproducible evaluation.
We evaluate leading vision-language models, including GPT-4o, Claude, Gemini-2.5, and Qwen2.5-VL, against human performance. Results reveal a substantial capability gap: the best AI system achieves only a 48% success rate, compared to 95.7% for humans, highlighting fundamental limitations in current AI systems' ability to handle common web interaction patterns that humans find intuitive.
The benchmark is publicly available at https://webgames.convergence.ai.
Supplementary Material: zip
Submission Number: 40