CAP: A Scalable Benchmark for Evaluating Cross-Site Browser Agents with Complex Actions and Perception
Keywords: Browser Agents, Agent-as-a-Judge, Benchmark
TL;DR: CAP: A Scalable Benchmark for Evaluating Cross-Site Browser Agents with Complex Actions and Perception
Abstract: Large language models are increasingly deployed as autonomous agents that interact with the web through browsers. While recent progress has been driven by benchmarks that evaluate end-to-end task success, these evaluations largely overlook two fundamental sources of difficulty in real web browsing: complex actions over rich user interfaces and visual perception of dynamically rendered content, especially in workflows that span multiple websites. We introduce CAP, a scalable benchmark for evaluating browser agents on cross-site, human-like web tasks that require non-trivial UI interactions and visual understanding. Specifically, we adopt a decomposition-recomposition pipeline that first abstracts each website into a structured site card, capturing user-facing functions, complex execution operations, and perceptual requirements, and then recomposes these components into realistic cross-site workflows. Each task is therefore grounded in specific operations on each website, enabling fine-grained diagnosis. Building on this framework, we construct 420 tasks across 108 real-world websites and 24 domains, with careful quality control. Experiments on state-of-the-art browser agents, evaluated with our newly proposed verifiable agent-as-a-judge framework, show low success rates and reveal that perception-heavy interactions remain a major bottleneck, highlighting substantial gaps between current agents and the demands of real-world web browsing.
Submission Number: 155