WebCanvas: Benchmarking Web Agents in Online Environments

Published: 18 Jun 2024, Last Modified: 16 Jul 2024, Agentic Markets @ ICML'24 Poster, CC BY 4.0
Keywords: web automation, benchmark, LLM, language-guided agents
TL;DR: We introduce WebCanvas, an online evaluation framework for web agents designed to address the dynamic nature of web interactions.
Abstract: For web agents to be practically useful, they need to generalize to the ever-changing web environment (UI updates, page content updates, and so on). Unfortunately, most traditional benchmarks only capture a static snapshot of a web page. We introduce WebCanvas, an online evaluation framework for web agents designed to address the dynamic nature of web interactions. WebCanvas contains three main components supporting realistic assessment: (1) a key-node-based evaluation metric, which stably captures the critical actions or states necessary for task completion while disregarding noise caused by insignificant events or changed web elements; (2) a benchmark dataset called Mind2Web-Live, a refined version of the original static Mind2Web dataset, containing 542 tasks with 2,439 intermediate evaluation states; (3) lightweight and generalizable annotation tools and testing pipelines, which allow us to maintain a high-quality, up-to-date dataset and automatically detect shifts in live action sequences. Despite these advancements, the best-performing model achieves only a 23.1% task success rate, highlighting substantial room for improvement in future work.
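To make the key-node idea concrete, below is a minimal sketch of how such a metric could be scored, assuming key nodes are represented as predicates over observed agent steps. All names and fields here are illustrative, not the WebCanvas API: each step in an agent's trajectory is checked against the next unmatched key node, so incidental events (scrolling, re-rendered elements) simply fail to match and do not affect the score.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class KeyNode:
    """A critical action or state the agent must reach, e.g. 'search query typed'."""
    description: str
    # Predicate over an observed step (action plus resulting page state).
    matches: Callable[[Dict], bool]

def score_trajectory(trajectory: List[Dict], key_nodes: List[KeyNode]) -> Dict:
    """Score a trajectory by how many key nodes it passes through, in order.

    Steps that match no key node are treated as noise and ignored,
    mirroring the abstract's notion of disregarding insignificant events.
    """
    idx = 0  # index of the next key node still to be satisfied
    for step in trajectory:
        if idx < len(key_nodes) and key_nodes[idx].matches(step):
            idx += 1
    return {
        "completed_nodes": idx,
        "total_nodes": len(key_nodes),
        "task_success": idx == len(key_nodes),  # all key nodes reached in order
    }

# Hypothetical two-node task: type a query, then reach the results page.
nodes = [
    KeyNode("query typed", lambda s: s.get("action") == "type" and "laptop" in s.get("value", "")),
    KeyNode("results page", lambda s: "/search" in s.get("url", "")),
]
trajectory = [
    {"action": "click", "url": "https://example.com"},                    # noise, ignored
    {"action": "type", "value": "laptop", "url": "https://example.com"},  # matches node 1
    {"action": "click", "url": "https://example.com/search?q=laptop"},    # matches node 2
]
print(score_trajectory(trajectory, nodes))
# -> {'completed_nodes': 2, 'total_nodes': 2, 'task_success': True}

Because each key node is a predicate rather than a pinned DOM element, the score can tolerate cosmetic UI changes as long as the predicate still identifies the critical state.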
Submission Number: 37