Track: long paper (up to 9 pages)
Keywords: web ai, code, web development, benchmark, visual question answering, reasoning
TL;DR: WebMMU benchmarks AI on website VQA, code editing, and sketch-to-code generation, exposing gaps in web reasoning and structured UI generation.
Abstract: Understanding diverse web data and automating web development present an exciting challenge for agentic AI. While existing benchmarks address isolated web-based tasks, such as website-based Visual Question Answering (VQA) and UI-to-code generation, they lack a unified evaluation suite for assessing web agents that interact with and reason about web environments. We introduce WebMMU, a large-scale benchmark for evaluating AI-driven web agents across multilingual website VQA, HTML/CSS/JavaScript code editing, and sketch-to-code generation. WebMMU provides a comprehensive evaluation suite with real-world website data, multi-step reasoning tasks, and functional UI understanding. Benchmarking state-of-the-art multimodal models on WebMMU reveals significant limitations in web-based reasoning, layout understanding, and structured code generation, particularly in preserving UI hierarchy, handling multilingual content, and producing robust, functional code. While most existing models are optimized for English-only settings, WebMMU highlights the challenges of cross-lingual adaptation in real-world web development. These findings expose critical gaps in current models’ ability to understand website structures, execute user instructions, and generate high-quality web code, underscoring the need for more advanced multimodal reasoning in AI-driven web understanding and development.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Presenter: ~Rabiul_Awal1
Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 59