Keywords: Multimodal Web Agents
Abstract: In recent years, the rapid advancement of multimodal large language models (MLLMs) has driven significant progress in developing autonomous agents capable of interacting with dynamic web environments. Multimodal web agents represent a promising research direction, leveraging MLLMs to process and integrate diverse inputs—such as text and images—to perform complex web-based tasks, including navigation, information retrieval, and plan execution. This paper provides a comprehensive survey of related work, encompassing MLLMs, web agent benchmarks, and multimodal web agent algorithms.
Specifically, we begin by outlining the foundational background of these agents, with a focus on MLLMs and web agents. We then introduce benchmark datasets and evaluation metrics designed to assess performance in real-world web interactions. Furthermore, we systematically review and summarize current multimodal web agent algorithms, categorizing them into prompting-based and learning-based approaches. We also discuss key techniques used in multimodal web agent development, such as grounding, trajectory data curation, multi-stage fine-tuning, and reinforcement learning. This survey aims to provide a comprehensive foundation for researchers and practitioners working at the intersection of MLLMs, human-computer interaction, and web agents.
Submission Number: 3