Keywords: Multimodal Web Agents
Abstract: In recent years, the rapid advancement of multimodal large language models (MLLMs) has driven significant progress in developing autonomous agents capable of interacting with dynamic web environments. Multimodal web agents represent a promising research direction, leveraging MLLMs to process and integrate diverse inputs—such as text and images—to perform complex web-based tasks, including navigation, information retrieval, and plan execution. This paper provides a comprehensive survey of related work, encompassing MLLMs, web agent benchmarks, and multimodal web agent algorithms.
Specifically, we begin by outlining the foundational background of these agents, with a focus on MLLMs and web agents. We then introduce benchmark datasets and evaluation metrics designed to assess performance in real-world web interactions. Furthermore, we systematically review and summarize current multimodal web agent algorithms, categorizing them into prompting-based and learning-based approaches. We also discuss key techniques used in multimodal web agent development, such as grounding, trajectory data curation, multi-stage fine-tuning, and reinforcement learning. This survey aims to provide a comprehensive foundation for researchers and practitioners working at the intersection of MLLMs, human-computer interaction, and web agents.
Submission Number: 3