Foundations and Recent Trends in Multimodal Mobile Agents: A Survey

ACL ARR 2024 December Submission1023 Authors

15 Dec 2024 (modified: 05 Feb 2025) · CC BY 4.0
Abstract: Mobile agents are essential for automating tasks in complex and dynamic mobile environments. As foundation models evolve, they offer increasingly powerful capabilities for understanding and generating natural language, enabling real-time adaptation and the processing of multimodal data. This survey provides a comprehensive review of mobile agent technologies, with a focus on recent advancements driven by foundation models. Our analysis begins with representative works on mobile benchmarks and interactive environments, clarifying current research focuses and their limitations. We then introduce the core components of mobile agents and categorize recent advancements into two main approaches: prompt-based methods, which use large language models (LLMs) for instruction-based task execution, and training-based methods, which fine-tune multimodal models for mobile-specific applications. By discussing key challenges and outlining future research directions, this survey offers valuable insights for advancing mobile agent technologies.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Mobile Agents, Multimodal, Survey
Contribution Types: Surveys
Languages Studied: English
Submission Number: 1023
