Foundations and Recent Trends in Multimodal Mobile Agents: A Survey

ACL ARR 2024 December Submission1023 Authors

15 Dec 2024 (modified: 05 Feb 2025) · CC BY 4.0
Abstract: Mobile agents are essential for automating tasks in complex and dynamic mobile environments. As foundation models evolve, they offer increasingly powerful capabilities for understanding and generating natural language, enabling real-time adaptation and the processing of multimodal data. This survey provides a comprehensive review of mobile agent technologies, with a focus on recent advancements driven by foundation models. Our analysis begins with representative works on mobile benchmarks and interactive environments, clarifying current research focuses and their limitations. We then introduce the core components of mobile agents and categorize recent advancements into two main approaches: prompt-based methods, which use large language models (LLMs) for instruction-based task execution, and training-based methods, which fine-tune multimodal models for mobile-specific applications. By discussing key challenges and outlining future research directions, this survey offers valuable insights for advancing mobile agent technologies.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Mobile Agents, Multimodal, Survey
Contribution Types: Surveys
Languages Studied: English
Submission Number: 1023
