Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks

ACL ARR 2025 May Submission 197 Authors

08 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated promising capabilities for operating in mobile environments. However, current approaches face significant limitations: (1) they fall short in addressing real-world human needs, which involve complex, open-ended, and reasoning-intensive tasks; and (2) they lack mechanisms to learn and improve from prior experience. To address these challenges, we introduce Mobile-Agent-E, a hierarchical agentic framework capable of self-evolution through past experience. Mobile-Agent-E adopts a multi-level communication protocol for reasoning, perception, and error recovery, explicitly separating high-level planning from low-level action decisions. It also introduces a novel self-evolution module that maintains a persistent long-term memory comprising Tips and Shortcuts, enabling continual refinement of task performance and efficiency. To bridge the gap in existing benchmarks for complex, open-ended tasks, we further present a new benchmark, Mobile-Eval-E, alongside a new evaluation metric, the Satisfaction Score. Empirical results show that Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches across three LMM backbones. We also provide a comprehensive analysis of the impact of the self-evolution mechanism and outline promising directions for future work.
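The self-evolution module described in the abstract can be pictured as a persistent store of Tips (general guidance distilled from past tasks) and Shortcuts (reusable sequences of atomic actions) that is updated after each task and injected into later ones. The sketch below is a minimal, hypothetical illustration of such a store; all class, field, and action names are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch (assumed, not the authors' code) of a persistent long-term
# memory of Tips and Shortcuts that a self-evolving mobile agent could update
# after each completed task.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Shortcut:
    name: str                  # hypothetical identifier, e.g. "search_in_app"
    atomic_actions: List[str]  # ordered low-level actions, e.g. ["tap", "type", "enter"]
    precondition: str = ""     # when this shortcut is safe to apply


@dataclass
class LongTermMemory:
    tips: List[str] = field(default_factory=list)
    shortcuts: Dict[str, Shortcut] = field(default_factory=dict)

    def evolve(self, new_tips: List[str], new_shortcuts: List[Shortcut]) -> None:
        """Merge experience distilled from a finished task into persistent
        memory (simple string/name deduplication in this sketch)."""
        for tip in new_tips:
            if tip not in self.tips:
                self.tips.append(tip)
        for sc in new_shortcuts:
            self.shortcuts.setdefault(sc.name, sc)


# Usage: after a task, distilled tips and shortcuts are stored and can be
# surfaced to the planner on subsequent tasks.
memory = LongTermMemory()
memory.evolve(
    new_tips=["Dismiss pop-ups before reading search results."],
    new_shortcuts=[Shortcut("search_in_app", ["tap_search_bar", "type_query", "press_enter"])],
)
print(len(memory.tips), list(memory.shortcuts))
```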
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal application, multimodality
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 197