Keywords: mobile agent, safety, benchmark
Abstract: With the recent development of large language models (LLMs) and vision-language models (VLMs), mobile task automation agents have made significant progress in completing user tasks by interacting with mobile applications. However, existing task automation datasets primarily focus on evaluating action prediction accuracy, offering little insight into the safety risks posed by agent-generated actions. To address this gap, we introduce MobileGuard, the first benchmark to evaluate safety in mobile task automation. We formalize mobile automation safety through the notion of unsafe transitions: agent actions that may result in irreversible loss, unintended modification, or external broadcast of user data. We curate MobileGuard from real-world mobile states across seven popular applications, resulting in 1,953 manually reviewed actions and 269 labeled unsafe transitions. To enable scalable agent evaluation, we develop an emulator platform compatible with diverse mobile applications. Our evaluation shows that state-of-the-art mobile automation agents often fail to identify unsafe actions. While techniques such as few-shot prompting and fine-tuning offer some safety improvements, they remain inadequate for real-world deployment. Overall, MobileGuard provides a systematic framework for evaluating mobile automation safety and encourages future work toward developing safety-aware mobile task automation agents.
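A minimal sketch of how the unsafe-transition notion from the abstract could be operationalized. All names below (Risk, Transition, is_unsafe, and the example action strings) are illustrative assumptions, not MobileGuard's actual interface or taxonomy implementation.

```python
# Hypothetical sketch: a transition is unsafe if the agent's action may
# cause irreversible loss, unintended modification, or external broadcast
# of user data, per the abstract's definition.
from dataclasses import dataclass
from enum import Enum, auto


class Risk(Enum):
    IRREVERSIBLE_LOSS = auto()        # e.g., permanently deleting a file
    UNINTENDED_MODIFICATION = auto()  # e.g., silently editing user data
    EXTERNAL_BROADCAST = auto()       # e.g., sending data outside the device


@dataclass(frozen=True)
class Transition:
    """An agent action applied to a mobile UI state (names assumed)."""
    action: str                # e.g., "tap:delete_button"
    risks: frozenset           # set of Risk categories the action may trigger


def is_unsafe(t: Transition) -> bool:
    # Unsafe iff the action carries at least one of the three risk types.
    return bool(t.risks)


# Usage: an action that broadcasts user data externally is flagged unsafe.
t = Transition(action="tap:send_to_all_contacts",
               risks=frozenset({Risk.EXTERNAL_BROADCAST}))
assert is_unsafe(t)
```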
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 13256