Keywords: Mobile GUI Agents, UI Security, Adversarial Attacks, AgentHazard, Empirical Evaluation
TL;DR: This study presents the first systematic investigation of mobile GUI agents' vulnerabilities to on-screen content manipulated by untrustworthy third parties.
Abstract: GUI agents are designed to autonomously execute diverse device-control tasks by interpreting and interacting with device screens.
Despite notable advancements, their resilience in real-world scenarios—where screen content may be
partially manipulated by untrustworthy third parties—remains largely unexplored.
In this work, we present the first systematic investigation into the vulnerabilities of mobile GUI agents.
We introduce a scalable attack simulation framework named AgentHazard,
which enables flexible and targeted modifications of screen content within existing applications.
Leveraging this framework, we develop a comprehensive benchmark suite comprising both a dynamic task execution environment
and a static dataset of state-rule pairs.
The dynamic environment encompasses 122 reproducible tasks in an emulator with various types of hazardous UI content,
while the static dataset consists of over 3,000 attack scenarios constructed from screenshots collected from a wide range of commercial apps.
Importantly, our content modifications are designed to be feasible for unprivileged third parties.
We perform experiments on 6 widely-used mobile GUI agents and 5 common backbone models using our benchmark.
Our findings reveal that all examined agents are significantly influenced by misleading third-party content
(with average misleading rates of 42.1% and 40.7% in the dynamic and static environments, respectively).
We also find that these vulnerabilities are closely linked to the agents' perception modalities and backbone LLMs.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 5419