Keywords: Benchmark, AI Agent, Smartphone Control
Abstract: Smartphone agents are increasingly important for helping users control devices efficiently, with (Multimodal) Large Language Model ((M)LLM)-based agents emerging as key contenders. Fairly comparing these agents is essential but challenging: it requires a diverse task scope, the integration of agents with different implementations, and a generalisable evaluation pipeline to assess their strengths and weaknesses. In this paper, we present SPA-BENCH, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM-based agents in an end-to-end setting. SPA-BENCH offers three key contributions: (1) a diverse set of tasks covering system and third-party apps in both English and Chinese, focusing on features used in daily routines; (2) a plug-and-play framework enabling real-time agent interaction with Android devices, integrating over ten agents with the flexibility to add more, regardless of their underlying models or how they interact with the environment; (3) a novel evaluation pipeline that assesses agent performance across multiple dimensions, using coarse-to-fine success detection alongside completion- and consumption-related metrics. Our extensive experiments across tasks and agents reveal challenges such as interpreting mobile user interfaces, action grounding, memory retention, and resource consumption. We propose future research directions to ease these difficulties, moving closer to real-world smartphone agent applications.
Submission Number: 69