Keywords: mobile GUI agent, benchmark, reward model
Abstract: The advancement of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has catalyzed the development of autonomous AI agents. Mobile graphical user interface (GUI) agents, designed to perform tasks on mobile devices, represent a promising application of this technology. However, a significant gap persists in mobile GUI agent evaluation: existing benchmarks predominantly rely on either static frame assessments, such as AndroidControl, or offline static apps, such as AndroidWorld, and thus fail to capture agent performance on dynamic, real-world online mobile applications. To address this gap, we present Android Agent Arena (A3), a novel evaluation system for mobile GUI agents. Unlike existing dynamic evaluation systems, A3 introduces a benchmark of 100 tasks derived from 20 widely-used online apps spanning 20 distinct categories from the Google Play Store, ensuring comprehensive evaluation coverage. A3 also presents a novel "essential-state"-based evaluation method that leverages MLLMs (either commercial or open-source models) as reward models to progressively verify both task completion and intermediate progress. This automated evaluation approach significantly reduces the reliance on manual labor and coding expertise compared with traditional evaluation methods such as those used in AndroidWorld. Furthermore, A3 includes a toolkit and an evaluator that streamline Android device interaction and facilitate data collection from both human and agent demonstrations. The complete A3 system, including the benchmark and pipeline, will be publicly released to provide a robust foundation for future research and development in mobile GUI agents.
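To make the "essential-state" evaluation idea concrete, the following is a minimal sketch (not the authors' released code) of how an MLLM reward model could progressively verify task progress: after each agent step, the next unverified essential state is checked against the current screenshot. All names here (`EssentialState`, `query_mllm`, `check_progress`) are hypothetical placeholders, not the A3 API.

```python
# Hypothetical sketch of essential-state checking with an MLLM as reward model.
import base64
from dataclasses import dataclass


@dataclass
class EssentialState:
    # Natural-language condition, e.g. "the search results page for 'hotels' is shown".
    description: str
    reached: bool = False


def query_mllm(prompt: str, image_b64: str) -> str:
    """Placeholder for a call to a commercial or open-source MLLM."""
    raise NotImplementedError


def check_progress(states: list[EssentialState], screenshot_path: str) -> float:
    """Verify essential states in order; return the fraction reached so far."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    for state in states:
        if state.reached:
            continue
        answer = query_mllm(
            "Does this screenshot show that the following condition holds? "
            f"Answer YES or NO.\nCondition: {state.description}",
            image_b64,
        )
        if answer.strip().upper().startswith("YES"):
            state.reached = True
        else:
            break  # states are verified progressively, in task order
    return sum(s.reached for s in states) / len(states)
```

A fully reached state list would correspond to task completion, while the fraction of reached states provides the process-level score described in the abstract.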
Primary Area: datasets and benchmarks
Submission Number: 8635