Abstract: VLM-based mobile agents are increasingly popular for their ability to interact with smartphone GUIs and XML-structured text and to complete daily tasks. However, existing online benchmarks struggle with result replication because of dynamic environmental changes, while offline benchmarks, with their single-trajectory annotations, force agents to follow the annotator's preferences and cannot evaluate whether agents can complete a task through multiple valid paths. Furthermore, both types of benchmarks fail to assess whether agents can handle noise or engage in proactive interactions, since they contain no noisy screens and their instructions are overly complete. To address these limitations, we construct a more realistic and comprehensive multimodal offline benchmark, Mobile-Bench-v2, which includes a common task split with multi-path evaluation, a Noisy-APP split with pop-ups and ads, a contaminated split AITZ-Noise built on AITZ, and an ambiguous-instruction split with preset Q&A interactions. We evaluate agent frameworks built on large-scale VLMs on the common split using both single- and multi-path evaluation, and we assess supervised fine-tuned agents on AITZ-Noise. We also explore whether incorporating noise into the original training data can overcome in-domain ad contamination. The data will be released in the future.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation, agent
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English, Chinese
Submission Number: 5867