Abstract: VLM-based mobile agents are increasingly popular for their ability to interact with smartphone GUIs and XML-structured text and to complete daily tasks. However, existing online benchmarks struggle with result replication because of dynamic environmental changes, while offline benchmarks, with their single-trajectory annotations, force agents to follow the annotator's preferences and cannot evaluate whether agents can complete a task through multiple valid paths. Furthermore, both types of benchmarks fail to assess whether agents can handle noise or engage in proactive interactions, since they contain no noisy screens and their instructions are overly complete. To address these limitations, we construct a more realistic and comprehensive multimodal offline benchmark, Mobile-Bench-v2, which includes a common task split with multi-path evaluation, a Noisy-APP split with pop-ups and ads, a contaminated split AITZ-Noise built on AITZ, and an ambiguous-instruction split with preset Q&A interactions. We evaluate agent frameworks built on large-scale VLMs on the common split using both single- and multi-path evaluation, and we assess supervised fine-tuned agents on AITZ-Noise. We also explore whether incorporating noise into the original training data can overcome in-domain ad contamination. The data will be released in the future.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation, agent
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English, Chinese
Submission Number: 5867