TRACE: Coarse-to-Fine Automated Evaluation of Mobile Agents with Safety Considerations in Realistic Environments
Keywords: Mobile Agents, Benchmark, Vision Language Models
Abstract: The online evaluation of mobile agents is becoming increasingly important, both for accurately assessing agent capabilities and for providing reward signals for online reinforcement learning. Evaluating mobile agents on complex multi-step tasks remains challenging: existing work suffers from limitations in reliability and generality, while overlooking environmental realism and operational safety. This paper introduces TRACE (TRajectory-based Automated Coarse-to-fine Evaluation), a fully automated vision language model (VLM)-based method designed to evaluate arbitrary mobile agents across diverse environments. TRACE evaluates agent trajectories in two stages, first through step-wise assessment and then through overall judgment, which significantly reduces evaluation difficulty and enhances reliability. Potentially risky or harmful operations are also detected during the step-wise assessment. Furthermore, we construct TRACEBench, a scalable benchmark consisting of 187 tasks across 35 commonly used mobile applications, to better reflect the actual performance of agents in realistic online environments. Task design explicitly considers operational safety, and evaluation metrics cover three key dimensions: task completion, safety, and resource consumption. Experiments show that TRACE achieves an F1 score of 0.836 with the open-source Qwen2.5-VL-72B-Instruct, indicating high precision alongside strong usability and cost-effectiveness. Extensive evaluation of 8 representative mobile agents on TRACEBench reveals that current mobile agents still have substantial room for improvement, particularly in task completion and operational safety.
Primary Area: datasets and benchmarks
Submission Number: 3338