MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents

08 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: GUI Agent, Benchmark, Computer Use
TL;DR: A hierarchical benchmark for evaluating GUI automation agents across six platforms.
Abstract: We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web. The benchmark spans four levels: Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents. To assess both effectiveness and efficiency, we further propose the Efficiency–Quality-Aware (EQA) metric, which measures task success alongside action redundancy. Extensive evaluations reveal that precise visual grounding is the critical determinant of performance, underscoring the advantages of modular designs with specialized grounding modules. Moreover, all agents suffer from substantial inefficiencies, frequently completing tasks with excessive steps despite eventual success. Performance also degrades on complex or cross-application tasks, exposing weaknesses in memory, planning, and adaptive reasoning. By providing broad coverage, standardized protocols, and novel metrics, MMBench-GUI establishes the first comprehensive foundation for advancing GUI agent research.
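The abstract describes the Efficiency–Quality-Aware (EQA) metric only at a high level: it combines task success with a penalty for action redundancy. As an illustration of how such a metric might be computed, here is a minimal sketch, assuming a reference step count (e.g. from a human or oracle trajectory) is available; the function name, signature, and formula are hypothetical and not the paper's exact definition.

```python
def eqa_score(success: bool, steps_taken: int, reference_steps: int) -> float:
    """Illustrative efficiency-quality-aware score (hypothetical sketch).

    A failed task scores 0. A successful task is discounted by the
    ratio of the reference step count to the agent's actual step
    count, so redundant actions lower the score. Capped at 1.0 so an
    agent that beats the reference is not rewarded beyond a perfect
    score.
    """
    if not success or steps_taken <= 0:
        return 0.0
    return min(1.0, reference_steps / steps_taken)
```

For example, an agent that succeeds but uses twice as many steps as the reference would score 0.5 under this formulation, while a failure scores 0 regardless of efficiency.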
Primary Area: datasets and benchmarks
Submission Number: 3113