MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments

10 Sept 2025 (modified: 26 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: agent, mllm, memory, nlp
Abstract: Current mobile GUI agent benchmarks systematically fail to assess memory capabilities, with only 5.2-11.8\% of tasks being memory-related and no evaluation of cross-session learning. We introduce \textbf{MemGUI-Bench}, the most comprehensive memory-centric benchmark to date, evaluated with pass@k and a staged LLM-as-judge evaluator. Our contributions include: (1) a systematic memory taxonomy with an analysis of 11 prominent agents; (2) 128 tasks across 26 applications, 89.8\% of which challenge memory through cross-temporal and cross-spatial information retention; (3) \textbf{MemGUI-Eval}, an automated evaluation pipeline with a novel \textit{Progressive Scrutiny} protocol and 7 hierarchical metrics for memory fidelity and learning effectiveness; and (4) a comprehensive assessment revealing significant memory deficits across all evaluated agents. Our experiments expose 4-10× performance gaps between memory-intensive and standard tasks, demonstrate the potential of explicit long-term memory mechanisms, and identify 7 distinct failure modes through systematic analysis. MemGUI-Bench establishes crucial empirical baselines for developing more capable and human-like GUI agents. Code and results: \url{https://anonymous.4open.science/r/MemGUI-Bench-Anonymous}.
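The page does not define the paper's exact pass@k variant. Below is a minimal sketch, assuming the standard unbiased pass@k estimator (Chen et al., 2021) applied to n sampled episodes per task of which c succeed; the function name `pass_at_k` and the NumPy-based implementation are illustrative assumptions, not the authors' code.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    episodes drawn without replacement from n attempts succeeds, given
    that c of the n attempts succeeded."""
    if n - c < k:
        # Fewer than k failures exist, so every k-subset contains a success.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical example: 10 attempts per task, 3 successes.
print(pass_at_k(10, 3, 1))  # ≈ 0.300
print(pass_at_k(10, 3, 5))  # ≈ 0.917
```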
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3716