MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments

10 Sept 2025 (modified: 26 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: agent, mllm, memory, nlp
Abstract: Current mobile GUI agent benchmarks systematically fail to assess memory capabilities, with only 5.2-11.8\% of tasks being memory-related and no evaluation of cross-session learning. We introduce \textbf{MemGUI-Bench}, the most comprehensive memory-centric benchmark to date, evaluated with pass@k and a staged LLM-as-judge evaluator. Our contributions include: (1) a systematic memory taxonomy with an analysis of 11 prominent agents; (2) 128 tasks across 26 applications, 89.8\% of which challenge memory through cross-temporal and cross-spatial information retention; (3) \textbf{MemGUI-Eval}, an automated evaluation pipeline with a novel \textit{Progressive Scrutiny} protocol and 7 hierarchical metrics for memory fidelity and learning effectiveness; and (4) a comprehensive assessment revealing significant memory deficits across all evaluated agents. Our experiments expose 4-10× performance gaps between memory-intensive and standard tasks, demonstrate the potential of explicit long-term memory mechanisms, and identify 7 distinct failure modes through systematic analysis. MemGUI-Bench establishes crucial empirical baselines for developing more capable and human-like GUI agents. Code and results: \url{https://anonymous.4open.science/r/MemGUI-Bench-Anonymous}.
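The page does not define the paper's exact pass@k variant. Below is a minimal sketch, assuming the standard unbiased pass@k estimator (Chen et al., 2021) applied to n sampled episodes per task of which c succeed; the function name `pass_at_k` and the NumPy-based implementation are illustrative assumptions, not the authors' code.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    episodes drawn without replacement from n attempts succeeds, given
    that c of the n attempts succeeded."""
    if n - c < k:
        # Fewer than k failures exist, so every k-subset contains a success.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical example: 10 attempts per task, 3 successes.
print(pass_at_k(10, 3, 1))  # ≈ 0.300
print(pass_at_k(10, 3, 5))  # ≈ 0.917
```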
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3716