Keywords: Mobile GUI Agent, Dataset, Benchmark
TL;DR: We introduce GAMBIT, a graph-structured benchmark for decision-aware mobile GUI agents that reveals severe performance drops on long-horizon and branching tasks, providing a challenging diagnostic testbed for future agent development.
Abstract: Mobile GUI agents powered by large multimodal models (LMMs) can perceive screens and follow instructions, yet existing benchmarks largely target short, linear workflows and step-level accuracy, offering limited insight into long-horizon planning and branching tasks. We present GAMBIT, a graph-structured, decision-aware benchmark comprising 830 task episodes and 11,345 actions across 35 applications on Android and iOS. Tasks are organized into Sequential, Conjunctive, Conditional, and Hierarchical workflows with dual-level annotations, capturing realistic multi-step and branching scenarios. To move beyond step-level metrics, we introduce a weighted longest common subsequence (WLCS) metric for length-sensitive progress measurement and a decision-accuracy metric for branch correctness. Evaluations of 7 diverse agents show that GAMBIT induces a substantial accuracy drop compared to prior datasets, with success rates falling below 5% on 6–8 step tasks and branch accuracy averaging 38%, underscoring weaknesses in conditional reasoning. By systematically exposing these failure modes, GAMBIT provides a challenging, diagnostic testbed for advancing decision-aware mobile GUI agents. Our code and dataset are available at: https://anonymous.4open.science/r/GAMBIT-40BB/.
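The abstract names a weighted longest common subsequence as the length-sensitive progress metric. As a rough illustration of the idea, here is a minimal sketch of one plausible formulation: LCS over action sequences with per-action weights normalized by the reference's total weight. The function name, uniform default weighting, and action labels are illustrative assumptions, not GAMBIT's exact definition.

```python
# Hypothetical sketch of a length-weighted LCS progress score between a
# predicted and a reference action sequence. The weighting scheme
# (per-action weights normalized by the reference's total weight) is an
# illustrative assumption, not GAMBIT's published formulation.

def weighted_lcs_score(pred, ref, weight=None):
    """Return a [0, 1] score: summed weights of the longest common
    subsequence of pred and ref, normalized by ref's total weight."""
    if weight is None:
        weight = lambda action: 1.0  # uniform weights by default (assumption)
    m, n = len(pred), len(ref)
    # dp[i][j] = max matched weight using pred[:i] and ref[:j]
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pred[i - 1] == ref[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + weight(ref[j - 1])
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    total = sum(weight(a) for a in ref)
    return dp[m][n] / total if total > 0 else 0.0

# Example with hypothetical action labels: a 6-step reference trajectory
# where the agent skips two actions scores ~0.67 under uniform weights.
ref = ["open_app", "tap_search", "type_query", "tap_result", "scroll", "tap_buy"]
pred = ["open_app", "tap_search", "tap_result", "tap_buy"]
print(weighted_lcs_score(pred, ref))  # 0.666...
```

Unlike exact-match step accuracy, a subsequence-based score of this kind credits partial progress on long trajectories while still penalizing missing or out-of-order actions, which is presumably why the paper pairs it with a separate branch-level decision-accuracy metric.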
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Submission Number: 22064