Keywords: Mobile GUI Agent, Dataset, Benchmark
TL;DR: We introduce GAMBIT, a graph-structured benchmark for decision-aware mobile GUI agents that reveals severe performance drops on long-horizon and branching tasks, providing a challenging diagnostic testbed for future agent development.
Abstract: Mobile GUI agents powered by large multimodal models (LMMs) can perceive screens and follow instructions, yet existing benchmarks largely target short, linear workflows and step-level accuracy, offering limited insight into long-horizon planning and branching tasks. We present GAMBIT, a graph-structured, decision-aware benchmark comprising 830 task episodes and 11,345 actions across 35 applications on Android and iOS. Tasks are organized into Sequential, Conjunctive, Conditional, and Hierarchical workflows with dual-level annotations, capturing realistic multi-step and branching scenarios. To move beyond step-level metrics, we introduce a weighted longest common subsequence (WLCS) metric for length-sensitive progress measurement and a decision-accuracy metric for branch correctness. Evaluations of 7 diverse agents show that GAMBIT induces a substantial accuracy drop compared to prior datasets, with success rates falling below 5% on 6–8 step tasks and branch accuracy averaging 38%, underscoring weaknesses in conditional reasoning. By systematically exposing these failure modes, GAMBIT provides a challenging, diagnostic testbed for advancing decision-aware mobile GUI agents. Our code and dataset are available at: https://anonymous.4open.science/r/GAMBIT-40BB/.
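The abstract names a weighted longest common subsequence as the length-sensitive progress metric. As a rough illustration of the idea, here is a minimal sketch of one plausible formulation: LCS over action sequences with per-action weights normalized by the reference's total weight. The function name, uniform default weighting, and action labels are illustrative assumptions, not GAMBIT's exact definition.

```python
# Hypothetical sketch of a length-weighted LCS progress score between a
# predicted and a reference action sequence. The weighting scheme
# (per-action weights normalized by the reference's total weight) is an
# illustrative assumption, not GAMBIT's published formulation.

def weighted_lcs_score(pred, ref, weight=None):
    """Return a [0, 1] score: summed weights of the longest common
    subsequence of pred and ref, normalized by ref's total weight."""
    if weight is None:
        weight = lambda action: 1.0  # uniform weights by default (assumption)
    m, n = len(pred), len(ref)
    # dp[i][j] = max matched weight using pred[:i] and ref[:j]
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pred[i - 1] == ref[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + weight(ref[j - 1])
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    total = sum(weight(a) for a in ref)
    return dp[m][n] / total if total > 0 else 0.0

# Example with hypothetical action labels: a 6-step reference trajectory
# where the agent skips two actions scores ~0.67 under uniform weights.
ref = ["open_app", "tap_search", "type_query", "tap_result", "scroll", "tap_buy"]
pred = ["open_app", "tap_search", "tap_result", "tap_buy"]
print(weighted_lcs_score(pred, ref))  # 0.666...
```

Unlike exact-match step accuracy, a subsequence-based score of this kind credits partial progress on long trajectories while still penalizing missing or out-of-order actions, which is presumably why the paper pairs it with a separate branch-level decision-accuracy metric.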
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Submission Number: 22064