ScenDroid: A Scenario-Level Benchmark for Long-Horizon, Time-Evolving GUI Agents

Published: 02 Mar 2026, Last Modified: 10 Apr 2026, Venue: LLA 2026 Poster, License: CC BY 4.0
Keywords: Benchmark, VLM, GUI Agents, Long-Horizon
Abstract: Recent breakthroughs in Vision Language Models (VLMs) have empowered GUI agents to perform discrete, short-horizon tasks. However, prevailing benchmarks predominantly rely on an Atomic Reset paradigm, which treats user interactions as isolated episodes, thereby failing to capture the continuous and state-dependent nature of real-world workflows. To bridge this gap, we introduce **ScenDroid**, which employs a novel **App–Task–Scenario (ATS)** architecture to orchestrate dependency-aware workflows (exceeding **1200 steps**) across persistent Android environments spanning simulated days to weeks. Beyond operational execution, ScenDroid incorporates a **Progressive Ambiguity Taxonomy** and an integrated **Interactive User Simulator** to assess an agent’s capacity for proactive intent clarification and long-term preference alignment. Our extensive evaluation of state-of-the-art GUI agents reveals a catastrophic performance collapse, elucidating critical cognitive bottlenecks in structured episodic memory and closed-loop reasoning. We further deconstruct these failure modes and provide a strategic roadmap for developing "Digital Agents" capable of persistent, autonomous, and human-aligned interaction. We release all scenarios, environment snapshots, agents, and evaluation data to catalyze research into these long-term challenges.
Submission Number: 135