GUITAR: Unmasking the Competence Illusion in GUI Agent Evaluation via State-Transition Representation

ACL ARR 2026 January Submission4390 Authors

05 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0
Keywords: GUI Agent, Diagnosis, Analysis
Abstract: Recent advances in Vision-Language Models have spurred interest in Graphical User Interface (GUI) agents. However, current evaluation, which relies on step accuracy, harbors a critical blind spot: it conflates frequency with competence. In long-tailed GUI tasks, repetitive actions and screens dominate trajectories, inducing a Competence Illusion: a frequency-driven evaluation bias in which performance on high-frequency screens disproportionately shapes the measured score. Decoupling performance from frequency is essential but challenging, because raw screens offer no structural, frequency-agnostic abstraction or aggregation. To bridge this gap, we propose a state-centric schema and introduce GUITAR, a diagnostic framework that represents execution trajectories as State Transition Graphs and enables multi-granular diagnosis of failures. Through extensive experiments, we show that execution failures are highly concentrated in a small number of states and transitions, including rare but critical transitions that remain invisible under conventional evaluation. Our findings highlight fundamental limitations of frequency-weighted metrics and demonstrate the necessity of state- and transition-level diagnostics for faithful evaluation of GUI agents.
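To make the frequency bias described in the abstract concrete, the following toy sketch contrasts pooled step accuracy with an accuracy averaged per state. It is an illustration only, not the GUITAR framework or its metrics: the state labels, the step log, and both helper functions (`step_accuracy`, `state_macro_accuracy`) are hypothetical and assume each step has already been assigned to an abstract state.

```python
from collections import defaultdict


def step_accuracy(steps):
    """Fraction of correct steps, pooled over all steps (frequency-weighted)."""
    return sum(ok for _, ok in steps) / len(steps)


def state_macro_accuracy(steps):
    """Mean of per-state accuracies, so each distinct state counts once."""
    per_state = defaultdict(list)
    for state, ok in steps:
        per_state[state].append(ok)
    return sum(sum(v) / len(v) for v in per_state.values()) / len(per_state)


# Hypothetical log of (state_id, step_correct) pairs: two frequent states the
# agent handles well and one rare state where it always fails.
steps = (
    [("home", True)] * 90
    + [("settings", True)] * 8
    + [("payment_confirm", False)] * 2
)

print(f"step accuracy:        {step_accuracy(steps):.3f}")
print(f"state-macro accuracy: {state_macro_accuracy(steps):.3f}")
```

In this made-up log the pooled score stays high because the rare failing state contributes only 2 of 100 steps, while the per-state average drops sharply; that gap is the kind of effect the abstract calls the Competence Illusion.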
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: multi-modal agents, agent evaluation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 4390