GUITAR: Unmasking the Competence Illusion in GUI Agent Evaluation via State-Transition Representation
Keywords: GUI Agent, Diagnosis, Analysis
Abstract: Recent advancements in Vision-Language Models have spurred interest in Graphical User Interface (GUI) agents.
However, current evaluation, which relies on step-level accuracy, harbors a critical blind spot: it conflates frequency with competence.
In long-tailed GUI tasks, repetitive actions and screens dominate trajectories, inducing a Competence Illusion: a frequency-driven evaluation bias in which performance on high-frequency screens disproportionately shapes the measured score.
Decoupling performance from frequency is essential, but it is challenging because raw screens lack a structured, frequency-agnostic abstraction over which results can be aggregated.
To bridge this gap, we propose a state-centric schema and introduce GUITAR, a diagnostic framework that represents execution trajectories as State Transition Graphs and enables multi-granular diagnosis of failures.
Through extensive experiments, we show that execution failures are highly concentrated on a small number of states and transitions, including rare but critical transitions that remain invisible under conventional evaluation.
Our findings highlight fundamental limitations of frequency-weighted metrics and demonstrate the necessity of state- and transition-level diagnostics for faithful evaluation of GUI agents.
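The gap between a frequency-weighted score and a state-level view can be illustrated with a toy example. The sketch below is not the GUITAR implementation: the state labels, the step log, and the helper names (`step_accuracy`, `per_state_accuracy`, `macro_state_accuracy`, `transition_failure_rate`) are all hypothetical, and the macro average over states is only one assumed way to decouple performance from frequency.

```python
from collections import defaultdict

# Hypothetical step log: (source_state, target_state, step_was_correct).
# Eight repetitive steps on a frequent screen, one transition into a rare
# dialog, and one failed step on that rare dialog.
steps = (
    [("list", "list", True)] * 8
    + [("list", "dialog", True)]
    + [("dialog", "done", False)]
)

def step_accuracy(steps):
    """Frequency-weighted score: fraction of all steps that are correct."""
    return sum(ok for _, _, ok in steps) / len(steps)

def per_state_accuracy(steps):
    """Accuracy computed separately for each source state."""
    totals, correct = defaultdict(int), defaultdict(int)
    for src, _, ok in steps:
        totals[src] += 1
        correct[src] += ok
    return {s: correct[s] / totals[s] for s in totals}

def macro_state_accuracy(steps):
    """Unweighted mean over states, so each state counts once (an assumed metric)."""
    acc = per_state_accuracy(steps)
    return sum(acc.values()) / len(acc)

def transition_failure_rate(steps):
    """Failure rate per edge (source_state, target_state) of the transition graph."""
    totals, failed = defaultdict(int), defaultdict(int)
    for src, dst, ok in steps:
        totals[(src, dst)] += 1
        failed[(src, dst)] += not ok
    return {edge: failed[edge] / totals[edge] for edge in totals}

print(step_accuracy(steps))            # frequency-weighted view
print(macro_state_accuracy(steps))     # state-level view
print(transition_failure_rate(steps))  # edge-level view
```

In this toy log the frequent `list` screen dominates the step-level score, while the state-level and edge-level views expose the failure on the rare `dialog` state and its outgoing transition.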
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: multi-modal agents, agent evaluation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 4390