Keywords: Multi-modal Large Language Models; GUI Agents
TL;DR: Precog-UI recasts GUI agents from reactive executors into foresight-driven decision-makers that anticipate disturbances and close the loop on errors, yielding state-of-the-art robustness and success on long-horizon, dynamic tasks.
Abstract: Existing reactive Graphical User Interface (GUI) agents often fail in long-horizon, dynamic scenarios where unexpected disturbances trigger attention hijacking and cascading failures. To address this, we propose \textbf{Precog-UI}, an agent built on a pre-cognitive architecture that shifts the paradigm from reactive execution to proactive decision-making. Specifically, we first design a Proactive Experience Pool (PEP), which caches frequently occurring anomaly and success patterns as "state-action-result" tuples within a graph structure, forming a composable prior memory. We then introduce a Proactive Modelling Executor (PME) that learns a predictive foresight model to forecast the next symbolic UI layout following a candidate action, enabling the agent to avoid potential anomalies and to estimate policy success rates. Finally, a Pre-cognitive Execution Controller (PEC) fuses these priors and predictions, prioritises the handling of foreseen anomalies, and ensures execution robustness through a closed-loop error-correction mechanism. For robust evaluation, we develop an automatic engine, AutoTraj, to construct InterfereBench, a benchmark of long-horizon tasks with strong disturbances. Experiments demonstrate that Precog-UI surpasses existing state-of-the-art methods on InterfereBench while maintaining competitive performance on public benchmarks. The code and models will be publicly available.
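Note: the submission does not spell out how the Proactive Experience Pool is implemented. As a rough, non-authoritative sketch only, the Python snippet below shows one way "state-action-result" tuples might be cached in a graph-like prior memory and queried for a candidate action before execution. All names here (Transition, ExperiencePool, lookup) are hypothetical illustrations, not the authors' code.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Transition:
    """A cached "state-action-result" tuple: symbolic UI state, action, observed outcome."""
    state: str    # symbolic UI layout signature (e.g. a hash of visible widgets)
    action: str   # candidate GUI action (e.g. "click:update_button")
    result: str   # observed outcome label ("success", "popup_anomaly", ...)

class ExperiencePool:
    """Graph-structured prior memory: states are nodes, (action, result) pairs are edges."""
    def __init__(self):
        self.edges = defaultdict(list)  # state -> list of (action, result, count)

    def add(self, t: Transition):
        # Merge repeated observations so frequently occurring patterns dominate the prior.
        for i, (a, r, c) in enumerate(self.edges[t.state]):
            if a == t.action and r == t.result:
                self.edges[t.state][i] = (a, r, c + 1)
                return
        self.edges[t.state].append((t.action, t.result, 1))

    def lookup(self, state: str, action: str):
        """Return empirical outcome frequencies for a candidate action in a given state."""
        counts = {r: c for a, r, c in self.edges[state] if a == action}
        total = sum(counts.values())
        return {r: c / total for r, c in counts.items()} if total else {}

# Usage: cache observed patterns, then consult the prior before committing to an action.
pool = ExperiencePool()
pool.add(Transition("settings_page", "click:update", "popup_anomaly"))
pool.add(Transition("settings_page", "click:update", "success"))
print(pool.lookup("settings_page", "click:update"))  # e.g. {'popup_anomaly': 0.5, 'success': 0.5}
```

In such a design, the controller could treat a high predicted anomaly frequency as a trigger to pre-empt the disturbance rather than react to it after the fact; whether Precog-UI does this exactly is not stated in the abstract.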
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 14806