Keywords: GUI Agent
Abstract: The proliferation of Large Language Models (LLMs) and Vision-Language Models (VLMs) has driven the development of general-purpose agents for Graphical User Interface (GUI) automation.
Despite this progress, the practical application of these agents is hindered by their fragility, which stems from three primary limitations: low retrieval accuracy in retrieval-augmented generation (RAG), over-reliance on single-modality perception, and inadequate failure recovery mechanisms.
To address these challenges, we introduce \textbf{VistaGUI}, a robust, multi-modal GUI agent that integrates optimized retrieval, adaptive sensing, and environment-aware state management into a unified framework.
The core contributions of VistaGUI are threefold. First, a parallel instruction-understanding module improves retrieval accuracy and the comprehension of user intent, enabling more precise, context-aware decision-making.
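As a rough illustration only (the paper does not describe its implementation at this level), parallel instruction understanding could feed retrieval along the following lines; the helpers `decompose_instruction` and `retrieve` are hypothetical names introduced here, not VistaGUI's API:

```python
# Illustrative sketch: decompose the user instruction into sub-intents in
# parallel, retrieve knowledge for each, and merge the results before planning.
from concurrent.futures import ThreadPoolExecutor


def decompose_instruction(instruction: str) -> list[str]:
    """Hypothetical stand-in for an LLM call that splits an instruction into sub-intents."""
    # e.g. "Export the report and email it" -> ["export the report", "email it"]
    return [part.strip() for part in instruction.split(" and ")]


def retrieve(query: str, knowledge_base: dict[str, str]) -> list[str]:
    """Toy retriever: return entries whose key appears in the query."""
    return [doc for key, doc in knowledge_base.items() if key in query]


def parallel_retrieval(instruction: str, knowledge_base: dict[str, str]) -> list[str]:
    sub_intents = decompose_instruction(instruction)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda q: retrieve(q, knowledge_base), sub_intents))
    # Merge and deduplicate the retrieved context for the planner.
    merged: list[str] = []
    for docs in results:
        for doc in docs:
            if doc not in merged:
                merged.append(doc)
    return merged
```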
Second, an adaptive multi-modal sensing module dynamically selects the optimal perception modality—including API-based queries, visual perception, and OCR—to achieve a comprehensive understanding of diverse GUI environments.
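A minimal sketch of such modality selection, assuming a confidence-thresholded fallback order from API queries to OCR to visual perception; all interfaces and the threshold value are assumptions for exposition, not the paper's design:

```python
# Illustrative sketch: pick the cheapest perception modality that is available
# and confident enough, falling back from API queries to OCR to a VLM.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Observation:
    modality: str
    content: str
    confidence: float


def adaptive_sense(
    api_query: Callable[[], Optional[Observation]],
    ocr: Callable[[], Observation],
    visual_model: Callable[[], Observation],
    min_confidence: float = 0.7,
) -> Observation:
    # 1) Structured API/accessibility data is preferred when the app exposes it.
    obs = api_query()
    if obs is not None and obs.confidence >= min_confidence:
        return obs
    # 2) OCR handles text-heavy screens at low cost.
    obs = ocr()
    if obs.confidence >= min_confidence:
        return obs
    # 3) Fall back to the VLM for icons, layout, and ambiguous content.
    return visual_model()
```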
Third, an environment-aware state management system records and analyzes interaction trajectories to proactively detect and efficiently recover from execution failures, thereby reducing replanning overhead.
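One possible shape for the recorded trajectory and its failure checks, sketched under assumed data structures and heuristics rather than VistaGUI's actual state manager:

```python
# Illustrative sketch: record each (state, action, result) step, detect
# failures such as repeated no-progress actions, and identify the last
# verified state to recover to instead of replanning from scratch.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    state_hash: str      # hash of the observed GUI state
    action: str          # action issued by the planner
    succeeded: bool      # whether the action's post-condition was verified


@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)

    def record(self, step: Step) -> None:
        self.steps.append(step)

    def failure_detected(self, window: int = 3) -> bool:
        # Heuristic: the last `window` actions changed nothing or all failed.
        recent = self.steps[-window:]
        if len(recent) < window:
            return False
        stuck = len({s.state_hash for s in recent}) == 1
        failed = all(not s.succeeded for s in recent)
        return stuck or failed

    def last_good_state(self) -> Optional[str]:
        # Recovery target: the most recent state whose action was verified.
        for step in reversed(self.steps):
            if step.succeeded:
                return step.state_hash
        return None
```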
VistaGUI is implemented within a modular architecture comprising a Knowledge Manager, Planner, Action Executor, and History Context Recorder.
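A hypothetical composition of these four modules, with method names and control flow assumed purely for exposition and not taken from the released system:

```python
# Illustrative sketch of how the four named modules could be wired together.
class VistaGUIAgent:
    def __init__(self, knowledge_manager, planner, action_executor, history_recorder):
        self.knowledge_manager = knowledge_manager
        self.planner = planner
        self.action_executor = action_executor
        self.history_recorder = history_recorder

    def run(self, instruction: str, max_steps: int = 50) -> bool:
        context = self.knowledge_manager.retrieve(instruction)
        for _ in range(max_steps):
            observation = self.action_executor.observe()
            action = self.planner.next_action(
                instruction, context, observation, self.history_recorder.trajectory()
            )
            if action is None:  # planner signals task completion
                return True
            result = self.action_executor.execute(action)
            self.history_recorder.record(observation, action, result)
            if self.history_recorder.failure_detected():
                # Recover from the recorded trajectory rather than replanning fully.
                self.action_executor.restore(self.history_recorder.last_good_state())
        return False
```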
Extensive experiments conducted on a diverse set of GUI automation tasks demonstrate that VistaGUI significantly outperforms strong baselines in task success rate, recovery speed, and overall robustness.
Supplementary Material: pdf
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 22623