Keywords: GUI Agent
Abstract: The proliferation of Large Language Models (LLMs) and Vision-Language Models (VLMs) has driven the development of general-purpose agents for Graphical User Interface (GUI) automation.
Despite this progress, the practical application of these agents is hindered by their fragility, which stems from three primary limitations: low retrieval accuracy in retrieval-augmented generation (RAG), over-reliance on single-modality perception, and inadequate failure recovery mechanisms.
To address these challenges, we introduce \textbf{VistaGUI}, a robust, multi-modal GUI agent that integrates optimized retrieval, adaptive sensing, and environment-aware state management into a unified framework.
The core contributions of VistaGUI are threefold. First, a parallel instruction-understanding module improves retrieval accuracy and the comprehension of user intent, enabling more precise, context-aware decision-making.
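As a rough illustration only (the paper does not describe its implementation at this level), parallel instruction understanding could feed retrieval along the following lines; the helpers `decompose_instruction` and `retrieve` are hypothetical names introduced here, not VistaGUI's API:

```python
# Illustrative sketch: decompose the user instruction into sub-intents in
# parallel, retrieve knowledge for each, and merge the results before planning.
from concurrent.futures import ThreadPoolExecutor


def decompose_instruction(instruction: str) -> list[str]:
    """Hypothetical stand-in for an LLM call that splits an instruction into sub-intents."""
    # e.g. "Export the report and email it" -> ["export the report", "email it"]
    return [part.strip() for part in instruction.split(" and ")]


def retrieve(query: str, knowledge_base: dict[str, str]) -> list[str]:
    """Toy retriever: return entries whose key appears in the query."""
    return [doc for key, doc in knowledge_base.items() if key in query]


def parallel_retrieval(instruction: str, knowledge_base: dict[str, str]) -> list[str]:
    sub_intents = decompose_instruction(instruction)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda q: retrieve(q, knowledge_base), sub_intents))
    # Merge and deduplicate the retrieved context for the planner.
    merged: list[str] = []
    for docs in results:
        for doc in docs:
            if doc not in merged:
                merged.append(doc)
    return merged
```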
Second, an adaptive multi-modal sensing module dynamically selects the optimal perception modality—including API-based queries, visual perception, and OCR—to achieve a comprehensive understanding of diverse GUI environments.
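A minimal sketch of such modality selection, assuming a confidence-thresholded fallback order from API queries to OCR to visual perception; all interfaces and the threshold value are assumptions for exposition, not the paper's design:

```python
# Illustrative sketch: pick the cheapest perception modality that is available
# and confident enough, falling back from API queries to OCR to a VLM.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Observation:
    modality: str
    content: str
    confidence: float


def adaptive_sense(
    api_query: Callable[[], Optional[Observation]],
    ocr: Callable[[], Observation],
    visual_model: Callable[[], Observation],
    min_confidence: float = 0.7,
) -> Observation:
    # 1) Structured API/accessibility data is preferred when the app exposes it.
    obs = api_query()
    if obs is not None and obs.confidence >= min_confidence:
        return obs
    # 2) OCR handles text-heavy screens at low cost.
    obs = ocr()
    if obs.confidence >= min_confidence:
        return obs
    # 3) Fall back to the VLM for icons, layout, and ambiguous content.
    return visual_model()
```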
Third, an environment-aware state management system records and analyzes interaction trajectories to proactively detect and efficiently recover from execution failures, thereby reducing replanning overhead.
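One possible shape for the recorded trajectory and its failure checks, sketched under assumed data structures and heuristics rather than VistaGUI's actual state manager:

```python
# Illustrative sketch: record each (state, action, result) step, detect
# failures such as repeated no-progress actions, and identify the last
# verified state to recover to instead of replanning from scratch.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    state_hash: str      # hash of the observed GUI state
    action: str          # action issued by the planner
    succeeded: bool      # whether the action's post-condition was verified


@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)

    def record(self, step: Step) -> None:
        self.steps.append(step)

    def failure_detected(self, window: int = 3) -> bool:
        # Heuristic: the last `window` actions changed nothing or all failed.
        recent = self.steps[-window:]
        if len(recent) < window:
            return False
        stuck = len({s.state_hash for s in recent}) == 1
        failed = all(not s.succeeded for s in recent)
        return stuck or failed

    def last_good_state(self) -> Optional[str]:
        # Recovery target: the most recent state whose action was verified.
        for step in reversed(self.steps):
            if step.succeeded:
                return step.state_hash
        return None
```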
VistaGUI is implemented within a modular architecture comprising a Knowledge Manager, Planner, Action Executor, and History Context Recorder.
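A hypothetical composition of these four modules, with method names and control flow assumed purely for exposition and not taken from the released system:

```python
# Illustrative sketch of how the four named modules could be wired together.
class VistaGUIAgent:
    def __init__(self, knowledge_manager, planner, action_executor, history_recorder):
        self.knowledge_manager = knowledge_manager
        self.planner = planner
        self.action_executor = action_executor
        self.history_recorder = history_recorder

    def run(self, instruction: str, max_steps: int = 50) -> bool:
        context = self.knowledge_manager.retrieve(instruction)
        for _ in range(max_steps):
            observation = self.action_executor.observe()
            action = self.planner.next_action(
                instruction, context, observation, self.history_recorder.trajectory()
            )
            if action is None:  # planner signals task completion
                return True
            result = self.action_executor.execute(action)
            self.history_recorder.record(observation, action, result)
            if self.history_recorder.failure_detected():
                # Recover from the recorded trajectory rather than replanning fully.
                self.action_executor.restore(self.history_recorder.last_good_state())
        return False
```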
Extensive experiments conducted on a diverse set of GUI automation tasks demonstrate that VistaGUI significantly outperforms strong baselines in task success rate, recovery speed, and overall robustness.
Supplementary Material: pdf
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 22623