Abstract: Automating GUI tasks remains challenging due to reliance on textual representations, platform-specific action spaces, and limited reasoning capabilities. We introduce Aguvis, a unified vision-based framework for autonomous GUI agents that operates directly on screen images, standardizes cross-platform interactions, and incorporates structured reasoning via inner monologue. To enable this, we construct the Aguvis data collection, a large-scale dataset with multimodal grounding and reasoning annotations, and develop a two-stage training pipeline that separates GUI grounding from planning and reasoning. Experiments show that Aguvis achieves state-of-the-art performance across offline and real-world online benchmarks, making it the first fully autonomous vision-based GUI agent that operates without relying on closed-source models. We open-source all datasets, models, and training recipes at https://aguvis-project.github.io to advance future research.
Lay Summary: Most computer users interact with software through graphical interfaces. Creating AI agents that can automate these tasks could dramatically improve productivity, but current approaches face major limitations. Existing methods typically read the underlying code of interfaces (such as HTML or accessibility trees) rather than seeing them visually, require platform-specific programming for each device, and struggle with complex reasoning.
We developed AGUVIS, an AI agent that interacts with computer interfaces the same way humans do—by looking at the screen and understanding what it sees. Instead of reading code, our agent processes screenshots directly and uses a unified approach to control any device, whether it's a website, mobile app, or desktop program. Crucially, we taught the agent to "think out loud" through inner monologue, allowing it to plan multi-step tasks and adapt to new situations rather than just reacting automatically.
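As a rough illustration of what such a unified, vision-based control scheme can look like, the sketch below defines abstract click and type actions over normalized screenshot coordinates and renders them as pyautogui-style commands for a desktop or web backend (a mobile backend could translate the same actions to touch events instead). The class names, fields, and the `to_pyautogui` helper are our own illustrative assumptions for exposition, not the released Aguvis interface.

```python
# Minimal sketch (not the paper's actual API): a platform-agnostic action space
# expressed over normalized screenshot coordinates, rendered per backend.
from dataclasses import dataclass

@dataclass
class Click:
    x: float  # normalized [0, 1] horizontal position on the screenshot
    y: float  # normalized [0, 1] vertical position on the screenshot

@dataclass
class TypeText:
    text: str

def to_pyautogui(action, screen_w: int, screen_h: int) -> str:
    """Render a unified action as a pyautogui command string for desktop/web.
    A mobile backend would map the same action to touch/ADB events instead."""
    if isinstance(action, Click):
        return f"pyautogui.click(x={int(action.x * screen_w)}, y={int(action.y * screen_h)})"
    if isinstance(action, TypeText):
        return f"pyautogui.write({action.text!r})"
    raise ValueError(f"unsupported action: {action!r}")

print(to_pyautogui(Click(0.42, 0.17), 1920, 1080))  # -> pyautogui.click(x=806, y=183)
```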
Our approach achieves the best performance on standard benchmarks while being significantly more efficient than previous methods. This represents the first fully autonomous visual interface agent that works without depending on proprietary AI systems. By making our datasets, models, and training methods publicly available, we're providing a foundation that could accelerate the development of AI assistants capable of automating routine computer tasks across any platform.
Link To Code: https://github.com/xlang-ai/aguvis
Primary Area: Applications->Everything Else
Keywords: GUI Agent, Visual Language Model, Large Language Model, Grounding, Reasoning, Planning, Computer Use Agent, Vision Language Action Model
Submission Number: 5811