Keywords: Vision–Language Model, World Model, Computer Use Agent
Abstract: Vision-language models have emerged as capable computer-use agents, showing increasing potential to automate a wide range of computer tasks through graphical user interfaces. However, their effectiveness remains bounded by a fundamental limitation: current LLM- or VLM-based agents struggle to generalize to unfamiliar applications and remain heavily dependent on large-scale, human-curated datasets. To address this, we introduce ScreenExplorer, a novel VLM-based agent designed for autonomous exploration in real, dynamic, open-ended GUI environments. Through end-to-end training with an exploration-driven objective, our approach enables sustained interaction and diverse discovery without relying on predefined task structures. Specifically, we design a world-model-inspired curiosity reward that helps the agent overcome the cold-start phase of exploration, coupled with state-change-based exploration rewards that encourage the agent's intrinsic motivation to venture into novel states. Additionally, an experience stream distillation mechanism is designed to systematically accumulate and refine exploratory policies, enabling continual learning from gathered experiences. Extensive evaluations demonstrate that ScreenExplorer achieves remarkable generalization and diverse exploration capabilities in unseen applications, significantly outperforming static deployment baselines. This work establishes a new paradigm for GUI agents to progressively learn through autonomous exploration, moving beyond static dataset dependency toward adaptive, lifelong learning in complex digital worlds.
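To make the reward design described in the abstract more concrete, below is a minimal, hypothetical sketch of how a world-model-based curiosity term and a state-change term might be combined. The submission does not specify its actual formulation; the module names, embedding dimensions, and weighting scheme here are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch of the exploration rewards described in the abstract.
# All names, shapes, and weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentWorldModel(nn.Module):
    """Predicts the next screen embedding from the current embedding and action."""

    def __init__(self, state_dim: int = 512, action_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 512),
            nn.ReLU(),
            nn.Linear(512, state_dim),
        )

    def forward(self, state_emb: torch.Tensor, action_emb: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state_emb, action_emb], dim=-1))


def exploration_reward(
    world_model: LatentWorldModel,
    state_emb: torch.Tensor,       # embedding of the screen before the action
    action_emb: torch.Tensor,      # embedding of the GUI action taken
    next_state_emb: torch.Tensor,  # embedding of the screen after the action
    curiosity_weight: float = 1.0,
    state_change_weight: float = 1.0,
) -> torch.Tensor:
    """Curiosity term = world-model prediction error; state-change term = how far
    the screen actually moved. Both reward transitions into novel states."""
    predicted_next = world_model(state_emb, action_emb)
    curiosity = F.mse_loss(predicted_next, next_state_emb, reduction="none").mean(dim=-1)
    state_change = 1.0 - F.cosine_similarity(state_emb, next_state_emb, dim=-1)
    return curiosity_weight * curiosity + state_change_weight * state_change


if __name__ == "__main__":
    wm = LatentWorldModel()
    s, a, s_next = torch.randn(4, 512), torch.randn(4, 32), torch.randn(4, 512)
    print(exploration_reward(wm, s, a, s_next))  # one scalar reward per transition
```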
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18228