Abstract: GUI automation faces critical challenges in dynamic environments. MLLMs suffer from two key issues: misinterpreting UI components and outdated knowledge. Traditional fine-tuning methods are costly for app-specific knowledge updates. We propose GUI-explorer, a training-free GUI agent that incorporates two fundamental mechanisms: $\textbf{(1) Autonomous Exploration of Function-aware Trajectory}$. To comprehensively cover all application functionalities, we design a $\textbf{Function-aware Task Goal Generator}$ that automatically constructs exploration goals by analyzing GUI structural information (e.g., screenshots and activity hierarchies), enabling systematic exploration that collects diverse trajectories. $\textbf{(2) Unsupervised Mining of Transition-aware Knowledge}$. To establish precise screen-operation logic, we develop a $\textbf{Transition-aware Knowledge Extractor}$ that derives effective screen-operation logic through unsupervised analysis of the state transitions in structured interaction triples (observation, action, outcome), eliminating the need for human involvement in knowledge extraction. With a task success rate of 53.7\% on SPA-Bench and 47.4\% on AndroidWorld, GUI-explorer achieves significant improvements over SOTA agents while requiring no parameter updates for new apps. All data and code will be publicly released on GitHub after acceptance.
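The transition-aware mining described above can be illustrated with a minimal sketch: group (observation, action, outcome) triples collected during exploration and keep only actions whose outcome is consistent across trajectories as screen-operation knowledge. All function and state names below are hypothetical illustrations, not the paper's actual API.

```python
from collections import defaultdict

def mine_transition_knowledge(triples):
    """Hypothetical sketch of unsupervised transition mining:
    group (observation, action, outcome) triples and keep only
    actions whose outcome is consistent across trajectories."""
    outcomes = defaultdict(set)
    for observation, action, outcome in triples:
        outcomes[(observation, action)].add(outcome)
    # An action with a single observed outcome is treated as a
    # reliable screen-operation rule; ambiguous ones are discarded.
    return {
        key: next(iter(results))
        for key, results in outcomes.items()
        if len(results) == 1
    }

# Toy trajectories (invented states/actions for illustration):
triples = [
    ("home_screen", "tap settings_icon", "settings_screen"),
    ("home_screen", "tap settings_icon", "settings_screen"),
    ("home_screen", "tap search_bar", "search_screen"),
    ("home_screen", "swipe left", "page_2"),
    ("home_screen", "swipe left", "error"),  # inconsistent -> dropped
]
knowledge = mine_transition_knowledge(triples)
print(knowledge)
```

Because the filter needs no labels or human review, this kind of rule extraction runs fully unsupervised over the collected trajectories.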
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: GUI Agent, GUI Automation
Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 3724