Keywords: GUI-Agent, out-of-distribution, Multimodal Large Language Models, Large Language Models
TL;DR: Instead of just memorizing, our GUI agent learns how to reason with external knowledge to solve tasks in new environments.
Abstract: Graphical User Interface (GUI) agents demonstrate significant potential in cross-application tasks, yet their performance often drops sharply in out-of-distribution (OOD) scenarios (e.g., unseen tasks or different layouts) in the open world.
Previous methods, whether modular agent frameworks or end-to-end native agents, are designed around in-distribution (ID) mobile data, through either manually designed modules or specially collected training sets, and neglect adaptability to the diverse data found in potential OOD mobile scenarios.
To overcome these limitations, we propose Dynamic Knowledge Reasoning Fine-tune (**DKRF**), a paradigm that shifts the agent's core capability from memorizing ID patterns to reasoning dynamically with external knowledge.
During training, the model *explicitly* receives dynamic knowledge (e.g., *trajectories of similar tasks* or *reusable meta-functions*) and needs to *incorporate* this knowledge into its reasoning chain, thereby learning to make knowledge-driven decisions.
Based on DKRF, we 1) train an end-to-end native agent, **DKR-GUI**, and 2) further propose a modular agent framework, **MA-DKR**, which uses DKR-GUI as the planning core and combines it with knowledge retrieval and an executing agent, so that complex reasoning and precise execution work together.
Experiments on multiple mobile benchmarks show that both DKR-GUI and MA-DKR significantly outperform existing methods, achieving an average 9.2\% improvement in success rate in OOD mobile scenarios while also maintaining state-of-the-art performance in ID mobile tasks.
Our results demonstrate that dynamic knowledge reasoning provides a general and effective solution for OOD generalization, highlighting its potential as a foundation for robust, knowledge-driven interactive agents.
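As a rough illustration of the idea of supplying dynamic knowledge alongside the task, the following minimal sketch retrieves the most similar stored trajectory and places it in the model input. It is not the authors' implementation: the `KnowledgeItem` type, the token-overlap retriever, and the prompt wording are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class KnowledgeItem:
    """Hypothetical container for one piece of external knowledge,
    e.g. the trajectory of a previously solved, similar task."""
    task: str
    steps: list[str]


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; an assumed stand-in for a real retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def retrieve(task: str, store: list[KnowledgeItem], k: int = 1) -> list[KnowledgeItem]:
    """Return the k stored items whose task description best matches `task`."""
    return sorted(store, key=lambda item: jaccard(task, item.task), reverse=True)[:k]


def build_prompt(task: str, knowledge: list[KnowledgeItem]) -> str:
    """Assemble a model input that exposes the retrieved knowledge explicitly
    and asks the model to use it in its reasoning (wording is illustrative)."""
    lines = [f"Task: {task}", "Retrieved knowledge:"]
    for item in knowledge:
        lines.append(f"- Similar task: {item.task}")
        lines.extend(f"  {i}. {step}" for i, step in enumerate(item.steps, 1))
    lines.append("Reason step by step, citing the retrieved knowledge, "
                 "then output the next action.")
    return "\n".join(lines)
```

For example, calling `build_prompt(task, retrieve(task, store))` with a store of earlier trajectories yields a single text input in which the knowledge is visible to the model, in the spirit of the training setup described above.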
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5197