InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: GUI Agents, MLLMs, Reinforcement Learning
Abstract: Multimodal Large Language Models (MLLMs) have shown significant promise in powering Graphical User Interface (GUI) agents to automate complex digital tasks. However, prevailing monolithic training paradigms often create a structural mismatch with the hierarchical nature of the capabilities required for robust performance. Specifically, the efficacy of methods like Reinforcement Learning (RL) is critically predicated on the agent possessing a high-quality behavioral prior of key reasoning skills, such as spatial reasoning and goal decomposition, which base models often lack. To resolve this impasse, we propose Actor2Reasoner, a novel two-stage hierarchical training paradigm grounded in the principle of Endow First, Internalize Later. The first stage, Cognitive Endowment, employs targeted supervised fine-tuning to instill these crucial thinking patterns, forging a Capable Actor. The second stage, Policy Internalization, uses RL to evolve this actor into a Deliberative Reasoner by internalizing the endowed abilities into a robust, context-aware decision-making policy. We instantiate our paradigm in InfiGUI-R1, an agent that achieves state-of-the-art performance on challenging benchmarks, including AndroidControl. Our work demonstrates that decoupling the endowment of foundational abilities from the internalization of policy provides a more effective and principled path toward developing sophisticated and resilient GUI agents.
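The two-stage structure described in the abstract can be sketched in toy form. Everything below (the `ToyAgent` class, the scalar `skill` proxy, the update rules, and the synthetic environment) is an illustrative stand-in, not the paper's actual training code; it only mirrors the pipeline shape: SFT to endow a behavioral prior, then RL to internalize it into a policy.

```python
# Toy sketch of the Actor2Reasoner "Endow First, Internalize Later" pipeline.
# All classes, names, and update rules are hypothetical placeholders.
import random

class ToyAgent:
    """Stands in for an MLLM; 'skill' is a scalar proxy for capability."""
    def __init__(self):
        self.skill = 0.0

    def sft_update(self, trace):
        # Stage 1 step: supervised fine-tuning on one reasoning trace.
        self.skill += 0.1

    def rl_update(self, reward):
        # Stage 2 step: reinforcement update scaled by task reward.
        self.skill += 0.05 * reward

def cognitive_endowment(agent, traces):
    """Stage 1 (Cognitive Endowment): SFT on reasoning traces
    (e.g. spatial reasoning, goal decomposition) -> a 'Capable Actor'."""
    for trace in traces:
        agent.sft_update(trace)
    return agent

def policy_internalization(agent, steps, seed=0):
    """Stage 2 (Policy Internalization): RL on task outcomes,
    internalizing the endowed prior -> a 'Deliberative Reasoner'."""
    rng = random.Random(seed)
    for _ in range(steps):
        # Toy environment: tasks succeed more often as skill grows.
        reward = 1.0 if rng.random() < min(1.0, agent.skill) else 0.0
        agent.rl_update(reward)
    return agent

agent = ToyAgent()
agent = cognitive_endowment(agent, traces=range(5))  # behavioral prior from SFT
agent = policy_internalization(agent, steps=20)      # RL builds on that prior
```

The key point the sketch captures is the ordering: RL updates only pay off once the SFT stage has raised the prior, which is the paper's motivation for endowing first and internalizing later.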
Primary Area: applications to robotics, autonomy, planning
Submission Number: 10070