MobileWizard: A Data-Efficient GUI Agent with Structured Reasoning and Progressive Reinforcement Learning

Weifeng Lin; Yuxiang Chai; Han Xiao; Liuyang Bian; Guangyi Liu; Liang Liu; Shuai Ren; Penggang Shi; Yafei Wen; Xiaoxin Chen; Aojun Zhou; Hongsheng Li

05 Sept 2025 (modified: 11 Feb 2026); Submitted to ICLR 2026; CC BY 4.0

Keywords: GUI Agent; VLM

Abstract: This paper introduces MobileWizard, a data-efficient framework designed to improve the reasoning and precision of mobile GUI agents. Trained on only 24.5k public trajectories and 300 remedial trajectories, MobileWizard-7B achieves a 47.2% success rate on AndroidWorld, outperforming prominent larger open-source models such as UI-TARS-72B. This efficiency stems from two core innovations. 1) Structured Reasoning: a structured Chain-of-Thought (CoT) paradigm that decomposes the agent’s reasoning process into four explicit, interpretable modules: self-verification, screen analysis, planning, and action guidance. The proposed CoT guides the LLM toward logical consistency and the extraction of key insights, and provides clear paths for failure analysis. 2) Progressive Reinforcement Learning: a comprehensive RL strategy with four key components: efficient cold-start training, a dynamic reward system with Progressive Reward Shrinking to boost precision, History Self-Alignment to narrow the training-inference gap, and a Corrective Teaching Pipeline for self-improvement from online failures. Experimental results demonstrate that the framework enables superior generalization from limited data. We believe the method offers a scalable and efficient path toward building more robust and versatile GUI agents.

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 2250
