Abstract: VLM-based mobile GUI agents excel at GUI interaction by employing a Chain of Action-Planning Thoughts (CoaT) paradigm that resembles System-2 CoT reasoning, and self-training methods are widely used to optimize the CoT process. However, the lack of diverse CoaT data restricts the agent's output space and limits its generalization ability, which is crucial for the sampling stage of self-training. Moreover, the existence of multiple correct answers in GUI tasks makes it difficult to train a process reward model, further hindering optimization of the CoaT process. To address these problems, we first enhance the diversity of the agent's output space through three-stage instruction evolution, then obtain high-quality positive and negative pairs at the CoaT action level using a rule-based value calculation algorithm, and finally apply iterative DPO training to optimize the agent's preferences over different action types. Experiments are conducted on the latest CoaT dataset AITZ, the long-trajectory dataset AMEX, and the comprehensive dataset AndroidControl. Our agent, MobileIPL, achieves state-of-the-art results on all three benchmarks and demonstrates strong generalization on the out-of-domain subsets of AndroidControl.
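The abstract describes building CoaT action-level preference pairs with a rule-based value calculation and optimizing them via iterative DPO. Below is a minimal, hypothetical sketch of that idea, not the paper's implementation: the `CoatStep` structure, the `rule_based_value` scoring rules, and the `beta` temperature are illustrative assumptions.

```python
# Hypothetical sketch: score sampled CoaT steps with a rule-based value,
# pair the best against the worst, and compute a standard DPO loss.
import math
from dataclasses import dataclass

@dataclass
class CoatStep:
    thought: str          # intermediate reasoning text
    action: str           # predicted GUI action, e.g. "CLICK(120, 340)"
    logp_policy: float    # log-prob of the step under the current policy
    logp_ref: float       # log-prob of the step under the frozen reference model

def rule_based_value(step: CoatStep, gold_action: str) -> float:
    """Toy stand-in for a rule-based value: exact action match scores 1.0,
    matching action type scores 0.5, anything else scores 0.0."""
    if step.action == gold_action:
        return 1.0
    if step.action.split("(")[0] == gold_action.split("(")[0]:
        return 0.5
    return 0.0

def dpo_loss(chosen: CoatStep, rejected: CoatStep, beta: float = 0.1) -> float:
    """Standard DPO objective on one preference pair:
    -log sigma(beta * [(logp_pi(y+) - logp_ref(y+)) - (logp_pi(y-) - logp_ref(y-))])."""
    margin = (chosen.logp_policy - chosen.logp_ref) - (rejected.logp_policy - rejected.logp_ref)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Usage: rank sampled steps against the gold action, then pair best vs. worst.
samples = [
    CoatStep("tap the search bar", "CLICK(120, 340)", logp_policy=-2.1, logp_ref=-2.4),
    CoatStep("scroll to find item", "SCROLL(down)",   logp_policy=-1.8, logp_ref=-1.7),
]
gold = "CLICK(120, 340)"
ranked = sorted(samples, key=lambda s: rule_based_value(s, gold), reverse=True)
print(dpo_loss(ranked[0], ranked[-1]))
```

In an iterative setting, the pairs would be regenerated after each DPO round from fresh samples of the updated policy; that loop is omitted here for brevity.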
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: multimodal applications, GUI-Agent, reinforcement learning
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 6050