Abstract: VLM-based mobile GUI agents excel at GUI interaction by employing a Chain of Action-Planning Thoughts (CoaT) paradigm that resembles System-2 CoT reasoning, and self-training methods are widely used to optimize the CoT process. However, the lack of diverse CoaT data restricts the agent's output space and limits its generalization ability, which is crucial for the sampling stage of self-training. Moreover, the existence of multiple correct answers in GUI tasks makes it difficult to train a process reward model, further hindering optimization of the CoaT process. To address these problems, we first enhance the diversity of the agent's output space through three-stage instruction evolution, then obtain high-quality positive and negative pairs at the CoaT action level using a rule-based value calculation algorithm, and finally apply iterative DPO training to optimize the agent's preferences over different action types. Experiments are conducted on the latest CoaT dataset AITZ, the long-trajectory dataset AMEX, and the comprehensive dataset AndroidControl. Our agent, MobileIPL, achieves state-of-the-art results on all three benchmarks and demonstrates strong generalization on the out-of-domain subsets of AndroidControl.
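The abstract describes building CoaT action-level preference pairs with a rule-based value calculation and optimizing them via iterative DPO. Below is a minimal, hypothetical sketch of that idea, not the paper's implementation: the `CoatStep` structure, the `rule_based_value` scoring rules, and the `beta` temperature are illustrative assumptions.

```python
# Hypothetical sketch: score sampled CoaT steps with a rule-based value,
# pair the best against the worst, and compute a standard DPO loss.
import math
from dataclasses import dataclass

@dataclass
class CoatStep:
    thought: str          # intermediate reasoning text
    action: str           # predicted GUI action, e.g. "CLICK(120, 340)"
    logp_policy: float    # log-prob of the step under the current policy
    logp_ref: float       # log-prob of the step under the frozen reference model

def rule_based_value(step: CoatStep, gold_action: str) -> float:
    """Toy stand-in for a rule-based value: exact action match scores 1.0,
    matching action type scores 0.5, anything else scores 0.0."""
    if step.action == gold_action:
        return 1.0
    if step.action.split("(")[0] == gold_action.split("(")[0]:
        return 0.5
    return 0.0

def dpo_loss(chosen: CoatStep, rejected: CoatStep, beta: float = 0.1) -> float:
    """Standard DPO objective on one preference pair:
    -log sigma(beta * [(logp_pi(y+) - logp_ref(y+)) - (logp_pi(y-) - logp_ref(y-))])."""
    margin = (chosen.logp_policy - chosen.logp_ref) - (rejected.logp_policy - rejected.logp_ref)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Usage: rank sampled steps against the gold action, then pair best vs. worst.
samples = [
    CoatStep("tap the search bar", "CLICK(120, 340)", logp_policy=-2.1, logp_ref=-2.4),
    CoatStep("scroll to find item", "SCROLL(down)",   logp_policy=-1.8, logp_ref=-1.7),
]
gold = "CLICK(120, 340)"
ranked = sorted(samples, key=lambda s: rule_based_value(s, gold), reverse=True)
print(dpo_loss(ranked[0], ranked[-1]))
```

In an iterative setting, the pairs would be regenerated after each DPO round from fresh samples of the updated policy; that loop is omitted here for brevity.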
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: multimodal applications, GUI-Agent, reinforcement learning
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 6050