DyPO: Dynamic Policy Optimization for Multi-Turn Interactive Reasoning

Published: 14 Jun 2025, Last Modified: 19 Jul 2025 · ICML 2025 Workshop PRAL · CC BY 4.0
Keywords: Large Language Models; Interactive Reasoning
TL;DR: We propose a reinforcement learning post-training approach that enhances the multi-turn reasoning capabilities of large language models in dynamic environments.
Track: Long Paper (up to 9 pages)
Abstract: Existing on-policy reinforcement learning methods, such as Group Relative Policy Optimization (GRPO) and its variants, have enhanced the reasoning capabilities of large language models (LLMs). However, these methods often rely on static, pre-trained knowledge to navigate partially observed contexts, limiting their effectiveness in dynamic and evolving environments. In such settings, LLMs must actively interact with the environment to gather critical information, necessitating further advances in adaptive reasoning strategies. To bridge this gap, we introduce $\textbf{Dy}$namic $\textbf{P}$olicy $\textbf{O}$ptimization (DyPO), which extends GRPO to multi-turn optimization in dynamic environments. In principle, DyPO guarantees the shift of the reasoning pattern from static to dynamic multi-turn reasoning and stabilizes training that incorporates environmental information. DyPO introduces four key innovations: (1) distinct thinking and action tokens that integrate real-time environmental feedback during rollouts, (2) removal of divergence regularization to enable the transition to dynamic reasoning, (3) masked intermediate observations with simplified advantage estimation for enhanced stability, and (4) auxiliary resampling with rejection sampling to mitigate over-generation noise. These enhancements enable DyPO to achieve adaptive alignment with multi-turn interactive reasoning. Evaluations on two challenging simulated benchmarks, ALFWorld and WebShop, using two instantiations of DyPO with Qwen-2.5-3B-Instruct consistently demonstrate substantial improvements in both interactive decision-making and reasoning capabilities over existing approaches.
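To make two of the abstract's mechanisms concrete, here is a minimal sketch of (a) a GRPO-style group-relative advantage and (b) masking environment-observation tokens out of the policy loss, corresponding to innovation (3). All function names and the `token_roles` encoding are hypothetical illustrations, not taken from the paper, and the advantage omits the usual std normalization to reflect the "simplified advantage estimation" described.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantage over a group of rollouts from the same prompt:
    each trajectory's reward is centered on the group mean (a simplified
    estimate; std normalization is deliberately omitted here)."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

def loss_mask(token_roles):
    """Return 1.0 for model-generated 'think'/'act' tokens and 0.0 for
    environment 'obs' tokens, so observations never receive gradient."""
    return np.array([0.0 if role == "obs" else 1.0 for role in token_roles])

def masked_policy_loss(logprobs, token_roles, advantage):
    """Per-trajectory policy-gradient loss with observation tokens masked,
    normalized by the number of unmasked (model-generated) tokens."""
    mask = loss_mask(token_roles)
    lp = np.asarray(logprobs, dtype=float)
    return -advantage * (lp * mask).sum() / max(mask.sum(), 1.0)
```

Under this sketch, an interleaved trajectory like `["think", "act", "obs", "act"]` contributes gradient only through its thinking and action tokens, while the advantage ties each trajectory's update to how it performed relative to its sampled group.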
Format: We have read the camera-ready instructions, and our paper is formatted with the provided template.
De-Anonymization: This submission has been de-anonymized.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 7