Keywords: Multi-turn LLM RL; LLM Post-training; LLM Agent
TL;DR: We propose MRRL, a post-training framework that enables LLMs to solve complex tasks via multi-round interactions, reflecting on environmental feedback and learning from successful trajectories to improve reasoning and efficiency.
Abstract: While large language models (LLMs) have demonstrated strong reasoning capabilities, solving complex tasks in interactive environments requires more than single-round responses: models must learn to act strategically, reflect on feedback, and iteratively refine their answers. In this paper, we propose Multi-round Reinforcement Learning (MRRL) for LLMs, a novel post-training framework in which models complete tasks through strategic interactions with environments while learning to revise their responses by reflecting on and integrating environment feedback. Beyond environment rewards, MRRL leverages text-based feedback from the environment, which provides richer and more explicit guidance toward task completion. We formalize the LLM interaction process as a Markov Decision Process (MDP) and derive the proximal policy gradient update for MRRL. To help the LLM reflect on and leverage environment feedback, we propose the Feedback Reflection Imitation Learning algorithm, which uses an LLM to generate feedback reflections and alleviates the distribution gap in imitation learning from successful trajectories. We conduct extensive experiments across diverse tasks, including text-based games, mathematical problems, search tasks, and logical puzzles, validating that MRRL achieves stable multi-round RL training for LLMs in various environments. Furthermore, MRRL reduces the number of interaction rounds needed to complete tasks, unleashing the full potential of LLMs in multi-round reasoning.
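The abstract mentions a proximal policy-gradient update over multi-round episodes without stating it. As a rough sketch only, assuming a standard PPO-style clipped surrogate (the notation below is our assumption, not taken from the paper): let the state $s_t$ at round $t$ be the task plus the interaction history of prior responses and environment feedback, let $a_t$ be the round-$t$ response emitted by the policy $\pi_\theta$, and let $\hat{A}_t$ be an advantage estimate computed from the environment rewards. The per-round objective could then take the form

\[
\mathcal{L}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( \rho_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \right) \right],
\qquad
\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}.
\]

Under this reading, each full response at a round is one action in the MDP, and the text-based feedback influences learning through the next state $s_{t+1}$ rather than through the scalar reward.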
Submission Number: 6