Aligning LLMs Toward Multi-Turn Conversational Outcomes Using Iterative RLHF

Published: 02 Mar 2026 · Last Modified: 10 Apr 2026 · LLA 2026 Poster · CC BY 4.0
Keywords: multi-turn RL, RLHF, conversational agents, language models, chatbots, reinforcement learning
TL;DR: We reduce multi-turn RL to a sequence of single-turn RLHF problems using a learned Q-function as the reward, yielding Iterative GRPO.
Abstract: Training large language models (LLMs) as multi-turn conversational agents remains a significant challenge, particularly in goal-oriented settings. The difficulty stems from sparse, long-horizon objectives and the discrepancy between response-level planning and token-level generation. In this paper, we present a formal reduction of the multi-turn RL problem into a *sequence of single-turn RLHF-style problems.* This is achieved by setting a learned multi-turn Q-function as the reward model for the single-turn problem. We demonstrate and prove a key insight: solving this single-turn RLHF problem with a standard token-level approach (e.g., GRPO or PPO) is equivalent to an approximate policy improvement step within the multi-turn problem. This insight naturally leads to Iterative GRPO, a batch online approximate policy iteration algorithm that alternates between collecting a batch of data from the current policy, fitting Q-functions from these logged conversation trajectories, and improving the policy via single-turn RLHF. A major practical advantage is that Iterative GRPO directly leverages stable, off-the-shelf single-turn RLHF tools, making it straightforward to implement. Empirically, we demonstrate the effectiveness of Iterative GRPO on new multi-turn conversational environments inspired by sales-oriented agent-customer interactions.
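A minimal sketch of the outer loop described in the abstract, assuming hypothetical helpers `collect_conversations`, `fit_q_function`, and `run_single_turn_grpo` (these names and signatures are illustrative, not taken from the paper):

```python
from typing import Callable


def iterative_grpo(
    policy,                                 # current LLM policy (e.g., a model wrapper)
    collect_conversations: Callable,        # rolls out multi-turn dialogues with `policy`
    fit_q_function: Callable,               # fits a turn-level Q-function on logged trajectories
    run_single_turn_grpo: Callable,         # standard single-turn RLHF step (GRPO/PPO)
    num_iterations: int = 5,
    batch_size: int = 512,
):
    """Batch online approximate policy iteration, following the abstract's description.

    Each iteration: (1) collect a batch of multi-turn conversations from the
    current policy, (2) fit a multi-turn Q-function on the logged trajectories,
    (3) improve the policy with single-turn RLHF, using the learned Q-function
    as the reward model.
    """
    for _ in range(num_iterations):
        # 1. Roll out the current policy in the multi-turn environment.
        trajectories = collect_conversations(policy, n=batch_size)

        # 2. Estimate Q(context, response) from the sparse, long-horizon outcomes.
        q_function = fit_q_function(trajectories)

        # 3. Single-turn RLHF step: the learned Q-function plays the role of the
        #    reward model, so off-the-shelf GRPO/PPO tooling applies directly.
        policy = run_single_turn_grpo(policy, reward_model=q_function)

    return policy
```

This is a sketch of the control flow only; the paper's specific Q-fitting objective and GRPO configuration are not reproduced here.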
Submission Number: 209