Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Reward Design

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: multi-turn interaction, reinforcement learning, LLM Agent
Abstract: This paper investigates Reinforcement Learning (RL) approaches to enhance the reasoning capabilities of Large Language Model (LLM) agents in long-horizon, multi-turn scenarios. Such multi-turn agentic tasks can be naturally formalized as turn-level Markov Decision Processes (MDPs). However, most existing methods adopt MDP formulations with trajectory-level rewards: either terminal rewards that provide only a final outcome signal, or delayed rewards that merge intermediate and outcome signals into a single sparse feedback, leading to poor credit assignment. To address this limitation, we reformulate these tasks as MDPs with explicit turn-level rewards and provide theoretical analysis supporting the effectiveness of this design. Building on this formulation, we extend popular RL algorithms, GRPO and PPO, to their respective multi-turn variants, enabling fine-grained credit assignment. We conduct case studies on multi-turn reasoning-augmented search agents, where we carefully design two types of turn-level rewards: verifiable and LLM-as-judge. Our experiments on multi-turn search tasks demonstrate that our proposed formulation, incorporating well-designed turn-level rewards, enables RL algorithms to significantly outperform baseline methods with trajectory-level rewards. Both training and validation reward curves illustrate that our method achieves \textit{greater stability}, \textit{faster convergence}, and \textit{higher accuracy}. Numerical results across diverse question-answering datasets further show that our approach consistently delivers the highest answer correctness and 100\% format correctness.
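To make the credit-assignment contrast concrete, here is a minimal sketch (not the paper's implementation; function and parameter names such as `turn_rewards` and `gamma` are illustrative assumptions) of how per-turn returns differ between a trajectory-level terminal reward and explicit turn-level rewards:

```python
# Hedged sketch: trajectory-level vs. turn-level credit assignment
# in a turn-level MDP. All names here are illustrative assumptions,
# not the paper's actual code.

def trajectory_level_returns(num_turns, terminal_reward, gamma=1.0):
    # Every turn's return derives from the single terminal signal only;
    # intermediate turns receive no distinct feedback.
    return [terminal_reward * gamma ** (num_turns - 1 - t)
            for t in range(num_turns)]

def turn_level_returns(turn_rewards, gamma=1.0):
    # Each turn's return accumulates its own reward plus the discounted
    # sum of future turn rewards, giving fine-grained credit per turn.
    returns = []
    g = 0.0
    for r in reversed(turn_rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))
```

With a terminal-only signal, every turn in a trajectory is credited identically (up to discounting), whereas turn-level rewards let each turn's return reflect its own contribution.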
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16065