Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Credit Assignment

Published: 08 Jun 2025, Last Modified: 30 Jun 2025
WCUA 2025 Poster
License: CC BY 4.0
Submission Track: Paper Track (up to 8 pages)
Keywords: Reinforcement Learning, Tool-Use Agent, Credit Assignment
Abstract: This paper investigates approaches to enhance the reasoning capabilities of Large Language Model (LLM) agents using Reinforcement Learning (RL). Specifically, we focus on multi-turn tool-use scenarios, which can be naturally modeled as Markov Decision Processes (MDPs). While existing approaches often train multi-turn LLM agents with trajectory-level advantage estimation in bandit settings, they struggle with turn-level credit assignment across multiple decision steps, limiting their performance on multi-turn reasoning tasks. To address this, we introduce a \textit{fine-grained turn-level advantage estimation} strategy that enables more precise credit assignment in multi-turn agent interactions. The strategy is general and can be incorporated into various RL algorithms such as Group Relative Policy Optimization (GRPO). Our experimental evaluation on multi-turn reasoning and search-based tool-use tasks with GRPO implementations highlights the effectiveness of the MDP framework and of turn-level credit assignment in advancing the multi-turn reasoning capabilities of LLM agents in complex decision-making settings. Our method achieves 100\% success in tool execution and 50\% accuracy in exact answer matching, significantly outperforming baselines, which fail to invoke tools and achieve only 20–30\% exact match accuracy.
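The contrast the abstract draws can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes group-normalized advantages in the GRPO style, with the trajectory-level variant broadcasting one advantage per rollout and a hypothetical turn-level variant normalizing rewards separately at each turn index across the group (turn counts and the reward decomposition are assumptions for illustration).

```python
from statistics import mean, pstdev


def trajectory_level_advantages(returns):
    """Trajectory-level (bandit-style) baseline: one group-normalized
    advantage per rollout, applied uniformly to every turn of that rollout."""
    mu, sigma = mean(returns), pstdev(returns) or 1.0
    return [(r - mu) / sigma for r in returns]


def turn_level_advantages(group_turn_rewards):
    """Illustrative turn-level variant: normalize per-turn rewards across
    the group at each turn index, so each turn gets its own credit signal.

    group_turn_rewards: one list of per-turn rewards per rollout; all
    rollouts are assumed to have the same number of turns here.
    """
    n_turns = len(group_turn_rewards[0])
    advantages = [[0.0] * n_turns for _ in group_turn_rewards]
    for t in range(n_turns):
        turn_rewards = [traj[t] for traj in group_turn_rewards]
        mu, sigma = mean(turn_rewards), pstdev(turn_rewards) or 1.0
        for i, traj in enumerate(group_turn_rewards):
            advantages[i][t] = (traj[t] - mu) / sigma
    return advantages


# Two rollouts, two turns each: rollout 0 succeeds at turn 0, rollout 1 at turn 1.
# Turn-level estimation credits each rollout only where it actually did well,
# whereas the trajectory-level baseline cannot distinguish the two turns.
advs = turn_level_advantages([[1.0, 0.0], [0.0, 1.0]])
```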
Submission Number: 13