Token-level Advantage Policy Optimization from Negative Feedback in Multi-Turn Agents

Published: 19 Dec 2025, Last Modified: 05 Jan 2026 · AAMAS 2026 Full · CC BY 4.0
Keywords: Multi-turn agents, Token-level policy optimization, Entropy-guided learning
TL;DR: We propose TAPO, a token-level reinforcement learning optimization algorithm that enhances multi-turn agent reasoning.
Abstract: Training multi-turn agents for complex tasks is hindered by sparse rewards. Existing methods are inefficient: they either learn exclusively from successful trajectories, discarding valuable failure data, or require rigid win-loss pairs, limiting data utilization. We propose Token-level Advantage Policy Optimization (TAPO), a flexible, pair-free method that leverages all trajectories. TAPO translates a trajectory's terminal reward into token-level advantages, reinforcing the entire sequence of actions in successful trajectories while penalizing those in failed ones. Furthermore, TAPO concentrates updates on high-entropy tokens, which mark pivotal moments of model uncertainty and are thus crucial for efficient exploration and policy improvement. As a post-training optimization, TAPO raises a baseline SFT agent's average score from 74.2 to 89.4 (+20.5\% relative) on three challenging multi-turn benchmarks, outperforming RFT and DPO-style baselines and demonstrating consistent gains in both seen and unseen settings. The anonymous code is available at https://anonymous.4open.science/r/TAPO-in-AAMAS2025/.
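To make the abstract's mechanism concrete, below is a minimal sketch of the idea it describes: broadcasting a trajectory's terminal reward to every token as an advantage, then restricting the policy-gradient update to high-entropy tokens. The function name `tapo_token_loss`, the `entropy_quantile` threshold, and the {-1, +1} reward convention are illustrative assumptions, not the paper's exact formulation; see the linked repository for the authors' implementation.

```python
import torch
import torch.nn.functional as F

def tapo_token_loss(logits, actions, terminal_reward, entropy_quantile=0.8):
    """Hypothetical sketch of TAPO-style token-level advantages.

    logits:           (T, V) per-token logits from the policy for one trajectory
    actions:          (T,)   token ids actually generated
    terminal_reward:  scalar, e.g. +1 for a successful trajectory, -1 for a failed one
    entropy_quantile: only tokens above this entropy quantile receive updates
    """
    log_probs = F.log_softmax(logits, dim=-1)                                   # (T, V)
    token_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)   # (T,)

    # Broadcast the trajectory-level terminal reward to every token as its advantage.
    advantages = torch.full_like(token_log_probs, float(terminal_reward))

    # Per-token entropy of the policy distribution; high entropy marks uncertain,
    # "pivotal" tokens that are selected for the update.
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(-1)                                      # (T,)
    threshold = torch.quantile(entropy, entropy_quantile)
    mask = (entropy >= threshold).float()

    # Policy-gradient-style loss: reinforce tokens of successful trajectories,
    # penalize tokens of failed ones, restricted to high-entropy positions.
    loss = -(advantages * token_log_probs * mask).sum() / mask.sum().clamp(min=1.0)
    return loss
```

Because the advantage is derived from the terminal reward alone, the sketch needs no paired preference data: any trajectory, successful or failed, contributes a gradient signal.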
Area: Generative and Agentic AI (GAAI)
Generative AI: I acknowledge that I have read and will follow this policy.
Submission Number: 715