Keywords: LLM; RL; Post-training; multi-turn conversation; Reasoning LLM; LRM
Abstract: Large Reasoner Models (LRMs) excel at single-turn reasoning but often degrade in multi-turn settings due to insufficient alignment at the dialogue level. We propose Turn-level Trajectory-Clipping with Back-Propagation Optimization (TTPO), a critic-free Reinforcement Learning from Verifiable Rewards (RLVR) algorithm that extends GRPO to robust multi-turn reasoning. TTPO introduces three components: (i) a turn-level policy ratio with PPO-style clipping, treating each turn as a unified action; (ii) trajectory clipping, which prunes low-reward branches to mitigate exponential forking; and (iii) reward back-propagation, which propagates discounted terminal rewards to earlier turns for stable credit assignment. Experiments across six representative multi-turn tasks—Code, Database, Math, Actions, Data-to-Text, and Summarization—show that TTPO substantially improves mean performance while sharply reducing run-to-run volatility (U90–10) without sacrificing high-percentile quality (A90). Ablations confirm contributions from all three components, with trajectory clipping and reward back-propagation yielding the largest reliability gains. These results demonstrate that turn-level alignment offers a simple and general recipe for robust long-horizon dialogue reasoning.
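For illustration only, the sketch below mirrors the three components named in the abstract: turn-level ratios with PPO-style clipping, trajectory clipping over a sampled group, and discounted reward back-propagation. All function names, hyperparameters (`gamma`, `keep_ratio`, `clip_eps`), and the toy group setup are assumptions for exposition, not the paper's implementation.

```python
# Hypothetical sketch of the three TTPO components described in the abstract.
# Names and hyperparameters are illustrative assumptions, not the authors' code.
import torch

def backprop_rewards(terminal_reward: float, num_turns: int, gamma: float = 0.9) -> torch.Tensor:
    """Reward back-propagation: assign the terminal reward to the last turn and
    discounted copies (gamma^(T-1-t) * R) to earlier turns."""
    exponents = torch.arange(num_turns - 1, -1, -1, dtype=torch.float32)
    return terminal_reward * gamma ** exponents

def clip_trajectories(traj_rewards: torch.Tensor, keep_ratio: float = 0.75) -> torch.Tensor:
    """Trajectory clipping: keep only the top fraction of sampled trajectories
    by terminal reward, pruning low-reward branches."""
    k = max(1, int(keep_ratio * traj_rewards.numel()))
    return torch.topk(traj_rewards, k).indices

def turn_level_ppo_loss(new_turn_logps: torch.Tensor,
                        old_turn_logps: torch.Tensor,
                        turn_advantages: torch.Tensor,
                        clip_eps: float = 0.2) -> torch.Tensor:
    """Turn-level ratio with PPO-style clipping: each turn's tokens are treated
    as one action, so the ratio is formed from per-turn summed log-probs."""
    ratio = torch.exp(new_turn_logps - old_turn_logps)               # one ratio per turn
    unclipped = ratio * turn_advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * turn_advantages
    return -torch.minimum(unclipped, clipped).mean()                 # maximize clipped surrogate

# Toy usage: a GRPO-style group of 4 sampled trajectories, 3 turns each.
group_rewards = torch.tensor([1.0, 0.2, 0.9, 0.0])
kept = clip_trajectories(group_rewards)                              # prune low-reward branches
per_turn_rewards = torch.stack([backprop_rewards(r.item(), 3) for r in group_rewards[kept]])
advantages = (per_turn_rewards - per_turn_rewards.mean()) / (per_turn_rewards.std() + 1e-8)
new_lp = torch.randn(kept.numel(), 3)                                # stand-in per-turn log-prob sums
loss = turn_level_ppo_loss(new_lp, new_lp.detach() - 0.05, advantages)
print(float(loss))
```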
Primary Area: generative models
Submission Number: 18403