Understanding Reasoning Collapse in Multi-Turn Agent Reinforcement Learning
Keywords: multi-turn reinforcement learning, LLM agents, PPO, GRPO, reasoning collapse, mutual information, conditional entropy, information-theoretic diagnostics, reward-variance filtering
TL;DR: We model reasoning collapse in LLM agent RL with a two-axis view of diversity, propose scalable estimators, and use reward-variance prompt-group SNR filtering to prevent prompt-agnostic template drift.
Abstract: In closed-loop multi-turn agent reinforcement learning, LLM agents exhibit reasoning collapse, where reasoning shifts toward generic templates that are weakly coupled to the inputs. We first show that such collapse is easy to miss with entropy or surface-diversity metrics, since the reasoning text still varies but becomes input-agnostic. We then propose an information-theoretic decomposition of the variation of a reasoning variable $Z$ into conditional entropy $H(Z\mid X)$ (randomness given the same input) and mutual information (MI) $I(X;Z)$ (input dependence). Template collapse occurs when $H(Z\mid X)$ stays high while $I(X;Z)$ drops, yielding diverse-looking but generic reasoning. To make $I(X;Z)$ a reproducible, sanity-checkable diagnostic, we introduce an MI-style retrieval protocol that treats each reasoning trace $Z$ as a query to retrieve its source $X$ from a minibatch; retrieval accuracy degrades toward chance under collapse. We further provide a signal-to-noise-ratio explanation for why $I(X;Z)$ drops: when within-input reward variance $\mathrm{Var}(R\mid X)$ is low, task gradients weaken and input-agnostic regularizers (KL, entropy) dominate, flattening cross-input differences. Finally, we propose reward-variance-aware filtering to prioritize high-signal updates. Across multi-turn environments, model scales, and modalities (including VLMs), this filtering improves input dependence, stability, and performance while remaining competitive with state-of-the-art stabilization baselines.
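The retrieval diagnostic described in the abstract can be sketched in a few lines. This is a toy illustration, not the paper's implementation: it stands in a real text encoder with a hashed bag-of-words embedding, and the `embed`, `retrieval_accuracy` names and the example inputs/traces are hypothetical. The point it demonstrates is the protocol itself: each reasoning trace is used as a query against the minibatch of inputs, and accuracy falls toward chance (1/batch size) when traces become input-agnostic templates.

```python
import hashlib
import numpy as np

def embed(texts, dim=4096):
    """Hashed bag-of-words embeddings (toy stand-in for a real encoder)."""
    out = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            j = int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim
            out[i, j] += 1.0
    # L2-normalize so dot products act as cosine similarities.
    return out / (np.linalg.norm(out, axis=1, keepdims=True) + 1e-8)

def retrieval_accuracy(inputs, traces):
    """Fraction of traces whose nearest input is their own source X.
    Values near 1/len(inputs) (chance) signal input-agnostic reasoning."""
    sims = embed(traces) @ embed(inputs).T   # (n_traces, n_inputs)
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(inputs))))

# Grounded traces share content with their inputs; collapsed ones do not.
inputs = ["add 3 and 4", "sort the list 9 1 5", "reverse the string cab"]
grounded = ["compute 3 plus 4 then answer",
            "compare 9 1 5 and sort ascending",
            "walk cab backwards"]
collapsed = ["let me think step by step"] * 3

print(retrieval_accuracy(inputs, grounded))   # high: traces identify inputs
print(retrieval_accuracy(inputs, collapsed))  # near chance: template drift
```

In practice one would replace the bag-of-words embedding with the agent's own representations or a pretrained encoder; the minibatch-retrieval structure, which is what makes the diagnostic reproducible, is unchanged.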
Submission Number: 74