Understanding Reasoning Collapse in LLM Agent Reinforcement Learning
Keywords: multi-turn reinforcement learning, LLM agents, PPO, GRPO, reasoning collapse, mutual information, conditional entropy, information-theoretic diagnostics, reward-variance filtering
Abstract: RL training of multi-turn LLM agents is inherently unstable, and reasoning quality directly determines task performance. Entropy is widely used to track reasoning stability, but entropy only measures within-input diversity: it cannot tell whether reasoning actually responds to different inputs. We find that even with stable entropy, models can come to rely on fixed templates that look diverse but are input-agnostic. We call this failure mode **template collapse**; it is invisible to entropy and to existing reasoning-quality metrics.
To diagnose this failure, we decompose reasoning quality into within-input diversity (entropy) and cross-input distinguishability (mutual information, MI), and introduce a family of MI proxies for online diagnosis. Across diverse tasks, MI correlates with final performance much more strongly than entropy, making it a more reliable proxy for reasoning quality.
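One way to make this decomposition concrete is the standard information-theoretic identity below (our notation, with $X$ the input and $R$ the sampled reasoning trace; it need not match the paper's formal setup):

$$
H(R) \;=\; I(X;R) \;+\; H(R \mid X).
$$

The marginal entropy of reasoning traces splits into an input-dependent part $I(X;R)$ (cross-input distinguishability) and an input-independent part $H(R \mid X)$ (within-input diversity). A model can hold the conditional entropy stable while $I(X;R)$ drops toward zero, which is precisely template collapse: traces still look varied, yet carry no information about the input.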
We further explain template collapse with a *signal-to-noise ratio* (SNR) mechanism: low reward variance weakens task gradients, letting regularization terms dominate and erase cross-input reasoning differences.
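As a minimal illustration of why low reward variance starves the task gradient, consider the group-normalized advantage used in GRPO (a standard formula we cite for intuition; the paper's exact SNR formulation may differ):

$$
\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G) + \epsilon}.
$$

When all $G$ rollouts for a prompt earn nearly the same reward, every $\hat{A}_i$ is close to zero, the policy-gradient term contributes almost nothing, and whatever KL or entropy regularizer is in the objective dominates the update.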
To address this, we propose **SNR-Adaptive Filtering**, which selects high-signal prompts per iteration using reward variance as a lightweight proxy. Across planning, math reasoning, web navigation, and code execution, the method consistently improves both input dependence and task performance.
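A minimal sketch of reward-variance prompt filtering, assuming a hypothetical `sample_rewards(prompt, k)` callback that returns `k` rollout rewards for a prompt; the function name and top-fraction heuristic are ours for illustration, not the paper's implementation:

```python
import numpy as np

def filter_prompts_by_reward_variance(prompts, sample_rewards, k=8, keep_frac=0.5):
    """Keep the prompts whose k-rollout reward variance is highest.

    Low-variance prompts yield near-zero group-normalized advantages,
    so they contribute little task gradient; dropping them raises the
    average signal-to-noise ratio of each training batch.
    """
    variances = []
    for prompt in prompts:
        rewards = np.asarray(sample_rewards(prompt, k))  # k rollout rewards for this prompt
        variances.append(rewards.var())                  # near-zero variance => near-zero advantages
    order = np.argsort(variances)[::-1]                  # highest-variance prompts first
    n_keep = max(1, int(len(prompts) * keep_frac))
    return [prompts[i] for i in order[:n_keep]]
```

Because a prompt's reward variance shifts as the policy improves, the filter would be re-applied each iteration with fresh rollouts, matching the per-iteration selection described above.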
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 67