Info-GRPO: Training Reasoning Models via Correlation-Aware Exploration

ICLR 2026 Conference Submission 16388 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Reasoning model, Group Relative Policy Optimization, Mutual Information
Abstract: Recent studies have revealed policy collapse in advanced reasoning models trained with Group Relative Policy Optimization (GRPO), and entropy regularization has stood out as an elegant approach to promoting exploration. Yet, within the vast token space of language models, entropy gradients often exhibit severe singularities, creating a direct conflict with the natural entropy decay required for convergence and thereby disturbing optimization dynamics. To resolve this tension, we present Info-GRPO, an information-theoretic framework that reconciles the opposing entropic forces of exploration and convergence by cultivating correlation between the policy and a latent prior. Info-GRPO leverages a contrastive regularizer that maximizes the mutual information between latent variables and the policy. Intuitively, by augmenting prompts with latent variables, the model explores a more diverse set of policies that remain correlated with the latent prior, guiding the conditional entropy toward convergence. Through this correlation-aware design, Info-GRPO respects the natural entropy reduction during training while enabling more effective exploration. Extensive experiments demonstrate that Info-GRPO significantly outperforms vanilla GRPO and entropy-regularized GRPO across diverse reasoning benchmarks. For instance, on the AIME24 benchmark it improves Avg@8 over GRPO by 3.75%, 1.66%, and 4.16% with Qwen2.5-Math-7B, Qwen2.5-7B, and DeepSeek-R1-Distill-Qwen-7B as base models, respectively. Furthermore, analysis reveals that Info-GRPO induces distinct and interpretable reasoning patterns conditioned on the latent variable, showcasing a more systematic and effective exploration strategy.
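The abstract describes the mechanism only at a high level, and the paper's exact objective is not reproduced here. As a rough illustration, the PyTorch sketch below combines a GRPO-style group-normalized policy-gradient term with an InfoNCE lower bound on the mutual information I(z; y) between a latent variable z (appended to the prompt) and the sampled response y, standing in for the contrastive regularizer. All names (info_grpo_loss, mi_scores, beta) are hypothetical, and GRPO's clipped importance ratio is omitted for brevity; this is a sketch of the idea under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: normalize rewards within each group.

    rewards: (num_groups, group_size) scalar rewards for sampled responses.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True).clamp_min(1e-6)
    return (rewards - mean) / std


def infonce_mi_bound(scores: torch.Tensor) -> torch.Tensor:
    """InfoNCE lower bound on I(z; y), returned as a scalar to maximize.

    scores: (n, n) matrix where scores[i, j] = s(response_i, latent_j),
    arranged so the matched latent for response i sits on the diagonal.
    """
    targets = torch.arange(scores.size(0), device=scores.device)
    return -F.cross_entropy(scores, targets)


def info_grpo_loss(seq_logprobs: torch.Tensor,
                   rewards: torch.Tensor,
                   mi_scores: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
    """Hypothetical combined objective: GRPO policy-gradient term plus a
    contrastive MI regularizer in place of an entropy bonus.

    seq_logprobs: (num_groups, group_size) summed token log-probs per response.
    """
    adv = grpo_advantages(rewards).detach()
    pg_loss = -(adv * seq_logprobs).mean()
    # Subtracting the bound means minimizing the loss maximizes I(z; y).
    return pg_loss - beta * infonce_mi_bound(mi_scores)


# Toy usage with random tensors (4 groups of 8 responses, 32 latent pairings);
# in practice mi_scores would come from a learned latent scorer over responses.
logp = torch.randn(4, 8, requires_grad=True)
rewards = torch.rand(4, 8)
mi_scores = torch.randn(32, 32, requires_grad=True)
loss = info_grpo_loss(logp, rewards, mi_scores)
loss.backward()
```

The key design choice suggested by the abstract is that the contrastive term replaces a raw entropy bonus: rather than pushing entropy up everywhere (and fighting the entropy decay needed for convergence), it only rewards diversity that stays predictable from the latent prior.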
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 16388