Keywords: Language Models, Mutual Information, Reasoning, Response Diversity, RLVR
TL;DR: We introduce a reinforcement learning approach that uses mutual information rewards to encourage response diversity in LLMs, improving multi-attempt success (pass@k, consensus@k) by training models to produce diverse yet accurate reasoning strategies.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of large language models (LLMs) on mathematics and programming tasks, often by maximizing pass@1 correctness. However, optimizing single-attempt accuracy can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We introduce UpSkill, a training-time method that adapts *Mutual Information Skill Learning* (MISL) to LLMs to induce *structured response diversity*: a discrete latent $z$ selects a reproducible "strategy" that steers the token distribution toward distinct modes. We propose a novel reward that we implement within Group Relative Policy Optimization (GRPO): a *token-level* mutual information (MI) reward that encourages trajectory specificity to $z$. Experiments on GSM8K with three open-weight models (Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled Qwen2.5-Math-1.5B) show that UpSkill improves multi-attempt metrics for Qwen 2.5-7B, yielding gains of $\sim$4\% in pass@k and $\sim$4\% in plurality@k without degrading pass@1. Additionally, we prove that improvements in pass@k are closely tied to the mutual information objective, providing a theoretical justification for UpSkill.
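The abstract's core mechanism can be illustrated with a minimal sketch. This is *not* the authors' implementation: it assumes the standard MISL-style reward $r_t = \log q(z \mid \tau_{\le t}) - \log p(z)$ (a learned discriminator's per-prefix log-probability of the conditioning skill $z$ minus the log-prior) and the usual GRPO group-relative advantage; the function names and the toy discriminator log-probabilities are hypothetical.

```python
import math

def mi_token_rewards(disc_logprobs, log_prior):
    # Token-level MI reward sketch: r_t = log q(z | tau_<=t) - log p(z).
    # Positive when the discriminator can already identify the skill z
    # from the trajectory prefix, i.e., the rollout is z-specific.
    return [lp - log_prior for lp in disc_logprobs]

def grpo_advantages(returns):
    # GRPO-style group-relative advantage: standardize each rollout's
    # return against the mean and std of its sampled group.
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / len(returns)
    std = math.sqrt(var) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in returns]

# Toy group of 4 rollouts under K = 4 skills, uniform prior p(z) = 1/4.
# Each inner list holds the (hypothetical) discriminator log-probs of the
# true z after each of two generated tokens.
log_prior = math.log(1 / 4)
group = [
    [math.log(0.9), math.log(0.8)],   # confident discriminator -> high MI
    [math.log(0.25), math.log(0.3)],  # near chance level -> ~zero MI
    [math.log(0.5), math.log(0.6)],
    [math.log(0.1), math.log(0.2)],   # worse than chance -> negative MI
]
returns = [sum(mi_token_rewards(lps, log_prior)) for lps in group]
advantages = grpo_advantages(returns)
```

In practice this MI term would be combined with the verifiable correctness reward, so that skills are pushed apart only among trajectories that remain accurate.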
Archival Option: The authors of this submission do *not* want it to appear in the archival proceedings.
Submission Number: 152