Keywords: Language Models, Mutual Information, Reasoning, Response Diversity, RLVR
TL;DR: We introduce a reinforcement learning approach that uses mutual information rewards to encourage response diversity in LLMs, improving multi-attempt success (pass@k, consensus@k) by training models to produce diverse yet accurate reasoning strategies.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of large language models (LLMs) on mathematics and programming tasks, often by maximizing pass@1 correctness. However, optimizing single-attempt accuracy can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We introduce UpSkill, a training-time method that adapts *Mutual Information Skill Learning* (MISL) to LLMs to induce *structured response diversity*: a discrete latent $z$ selects a reproducible "strategy" that steers the token distribution toward distinct modes. We propose a novel reward that we implement within Group Relative Policy Optimization (GRPO): a *token-level* mutual information (MI) reward that encourages trajectory specificity to $z$. Experiments on GSM8K with three open-weight models (Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled Qwen2.5-Math-1.5B) show that UpSkill improves multi-attempt metrics for Qwen 2.5-7B, yielding gains of $\sim$4\% in pass@k and $\sim$4\% in plurality@k without degrading pass@1. Additionally, we prove that improvements in pass@k are closely tied to the mutual information objective, providing a theoretical justification for UpSkill.
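The abstract's core mechanism can be illustrated with a minimal sketch. This is *not* the authors' implementation: it assumes the standard MISL-style reward $r_t = \log q(z \mid \tau_{\le t}) - \log p(z)$ (a learned discriminator's per-prefix log-probability of the conditioning skill $z$ minus the log-prior) and the usual GRPO group-relative advantage; the function names and the toy discriminator log-probabilities are hypothetical.

```python
import math

def mi_token_rewards(disc_logprobs, log_prior):
    # Token-level MI reward sketch: r_t = log q(z | tau_<=t) - log p(z).
    # Positive when the discriminator can already identify the skill z
    # from the trajectory prefix, i.e., the rollout is z-specific.
    return [lp - log_prior for lp in disc_logprobs]

def grpo_advantages(returns):
    # GRPO-style group-relative advantage: standardize each rollout's
    # return against the mean and std of its sampled group.
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / len(returns)
    std = math.sqrt(var) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in returns]

# Toy group of 4 rollouts under K = 4 skills, uniform prior p(z) = 1/4.
# Each inner list holds the (hypothetical) discriminator log-probs of the
# true z after each of two generated tokens.
log_prior = math.log(1 / 4)
group = [
    [math.log(0.9), math.log(0.8)],   # confident discriminator -> high MI
    [math.log(0.25), math.log(0.3)],  # near chance level -> ~zero MI
    [math.log(0.5), math.log(0.6)],
    [math.log(0.1), math.log(0.2)],   # worse than chance -> negative MI
]
returns = [sum(mi_token_rewards(lps, log_prior)) for lps in group]
advantages = grpo_advantages(returns)
```

In practice this MI term would be combined with the verifiable correctness reward, so that skills are pushed apart only among trajectories that remain accurate.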
Archival Option: The authors of this submission do *not* want it to appear in the archival proceedings.
Submission Number: 152