UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs

ICLR 2026 Conference Submission 15349 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Language Models, Mutual Information, Reasoning, Response Diversity, RLVR
TL;DR: We introduce a reinforcement learning approach that uses mutual information rewards to encourage response diversity in LLMs, improving multi-attempt success (pass@k, consensus@k) by training models to produce diverse yet accurate reasoning strategies.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of large language models (LLMs) on mathematics and programming tasks, often by maximizing *pass@1* correctness. However, optimizing single-attempt accuracy can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We introduce UpSkill, a training-time method that adapts *Mutual Information Skill Learning* (MISL) to LLMs to induce *structured response diversity*: a discrete latent $z$ selects a reproducible ``strategy'' that steers the token distribution toward distinct modes. We implement this within Group Relative Policy Optimization (GRPO) via a novel *token-level* mutual information (MI) reward that encourages trajectory specificity to $z$. Experiments on GSM8K with three open-weight models (Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled Qwen2.5-Math-1.5B) show that UpSkill improves multi-attempt metrics, yielding median gains of $\sim$4\% in pass@k and $\sim$7\% in consensus@k without degrading pass@1. Additionally, we show that mutual information is closely tied to pass@k, providing a theoretical justification for UpSkill.
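The abstract does not spell out the exact form of the token-level MI reward. A minimal illustrative sketch, assuming a DIAYN-style skill discriminator $q_\phi(z \mid \text{prefix})$ and a uniform prior over a discrete skill set, is shown below; the function names, discriminator, and reward form here are assumptions for illustration, not the paper's implementation.

```python
# Sketch of a token-level mutual-information reward (assumed DIAYN-style form):
# r_t = log q(z | prefix_t) - log p(z), added to the verifiable correctness
# reward before GRPO's group-relative advantage normalization.
import torch
import torch.nn.functional as F


def token_level_mi_reward(skill_logits: torch.Tensor,
                          skill_id: int,
                          num_skills: int) -> torch.Tensor:
    """Per-token MI reward for one sampled response.

    skill_logits: [T, num_skills] discriminator logits, one row per token
                  prefix of the response (hypothetical discriminator output).
    skill_id:     index of the latent skill z that conditioned the rollout.
    num_skills:   size of the discrete skill space (uniform prior assumed).
    """
    log_q = F.log_softmax(skill_logits, dim=-1)[:, skill_id]   # log q(z | prefix_t)
    log_p = -torch.log(torch.tensor(float(num_skills)))        # log p(z) under a uniform prior
    return log_q - log_p                                       # [T] token-level rewards


if __name__ == "__main__":
    torch.manual_seed(0)
    T, K = 5, 4                          # 5 tokens, 4 discrete skills
    fake_logits = torch.randn(T, K)      # stand-in for discriminator outputs
    print(token_level_mi_reward(fake_logits, skill_id=2, num_skills=K))
```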
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15349