Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity

Published: 26 Jan 2026 · Last Modified: 02 Mar 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Large Language Models, LLMs, Post-Training, Reasoning, theorem proving, Lean, f-divergences, Amari $\alpha$-divergences, Distributional Matching, diversity
TL;DR: We propose using a family of divergences that spans mode-seeking to mass-covering behavior to balance precision and diversity when training LLMs for reasoning tasks.
Abstract: Reinforcement Learning (RL) has become the _de facto_ standard for tuning LLMs to solve tasks involving reasoning. However, growing evidence shows that models trained in this way often suffer from a significant loss in diversity. We argue that this arises because RL implicitly optimizes the "mode-seeking" or "zero-forcing" _Reverse KL_ divergence to a target distribution, causing the model to concentrate mass on certain high-probability regions of the target while neglecting others. In this work, we instead begin from an explicit target distribution, obtained by filtering out incorrect answers while preserving the relative probabilities of correct ones. Starting from a pre-trained LLM, we approximate this target distribution using the $\alpha$-divergence family, which unifies prior approaches and enables direct control of the precision–diversity trade-off by interpolating between mode-seeking and mass-covering divergences. On a Lean theorem-proving benchmark, our method achieves state-of-the-art performance along the coverage–precision Pareto frontier, outperforming all prior methods on the coverage axis.
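To make the abstract's objective concrete, the sketch below (not the authors' code) shows one way the Amari $\alpha$-divergence $D_\alpha(p \,\|\, q) = \frac{1 - \mathbb{E}_q[(p/q)^\alpha]}{\alpha(1-\alpha)}$ could be turned into a training loss, assuming the filtered target $p(x) \propto \pi_{\text{ref}}(x)\,\mathbb{1}[x \text{ correct}]$ described in the abstract. A score-function gradient of $D_\alpha$ reduces to $-\frac{1}{\alpha}\,\mathbb{E}_{x \sim q_\theta}\big[(p/q_\theta)^\alpha \nabla_\theta \log q_\theta(x)\big]$, i.e., a weighted policy gradient. All function and argument names here are hypothetical.

```python
# Minimal sketch (under the assumptions stated above) of an alpha-divergence
# surrogate loss for matching a filtered target distribution. Incorrect
# samples receive weight 0 because the filtered target places no mass on
# them; the unknown normalizer of p only rescales the loss, not the
# direction of its gradient.
import torch

def alpha_divergence_loss(
    logp_model: torch.Tensor,  # log q_theta(x) for completions x ~ q_theta
    logp_ref: torch.Tensor,    # log pi_ref(x) under the frozen pre-trained LLM
    correct: torch.Tensor,     # bool mask: True if the completion is verified correct
    alpha: float,              # alpha -> 0: mode-seeking (reverse-KL-like);
                               # alpha -> 1: mass-covering (forward-KL-like)
) -> torch.Tensor:
    # Unnormalized importance weight (p(x)/q_theta(x))^alpha, computed in
    # log space and clamped for numerical stability.
    log_w = (alpha * (logp_ref - logp_model)).clamp(max=20.0)
    w = correct.float() * torch.exp(log_w).detach()  # stop-gradient on weights
    # REINFORCE-style surrogate whose gradient matches
    # -(1/alpha) * E_q[ w * grad log q_theta ].
    return -(w * logp_model).mean() / alpha
```

Under this reading, sweeping $\alpha$ trades off precision (small $\alpha$ concentrates mass on the highest-probability correct answers) against coverage ($\alpha$ near 1 spreads mass over all correct modes), which is how the abstract's precision–diversity knob would operate in practice.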
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9791