Exploring the Trade-off between Quality and Diversity of Language Models during Reinforcement Learning

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Language Models, Reinforcement Learning, Diversity, Entropy
Abstract: Reinforcement learning (RL) has become the dominant approach for post-training autoregressive language models, but a recurring challenge is that improvements in *quality* often come at the expense of *diversity*, which is a practical concern in exploratory domains such as scientific discovery. Although this trade-off is widely acknowledged, it has lacked a quantitative characterization. In this work, we systematically investigate the quality-diversity dynamics of RL finetuning of language models, primarily on molecular generation, a domain where diversity is both essential for discovery and quantitatively measurable. Across RL checkpoints, we observe that mean quality ($\mathcal{R}$) and diversity ($\mathcal{D}$) trace a smooth trajectory captured by a robust exponential law, $\mathcal{R}=-a\cdot\exp(c\cdot\mathcal{D})+b$, independent of step indexing. Extending prior work on quality-entropy trade-offs, we further show that quality also follows an exponential relation with sampling entropy ($\mathcal{H}$), $\mathcal{R}=-a_0\cdot\exp(c_0\cdot\mathcal{H})+b_0$, with $c_0$ quantifying exploratory progress. An approximately linear link between entropy and diversity explains why the two laws compose, and an information-theoretic illustration clarifies the role of the exponential form. We also conduct ablations on influencing factors, including model scaling, reward shaping, and training setup; validate these findings across multiple generation objectives; and extend the experiments to textual exploratory creativity tasks with large language models. Finally, we demonstrate how the fitted laws provide actionable guidance for RL finetuning of language models on exploratory tasks. Overall, our study moves beyond qualitative accounts of diversity collapse, offering a compact quantitative model, an underlying entropy-based mechanism, and practical tools for exploratory RL with language models.
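The abstract's quality-diversity law can be sketched directly from its stated form. Below is a minimal illustration of $\mathcal{R}=-a\cdot\exp(c\cdot\mathcal{D})+b$; the coefficient values are hypothetical placeholders chosen for illustration, not fitted values from the paper.

```python
import math

def quality_from_diversity(D, a, b, c):
    """Exponential quality-diversity law stated in the abstract:
    R = -a * exp(c * D) + b, with a, c > 0 and b an upper quality bound.
    As diversity D grows, attainable mean quality R falls off exponentially."""
    return -a * math.exp(c * D) + b

# Hypothetical coefficients for illustration only (not values from the paper).
a, b, c = 0.05, 1.0, 3.0

# Trace quality along a diversity sweep: R is monotonically decreasing in D.
for D in (0.0, 0.25, 0.5, 0.75, 1.0):
    R = quality_from_diversity(D, a, b, c)
    print(f"D={D:.2f} -> R={R:.3f}")
```

At $\mathcal{D}=0$ the law gives $\mathcal{R}=b-a$, and the exponential term makes the quality cost of additional diversity compound rather than accrue linearly, which is what distinguishes this fit from a simple linear trade-off.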
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 9169