Distributional Reinforcement Learning for Large Language Models

Published: 20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Large Language Models, Distributional Reinforcement Learning, Actor–Critic Methods
Abstract: Actor–critic reinforcement learning for large language models (LLMs) typically relies on a scalar value function, discarding crucial information about the range of potential returns. We propose a distributional actor–critic framework that learns the full return distribution to guide exploration more effectively. We find that in deterministic reasoning tasks, the spread of this learned distribution directly measures the model's confidence in its own value estimates. Our method harnesses this signal through an optimistic exploration bonus derived from the distribution's upper-tail variance, steering the policy toward promising yet uncertain reasoning paths. This uncertainty-guided exploration promotes the discovery of diverse correct solutions, yielding substantial gains in pass@k across challenging benchmarks. These results demonstrate markedly more effective exploration than strong baselines, complemented by consistent, albeit more modest, improvements in single-answer correctness.
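To make the core idea concrete, here is a minimal sketch of an upper-tail-variance exploration bonus computed from a quantile-based return distribution. The function name, the tail threshold `tau`, and the scaling `beta` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def upper_tail_bonus(quantiles, tau=0.75, beta=1.0):
    """Illustrative optimistic bonus: variance of the upper tail of a
    learned return distribution (hypothetical names/parameters).

    quantiles: iterable of quantile estimates of the return for a state.
    tau: fraction of the distribution below the "upper tail".
    beta: bonus scale added to the policy's advantage signal.
    """
    q = np.sort(np.asarray(quantiles, dtype=float))  # sorted return quantiles
    k = int(tau * len(q))                            # index where the upper tail begins
    tail = q[k:]                                     # quantiles above the tau-th level
    return beta * tail.var()                         # spread of the tail = optimism signal

# A confident (tight) distribution earns no bonus; a wide upper tail
# marks a promising-but-uncertain path worth exploring.
print(upper_tail_bonus([1.0] * 8))               # → 0.0
print(upper_tail_bonus([0, 0, 0, 0, 0, 1, 2, 3]))  # → 0.25
```

In a full actor–critic loop, such a bonus would be added to the critic's advantage estimate before the policy update, so tokens leading toward high-variance upper tails are sampled more often.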
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23390