Maximum Likelihood Reinforcement Learning

Published: 03 Mar 2026, Last Modified: 16 Apr 2026 · CC BY 4.0
Keywords: Maximum Likelihood, Reinforcement Learning, Large Language Models, Reasoning, Diversity
TL;DR: A framework for maximum likelihood optimization using reinforcement learning.
Abstract: Reinforcement learning is the method of choice for training models in *sampling-based* setups with binary outcome feedback, such as navigation, code generation, and mathematical problem solving. In such settings, models implicitly induce a likelihood over correct rollouts. However, we observe that reinforcement learning does not maximize this likelihood and instead optimizes only a lower-order approximation of it. Inspired by this observation, we introduce **Maximum Likelihood Reinforcement Learning (MaxRL)**, a sampling-based framework for approximating maximum likelihood using reinforcement learning techniques. MaxRL addresses the challenge of non-differentiable sampling by defining a compute-indexed family of sample-based objectives that interpolates between standard reinforcement learning and exact maximum likelihood as additional sampling compute is allocated. The resulting objectives admit a simple, unbiased policy-gradient estimator and converge to maximum likelihood optimization in the infinite-compute limit. Empirically, MaxRL Pareto-dominates existing methods across all models and tasks we tested, achieving up to $\mathbf{20\times}$ gains in test-time scaling efficiency over its GRPO-trained counterpart. We also observe that MaxRL scales better with additional data and compute. Our results suggest that MaxRL is a promising framework for scaling RL training in correctness-based settings.
Submission Number: 118
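
The abstract leaves the exact objective family unspecified. As a purely illustrative sketch (not the paper's construction), one family with the stated properties comes from truncating the series $\log p = -\sum_{j\ge 1}(1-p)^j/j$ for a per-prompt success probability $p$: keeping $k$ terms yields an objective whose gradient matches that of the standard RL objective $\mathbb{E}[R] = p$ at $k=1$, converges to $\log p$ as $k \to \infty$, and admits an unbiased estimate from $k$ sampled binary rewards, since $\mathbb{E}\big[\prod_{i\le j}(1-r_i)\big] = (1-p)^j$. The function names and the estimator below are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

def truncated_log_likelihood(p: float, k: int) -> float:
    """Truncated Mercator series  -sum_{j=1}^{k} (1-p)^j / j.

    At k = 1 this is p - 1, whose gradient in p equals that of the
    standard RL objective E[R] = p; as k -> infinity it converges
    to log p (assumed form, for illustration only).
    """
    j = np.arange(1, k + 1)
    return float(-np.sum((1.0 - p) ** j / j))

def mc_estimate(rollout_rewards: np.ndarray) -> float:
    """Unbiased Monte Carlo estimate of the truncated objective from
    k binary rewards r_1..r_k ~ Bernoulli(p): because
    E[prod_{i<=j} (1 - r_i)] = (1 - p)^j, the statistic
    -sum_{j=1}^{k} (1/j) * prod_{i<=j} (1 - r_i) is unbiased."""
    failures = np.cumprod(1 - rollout_rewards)  # prod_{i<=j} (1 - r_i)
    j = np.arange(1, len(rollout_rewards) + 1)
    return float(-np.sum(failures / j))

rng = np.random.default_rng(0)
p = 0.3  # per-prompt probability of a correct rollout
for k in (1, 4, 16, 256):
    exact = truncated_log_likelihood(p, k)
    mc = np.mean([mc_estimate(rng.binomial(1, p, size=k)) for _ in range(20_000)])
    print(f"k={k:4d}  truncated={exact:+.4f}  MC~{mc:+.4f}  log p={np.log(p):+.4f}")
```

Running this sketch shows the Monte Carlo estimates matching the truncated values (illustrating unbiasedness) and both approaching $\log p$ as $k$ grows, i.e., more sampling compute moves the objective from standard RL toward exact maximum likelihood.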