TL;DR: A one-line change in REINFORCE bridges the gap between RL and maximum likelihood, and makes AI math reasoning up to 20× more efficient
Abstract: Reinforcement learning (RL) is the method of choice for training models in setups where the objective function can only be evaluated by sampling from the model. Our key observation is that when the feedback is terminal and binary, models implicitly induce a likelihood over correct rollouts. Maximum likelihood would be the natural framework in such settings, but RL is used instead as a workaround for non-differentiability. We prove that the standard, expected-reward RL formulation is only a first-order approximation of the likelihood. To remedy this mismatch, we introduce **Maximum Likelihood Reinforcement Learning (MaxRL)**, a compute-indexed family of sample-based objectives that interpolate between expected-reward RL and maximum likelihood as sampling compute is scaled. The resulting objective is a one-line change to standard RL implementations. MaxRL Pareto-dominates existing methods in all tested models and tasks, achieves up to $\mathbf{20\times}$ gains in test-time scaling efficiency over GRPO, and scales more favorably with additional training data and compute.
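The first-order claim in the abstract can be made concrete with a short worked expansion. This is a sketch under stated assumptions: $p$ denotes the probability that a single rollout for a given prompt is correct, and the truncation order $T$ is notation introduced here for illustration; neither symbol appears in the abstract, and the truncated family below may not coincide exactly with the family defined in the paper.

$$\log p \;=\; \log\bigl(1-(1-p)\bigr) \;=\; -\sum_{k=1}^{\infty}\frac{(1-p)^{k}}{k}, \qquad 0 < p \le 1.$$

Keeping only the $k=1$ term gives $-(1-p) = p - 1$, which is the expected reward up to an additive constant. Differentiating the order-$T$ truncation $J_T(p) = -\sum_{k=1}^{T}(1-p)^{k}/k$ gives

$$\nabla J_T \;=\; \sum_{k=1}^{T}(1-p)^{k-1}\,\nabla p \;=\; \frac{1-(1-p)^{T}}{p}\,\nabla p,$$

so the weight on $\nabla p$ equals $1$ at $T=1$ (the expected-reward gradient) and tends to $1/p$ as $T \to \infty$ (the maximum likelihood gradient, $\nabla \log p = \nabla p / p$).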
Lay Summary: Most AI systems learn by adjusting themselves to make the correct answer as likely as possible. A photo labeled "cat" tells the system exactly what it should have said, and the system tunes itself toward that answer. This simple principle underpins much of modern machine learning.
But some problems do not hand you the full answer to tune toward. When an AI system must find its own route through a maze or work through a math problem, the useful intermediate steps are not available as labels; often, we can only check whether the finished attempt is right or wrong. Without a target path to copy, the field usually falls back on a different method, called **reinforcement learning**: let the model try, reward what works. We show that this fallback quietly settles for less: **by chasing the average rate of success, it leans on the easy problems and gives too little attention to the hard ones, where most of the learning lies**.
Our method, MaxRL, keeps the focus on problems the model rarely gets right. We show that focusing learning on the hardest examples (e.g., the hardest topics in a syllabus) can ultimately yield a more capable model than chasing higher average performance across all tasks. Across maze navigation, image recognition, and mathematical reasoning, MaxRL solves more problems than standard methods and, when answers can be checked automatically, reaches the same accuracy with far less computation (up to 20× more efficient than standard methods in the field). We also find that MaxRL improves more when given additional data and training resources. Our results suggest that MaxRL is a promising training framework for tasks where correctness can be verified but the path to a correct answer must be discovered through sampling.
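The point about easy versus hard problems can be illustrated numerically. The sketch below is not the authors' implementation: it assumes a per-problem success probability `p` and uses the Maclaurin series log p = -Σ_{k≥1} (1-p)^k / k, truncated at a hypothetical `order`, as a stand-in for a compute-indexed objective. The function names `truncated_objective` and `gradient_weight` are introduced here for illustration only.

```python
import math


def truncated_objective(p: float, order: int) -> float:
    """Truncation of log(p) = -sum_{k>=1} (1-p)^k / k after `order` terms.

    order=1 gives p - 1 (expected reward up to a constant);
    the value approaches log(p) as order grows.
    """
    return -sum((1.0 - p) ** k / k for k in range(1, order + 1))


def gradient_weight(p: float, order: int) -> float:
    """Derivative of the truncated objective with respect to p.

    Closed form of sum_{k=1}^{order} (1-p)^(k-1), i.e. (1 - (1-p)^order) / p.
    This is the factor multiplying the plain expected-reward gradient.
    """
    return (1.0 - (1.0 - p) ** order) / p


if __name__ == "__main__":
    # Compare an easy, a medium, and a hard problem (illustrative values).
    for p in (0.9, 0.5, 0.05):
        weights = [round(gradient_weight(p, t), 3) for t in (1, 4, 16, 256)]
        print(f"p={p}: weight at order 1, 4, 16, 256 = {weights}, "
              f"limit 1/p = {round(1.0 / p, 3)}")
```

At order 1 every problem receives the same weight, so the update is driven by whichever problems yield the most successes. As the order grows, the weight on a rarely solved problem approaches 1/p, while the weight on an easy problem stays close to 1.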
Link To Code: https://github.com/tajwarfahim/maxrl
Primary Area: Deep Learning->Large Language Models
Keywords: Maximum Likelihood, Reinforcement Learning, Large Language Models, Reasoning, Diversity
Submission Number: 17973