Optimizing Test-Time Compute via Meta Reinforcement Finetuning

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: This paper introduces a new paradigm to optimize test-time compute in LLMs
Abstract: Training models to efficiently use test-time compute is crucial for improving the reasoning performance of LLMs. Current methods mostly do so via fine-tuning on search traces or by running RL against the 0/1 outcome reward, but do these approaches efficiently utilize test-time compute, and would they continue to scale as the budget increases? In this paper, we try to answer these questions. We formalize the problem of optimizing test-time compute as a meta reinforcement learning (RL) problem, which provides a principled perspective on spending test-time compute through the lens of exploration and exploitation. It also motivates the use of cumulative regret to measure the efficacy of test-time compute by viewing a long output stream as consisting of several episodes from the model. While current state-of-the-art models do not optimize regret, we show that regret can be minimized by running RL on the final 0/1 outcome reward, regularized by a dense reward bonus given by the "information gain" from each subsequent block in the output stream. We prescribe an approach for quantifying information gain, which measures the utility of an intermediate segment of tokens toward improving the accuracy of the final answer. We instantiate this idea to develop MRT, a new class of fine-tuning methods for optimizing test-time compute. Fine-tuning with MRT leads to substantial improvements in both performance and token efficiency on the AIME dataset.
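As a rough illustration of the shaped objective the abstract describes (the block notation and the weight $\alpha$ below are our own for exposition, not necessarily the paper's exact formulation): write the model's output stream for a prompt $x$ as blocks $z_1, \dots, z_K$ followed by a final answer. The information gain of block $z_j$ can be read as how much appending it improves the chance of a correct final answer,
$$ r^{\text{info}}_j \;=\; \Pr[\text{final answer correct} \mid x, z_{1:j}] \;-\; \Pr[\text{final answer correct} \mid x, z_{1:j-1}], $$
and fine-tuning then maximizes the final 0/1 outcome reward plus a weighted sum of these dense bonuses,
$$ R \;=\; r_{0/1} \;+\; \alpha \sum_{j=1}^{K} r^{\text{info}}_j. $$
Under this view, a block that does not move the answer distribution earns no bonus, so spending tokens without making progress adds to cumulative regret rather than to reward.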
Lay Summary: Modern AI models like DeepSeek-R1 often generate long responses to solve complex problems. But more thinking doesn't always mean better answers; sometimes it's just wasted effort. How can we train an LLM to use its thinking time more wisely? In this work, we treat each step of the model's reasoning as a decision, similar to how a person might weigh whether to keep working on a problem or stop. We developed a new training method called Meta Reinforcement Fine-Tuning (MRT) that encourages models to make meaningful progress at every step, not just aim for a final correct answer. We tested this idea on challenging math problems and found that MRT-trained models solve problems using fewer words and with higher accuracy. In fact, they often perform better by thinking less, as long as each step is purposeful. This research offers a new way to train LLMs to reason more efficiently, which could improve their performance in real-world applications without requiring more computational resources.
Primary Area: Deep Learning->Large Language Models
Keywords: meta-RL, test-time compute, information gain, self-correction, LLM optimization, cumulative regret, backtracking
Link To Code: https://github.com/CMU-AIRe/MRT
Submission Number: 12976