Optimizing Test-Time Compute via Meta Reinforcement Finetuning

Published: 06 Mar 2025, Last Modified: 06 Mar 2025 · ICLR 2025 FM-Wild Workshop · CC BY 4.0
Keywords: meta-RL, test-time compute, information gain, self-correction, LLM optimization, cumulative regret, backtracking
TL;DR: The paper proposes MRT, a meta-RL fine-tuning method that optimizes LLM test-time compute by combining 0/1 outcome rewards with a dense information-gain bonus.
Abstract: Training models to efficiently use test-time compute is crucial for improving the reasoning performance of LLMs. Current methods mostly do so by fine-tuning on search traces or by running RL against a 0/1 outcome reward, but do these approaches efficiently utilize test-time compute? Would they continue to scale as the budget grows? In this paper, we try to answer these questions. We formalize the problem of optimizing test-time compute as a meta reinforcement learning (RL) problem, which provides a principled perspective on spending test-time compute through the lens of exploration and exploitation. It also motivates the use of cumulative regret to measure the efficacy of test-time compute by viewing a long output stream as consisting of several episodes from the model. While current state-of-the-art models do not optimize regret, we show that regret can be minimized by running final 0/1 reward RL regularized by a dense reward bonus, given by the "information gain" from each subsequent block in the output stream. We prescribe an approach for quantifying information gain, which measures the utility of an intermediate segment of tokens toward improving the accuracy of the final answer. We instantiate this idea to develop MRT, a new class of fine-tuning methods for optimizing test-time compute. Fine-tuning with MRT leads to substantial improvements in both performance and token efficiency on the AIME dataset.
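To make the dense-reward idea from the abstract concrete, below is a minimal, hypothetical Python sketch of how a per-block information-gain bonus could be combined with the terminal 0/1 outcome reward. The function name, the `alpha` weight, and the use of per-prefix success-probability estimates are illustrative assumptions for exposition, not the paper's exact formulation.

```python
from typing import List


def mrt_style_reward(prefix_success_probs: List[float],
                     final_correct: bool,
                     alpha: float = 1.0) -> List[float]:
    """Per-block rewards: a dense information-gain bonus, with the terminal
    0/1 outcome reward added on the last block.

    prefix_success_probs[j] is an estimate (e.g., from sampled rollouts) of
    the probability that the model reaches the correct final answer when
    forced to terminate after the first j blocks, for j = 0..k.
    """
    rewards = []
    k = len(prefix_success_probs) - 1  # number of intermediate blocks
    for j in range(1, k + 1):
        # "Information gain" of block j: how much it raises the chance of
        # eventually answering correctly, relative to the previous prefix.
        info_gain = prefix_success_probs[j] - prefix_success_probs[j - 1]
        rewards.append(alpha * info_gain)
    # Terminal 0/1 outcome reward on the final block.
    if rewards:
        rewards[-1] += 1.0 if final_correct else 0.0
    return rewards


# Example: three blocks of reasoning; the second block makes real progress.
print(mrt_style_reward([0.1, 0.15, 0.6, 0.65], final_correct=True))
```

Under this sketch, a block that does not move the estimated success probability earns no bonus, so an output stream that spends many tokens without progress accumulates regret relative to one that improves steadily.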
Submission Number: 16
