Re-FORC: Adaptive Reward Prediction for Efficient Chain-of-Thought Reasoning

Published: 16 Oct 2025, Last Modified: 10 Nov 2025, NeurIPS 2025 ER Workshop, CC BY 4.0
Keywords: Reward Forecasting, Pandora's Box Theory, Model Selection, Early Stopping, Test-Time Scaling
TL;DR: We propose a method for forecasting expected future rewards as a function of future thinking tokens, and show its application to Pandora's Box greedy search, early stopping, model selection, and test-time scaling.
Abstract: We propose Re-FORC, an adaptive reward prediction method that, given a context, predicts the expected future reward as a function of the number of future thinking tokens. Re-FORC trains a lightweight adapter on reasoning models, with prediction quality improving for longer reasoning chains and larger models. Re-FORC enables: 1) *early stopping* of unpromising reasoning chains, reducing compute by 26\% while maintaining accuracy; 2) *optimized model and thinking-length selection*, achieving 4\% higher accuracy at equal compute and 55\% less compute at equal accuracy compared to the largest model; 3) *adaptive test-time scaling*, which increases accuracy by 11\% in the high-compute regime and by 7\% in the low-compute regime. Re-FORC allows dynamic reasoning with length control via cost-per-token thresholds while providing upfront estimates of computation time.
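The abstract describes a cost-per-token stopping rule but does not include an implementation. Below is a minimal Python sketch of how such a rule could work, assuming a hypothetical `predict_reward(context_tokens, future_tokens)` forecaster standing in for the trained Re-FORC adapter; the function names, the toy saturating reward curve, and the chunk size are illustrative assumptions, not the paper's code.

```python
import math

def predict_reward(context_tokens: int, future_tokens: int) -> float:
    """Stand-in for the Re-FORC adapter: forecasts the expected reward
    after `future_tokens` more thinking tokens, given the current
    context length. The saturating curve is a toy assumption."""
    return 1.0 - math.exp(-(context_tokens + future_tokens) / 2000.0)

def should_continue(context_tokens: int, cost_per_token: float,
                    chunk: int = 512) -> bool:
    """Cost-per-token stopping rule: keep thinking only while the
    forecast reward gain from the next chunk exceeds its token cost."""
    gain = (predict_reward(context_tokens, chunk)
            - predict_reward(context_tokens, 0))
    return gain > cost_per_token * chunk

# Reasoning loop that stops once additional thinking no longer pays off.
tokens = 0
while should_continue(tokens, cost_per_token=1e-4):
    tokens += 512  # decode the next chunk of thinking tokens
print(f"stopped after {tokens} thinking tokens")
```

Under this framing, reasoning halts as soon as the marginal forecast gain drops below the per-token cost, which is the mechanism behind the early-stopping and length-control capabilities summarized above.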
Submission Number: 149