Re-FORC: Adaptive Reward Prediction for Efficient Chain-of-Thought Reasoning

Published: 16 Oct 2025, Last Modified: 10 Nov 2025, NeurIPS 2025 ER Workshop, CC BY 4.0
Keywords: Reward Forecasting, Pandora's Box Theory, Model Selection, Early Stopping, Test-Time Scaling
TL;DR: We propose a method for forecasting expected future rewards as a function of future thinking tokens, and show its application to Pandora's Box greedy search, early stopping, model selection, and test-time scaling.
Abstract: We propose Re-FORC, an adaptive reward prediction method that, given a context, predicts the expected future reward as a function of the number of future thinking tokens. Re-FORC trains a lightweight adapter on reasoning models, with prediction quality improving for longer reasoning chains and larger models. Re-FORC enables: 1) *early stopping* of unpromising reasoning chains, reducing compute by 26\% while maintaining accuracy; 2) *optimized model and thinking-length selection*, achieving 4\% higher accuracy at equal compute and 55\% less compute at equal accuracy compared to the largest model; 3) *adaptive test-time scaling*, which increases accuracy by 11\% in the high-compute regime and by 7\% in the low-compute regime. Re-FORC allows dynamic reasoning with length control via cost-per-token thresholds while providing upfront estimates of computation time.
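The abstract describes a cost-per-token stopping rule but does not include an implementation. Below is a minimal Python sketch of how such a rule could work, assuming a hypothetical `predict_reward(context_tokens, future_tokens)` forecaster standing in for the trained Re-FORC adapter; the function names, the toy saturating reward curve, and the chunk size are illustrative assumptions, not the paper's code.

```python
import math

def predict_reward(context_tokens: int, future_tokens: int) -> float:
    """Stand-in for the Re-FORC adapter: forecasts the expected reward
    after `future_tokens` more thinking tokens, given the current
    context length. The saturating curve is a toy assumption."""
    return 1.0 - math.exp(-(context_tokens + future_tokens) / 2000.0)

def should_continue(context_tokens: int, cost_per_token: float,
                    chunk: int = 512) -> bool:
    """Cost-per-token stopping rule: keep thinking only while the
    forecast reward gain from the next chunk exceeds its token cost."""
    gain = (predict_reward(context_tokens, chunk)
            - predict_reward(context_tokens, 0))
    return gain > cost_per_token * chunk

# Reasoning loop that stops once additional thinking no longer pays off.
tokens = 0
while should_continue(tokens, cost_per_token=1e-4):
    tokens += 512  # decode the next chunk of thinking tokens
print(f"stopped after {tokens} thinking tokens")
```

Under this framing, reasoning halts as soon as the marginal forecast gain drops below the per-token cost, which is the mechanism behind the early-stopping and length-control capabilities summarized above.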
Submission Number: 149