Keywords: process reward model, calibration, inference-time scaling, instance-adaptive scaling
TL;DR: We show that off-the-shelf PRMs are often poorly calibrated. To address this, we introduce a quantile-regression calibration that aligns their outputs with true success probabilities, and we show that calibration unlocks instance-adaptive inference-time scaling.
Abstract: Process reward models (PRMs) play a central role in guiding inference-time scaling algorithms for large language models (LLMs).
We prove that well-calibrated PRMs enable cost-optimal reasoning.
To this end, we introduce an \emph{instance-adaptive scaling} (IAS) framework that dynamically adjusts the inference budget based on the estimated likelihood that a partial reasoning trajectory will yield a correct final answer.
Unlike conventional methods that allocate a fixed number of reasoning trajectories per query, this approach adapts compute to each instance and reasoning step when guided by our calibrated PRMs.
A central challenge, however, is that off-the-shelf PRMs are often poorly calibrated and tend to overestimate success probabilities.
To address this, we present a calibration approach based on quantile regression that adjusts PRM outputs to better align with true success probabilities.
Experiments on mathematical reasoning benchmarks show that (i) our PRM calibration method achieves low calibration error, outperforming baseline methods, (ii) calibration is crucial for enabling effective adaptive scaling, and (iii) the IAS strategy reduces compute usage without sacrificing accuracy.
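To make the calibration idea concrete, the sketch below shows one way quantile regression could map raw PRM scores to conservative success-probability estimates. It is a minimal illustration under stated assumptions, not the paper's implementation: the single scalar feature, the use of scikit-learn's quantile loss, and the rollout-based labels (empirical success rates per partial trajectory) are all assumptions.

```python
# Minimal sketch (assumptions noted above): calibrate raw PRM scores
# against empirical rollout success rates using quantile regression.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_calibrator(raw_scores, success_rates, quantile=0.25):
    """Fit a quantile regressor mapping a raw PRM score to a conservative
    (lower-quantile) estimate of the trajectory's success probability.

    raw_scores    : raw PRM scores for partial trajectories (hypothetical data)
    success_rates : fraction of rollouts from each prefix that reached the
                    correct final answer (hypothetical labels)
    """
    X = np.asarray(raw_scores, dtype=float).reshape(-1, 1)
    y = np.asarray(success_rates, dtype=float)
    model = GradientBoostingRegressor(loss="quantile", alpha=quantile)
    model.fit(X, y)
    return model

def calibrated_success_prob(model, raw_score):
    """Calibrated success-probability estimate for a single raw PRM score."""
    p = model.predict(np.array([[raw_score]], dtype=float))[0]
    return float(np.clip(p, 0.0, 1.0))  # keep the estimate inside [0, 1]
```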
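Similarly, the next sketch shows one plausible instantiation of instance-adaptive budget allocation: given a calibrated per-sample success probability, draw just enough independent samples so that at least one is correct with a target probability. The independence assumption, the threshold `delta`, the cap `max_samples`, and the helper names are illustrative and not taken from the paper.

```python
# Minimal sketch of instance-adaptive scaling (IAS) budget allocation.
# Illustrative rule, not the paper's exact allocation strategy.
import math

def adaptive_budget(p_success, delta=0.05, max_samples=64):
    """Smallest n with 1 - (1 - p)^n >= 1 - delta, capped at max_samples."""
    p = min(max(p_success, 1e-6), 1.0 - 1e-6)  # guard against p = 0 or 1
    n = math.ceil(math.log(delta) / math.log(1.0 - p))
    return max(1, min(n, max_samples))

# Easy instances (high calibrated p) get a small budget, hard ones get more:
# adaptive_budget(0.8) -> 2, adaptive_budget(0.2) -> 14, adaptive_budget(0.05) -> 59
```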
Submission Number: 85