Keywords: Test-time Scaling, Model Calibration, Efficient Inference, Language Modeling, Scaling
TL;DR: We propose Self-Calibration, a new unsupervised framework that helps a model calibrate its confidence and uses that confidence for efficient test-time scaling.
Abstract: While increasing test-time computation with methods like Best-of-N sampling and Self-Consistency enhances the quality of Large Language Model (LLM) responses, their fixed sampling strategy is inefficient. These approaches waste computation on simple questions while insufficiently exploring complex ones. We argue that model confidence is key to improving this efficiency. However, LLMs are notoriously overconfident, making their confidence estimates unreliable. To address this, we introduce Self-Calibration, a method that distills reliable confidence scores from Self-Consistency back into the model itself, enabling accurate confidence estimation in a single forward pass. Building on this, we design confidence-based efficient test-time scaling methods, such as an early-stopping mechanism for Best-of-N. Experiments across six datasets demonstrate our approach's effectiveness. Notably, on MathQA, our method improved accuracy from 81.0\% to 83.6\% with a 16-response budget, confirming the value of our strategy.
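To illustrate the confidence-based early stopping described above, here is a minimal sketch (not the authors' implementation). The callables `generate_response` and `estimate_confidence` are hypothetical stand-ins for a model call and the single-forward-pass confidence estimate the abstract refers to; the threshold and budget values are illustrative.

```python
# Minimal sketch of confidence-based early stopping for Best-of-N sampling.
# `generate_response` and `estimate_confidence` are hypothetical placeholders.
from typing import Callable, List, Tuple


def early_stop_best_of_n(
    question: str,
    generate_response: Callable[[str], str],           # samples one candidate answer
    estimate_confidence: Callable[[str, str], float],  # confidence in [0, 1]
    max_samples: int = 16,
    confidence_threshold: float = 0.9,
) -> Tuple[str, List[Tuple[str, float]]]:
    """Sample up to `max_samples` responses, stopping early once a
    sufficiently confident answer appears; otherwise return the most
    confident answer seen within the budget."""
    candidates: List[Tuple[str, float]] = []
    for _ in range(max_samples):
        answer = generate_response(question)
        conf = estimate_confidence(question, answer)
        candidates.append((answer, conf))
        if conf >= confidence_threshold:
            break  # easy question: stop spending compute early
    best_answer, _ = max(candidates, key=lambda pair: pair[1])
    return best_answer, candidates
```

In this sketch, easy questions terminate after one or two samples, while hard questions consume the full budget, which is the efficiency behavior the abstract motivates.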
Submission Number: 81