CarBoN: Calibrated Best-of-N Sampling Improves Test-time Reasoning

Yung-Chen Tang; Pin-Yu Chen; Andrea Cavallaro

04 Sept 2025 (modified: 02 Dec 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0

Keywords: test-time scaling, calibration, LLM

Abstract: Allocating more computation at inference time (test-time scaling) improves language model performance, especially for reasoning tasks. However, popular methods like Best-of-$N$ sampling often show diminishing returns as $N$ increases. To address this inefficiency, we introduce a general $\textbf{test-time calibration framework}$ that adaptively modifies the model toward high-reward reasoning paths, with theoretical guarantees of improving the lower bound of expected reward under finite sampling, all without large language model (LLM) retraining. Within this framework, we propose $\textbf{CarBoN}$ (Calibrated Best-of-$N$), a two-phase method that first explores the solution space and then learns a calibration of the logits via an input-specific temperature $T$ and additive shift vector $\delta$, guiding generation toward more reliable reasoning. Experiments on MATH-500 and AIME-2024 show that CarBoN improves efficiency, with up to $4\times$ fewer rollouts to reach the same accuracy, while often achieving higher accuracy under fixed budgets. We also analyze the complementary roles of $T$ and $\delta$ in balancing output diversity and correctness, and show that the framework generalizes to step-level sampling strategies such as beam search.
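
To make the two ingredients named in the abstract concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the calibrated distribution takes the form softmax(logits / T + δ); the abstract does not state the exact parameterization, and the phase that learns $T$ and $\delta$ is not shown. The names `calibrated_probs`, `best_of_n`, and `reward_fn` are hypothetical, and the candidates stand in for complete reasoning rollouts.

```python
import math
import random


def calibrated_probs(logits, temperature, delta):
    """Softmax over logits / temperature + delta.

    ASSUMPTION: this functional form is illustrative only; the abstract
    does not specify how T and delta enter the logits.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if len(logits) != len(delta):
        raise ValueError("logits and delta must have the same length")
    z = [l / temperature + d for l, d in zip(logits, delta)]
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def best_of_n(candidates, probs, reward_fn, n, rng):
    """Draw n candidates according to probs and keep the highest-reward one.

    reward_fn is a hypothetical scorer (for example, a reward model).
    """
    samples = rng.choices(candidates, weights=probs, k=n)
    return max(samples, key=reward_fn)


# Toy usage: three candidate answers with made-up logits and rewards.
candidates = ["answer_a", "answer_b", "answer_c"]
logits = [1.0, 2.0, 0.5]
rewards = {"answer_a": 0.9, "answer_b": 0.2, "answer_c": 0.4}

# A shift toward the first candidate raises its sampling probability,
# while a higher temperature flattens the distribution.
probs = calibrated_probs(logits, temperature=1.5, delta=[1.0, 0.0, 0.0])
choice = best_of_n(candidates, probs, rewards.get, n=4, rng=random.Random(0))
```

In this toy setting, $\delta$ moves probability mass toward specific outputs and $T$ controls how spread out the samples are, which is one way to read the abstract's description of their complementary roles in balancing correctness and diversity.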

Supplementary Material: zip

Primary Area: foundation or frontier models, including LLMs

Submission Number: 2033
