Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 Spotlight · CC BY 4.0
Keywords: Large Language Models, Test-Time Scaling, Best-of-N Sampling, Efficient Decoding, Self-Estimation
TL;DR: We propose ST-BoN decoding, which avoids fully generating all $N$ samples and the need for reward models, reducing GPU memory overhead and inference latency while achieving better cost-performance trade-offs.
Abstract: Test-time scaling enhances large language model performance by allocating additional compute resources during decoding. Best-of-$N$ (BoN) sampling serves as a common sampling-based scaling technique, broadening the search space in parallel to find better solutions from the model distribution. However, its cost–performance trade-off is still underexplored. Two main challenges limit the efficiency of BoN sampling: (1) Generating $N$ full samples consumes substantial GPU memory, reducing inference capacity under limited resources. (2) Reward models add extra memory and latency overhead, and training strong reward models introduces potential training data costs. Although some studies have explored efficiency improvements, none have addressed both challenges at once. To address this gap, we propose **Self-Truncation Best-of-$N$ (ST-BoN)**, a decoding method that avoids fully generating all $N$ samples and eliminates the need for reward models. It leverages early sampling consistency in the model’s internal states to identify the most promising path and truncate suboptimal ones. In terms of cost, ST-BoN reduces dynamic GPU memory usage by over 80% and inference latency by 50%. In terms of cost–performance trade-off, ST-BoN achieves the same performance as Full-BoN while saving computational cost by 70%–80%, and under the same cost, it can improve accuracy by 3–4 points.
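The abstract describes the core idea at a high level: sample $N$ short prefixes, use the consistency of the model's internal states across those early samples to pick the most promising path, and truncate the rest before full generation. The following is a minimal sketch of that idea using HuggingFace `transformers`; the specific consistency signal (cosine similarity between mean last-layer hidden states of the early prefixes) and all hyperparameter names (`early_budget`, `n`, `max_new`) are illustrative assumptions, not the paper's exact scoring rule.

```python
# Illustrative sketch of the Self-Truncation Best-of-N idea from the abstract.
# ASSUMPTION: consistency is scored as mean pairwise cosine similarity of
# last-layer hidden-state averages; the paper's actual signal may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def st_bon(prompt: str, n: int = 8, early_budget: int = 32, max_new: int = 256) -> str:
    inputs = tok(prompt, return_tensors="pt")

    # 1) Sample N short prefixes in parallel (the "early decoding" stage).
    early = model.generate(
        **inputs,
        do_sample=True,
        num_return_sequences=n,
        max_new_tokens=early_budget,
        return_dict_in_generate=True,
    )
    seqs = early.sequences  # shape: (n, prompt_len + early_budget)

    # 2) Score each prefix by its consistency with the other prefixes,
    #    summarizing each prefix by its mean last-layer hidden state.
    with torch.no_grad():
        hs = model(seqs, output_hidden_states=True).hidden_states[-1]  # (n, L, d)
    emb = torch.nn.functional.normalize(hs.mean(dim=1), dim=-1)        # (n, d)
    sim = emb @ emb.T                                  # pairwise cosine similarity
    consistency = (sim.sum(dim=-1) - 1.0) / (n - 1)    # mean similarity to others
    best = consistency.argmax().item()

    # 3) Truncate the suboptimal paths and finish decoding only the winner,
    #    so the full generation cost is paid for a single sample.
    winner = seqs[best : best + 1]
    full = model.generate(winner, do_sample=True, max_new_tokens=max_new)
    return tok.decode(full[0], skip_special_tokens=True)
```

Note that no reward model appears anywhere in this sketch: the selection signal comes from the model's own hidden states over the early prefixes, which is what allows the memory and latency savings claimed in the abstract.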
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 7777