Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We analyze the Best-of-N algorithm for choosing among language model generations and demonstrate that it is suboptimal, before introducing a new, optimal algorithm motivated by the principle of pessimism in the face of uncertainty.
Abstract: Recent work on inference-time alignment has established benefits of increasing inference-time computation in language models, but naively scaling compute through techniques like Best-of-N sampling can cause performance to degrade due to reward hacking. Toward a theoretical understanding of how to best leverage additional computation, we formalize inference-time alignment as improving a pre-trained policy’s responses for a prompt of interest, given access to an imperfect reward model. We analyze the performance of inference-time alignment algorithms in terms of (i) response quality, and (ii) compute, and provide new results that highlight the importance of the pre-trained policy’s coverage over high-quality responses for performance and compute scaling: (1) We show that Best-of-N alignment with an ideal N can achieve optimal performance under stringent notions of coverage, but provably suffers from reward hacking when N is large, and fails to achieve tight guarantees under more realistic coverage conditions; (2) We introduce InferenceTimePessimism, a new algorithm which mitigates reward hacking through deliberate use of inference-time compute, implementing pessimism in the face of uncertainty; we prove that its performance is optimal and scaling-monotonic, i.e., does not degrade as N increases. We complement our theoretical results with experiments that demonstrate the practicality of our algorithm across a variety of tasks and models.
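For reference, below is a minimal sketch of the Best-of-N procedure that the abstract analyzes: sample N responses from the pre-trained policy and return the one with the highest score under the learned reward model. The `generate` and `reward_model` callables are placeholders for a policy and an imperfect reward model, not interfaces from the paper.

```python
from typing import Callable, List


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],              # samples one response from the pre-trained policy
    reward_model: Callable[[str, str], float],   # scores a (prompt, response) pair
    n: int,
) -> str:
    """Draw n candidate responses and return the one with the highest proxy reward.

    Because the reward model is imperfect, taking n large eventually selects
    responses that exploit its errors (reward hacking), which is the failure
    mode the paper characterizes.
    """
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    return candidates[max(range(n), key=lambda i: scores[i])]
```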
Lay Summary: Inference-time methods for language models alter and improve the model’s outputs at generation time, and have experienced great empirical success. One popular approach, called Best‐of‐N sampling, generates N candidates and returns the one with the highest score under a learned model. However, as N grows, reward model errors accumulate and output quality worsens, meaning that Best-of-N is unable to access the full improvement in response quality available at inference time. To solve this problem, we first built a clean mathematical framework for inference‐time alignment, and analyzed exactly how and why Best‐of‐N sampling breaks down. Then, we designed a new algorithm, InferenceTimePessimism, that deliberately penalizes highly-rewarded responses with high uncertainty, and utilizes essentially the same amount of computation as Best-of-N. Crucially, our method separates how much computation we spend from how strongly we penalize uncertainty, so adding more compute never hurts quality. We proved that this approach achieves the best possible trade‐off between reward‐model error and computational cost and showed in experiments that it reliably improves accuracy without the performance dips seen in standard Best‐of‐N sampling. In this way, our research offers both a theoretical roadmap and a practical tool for making large language models more robust and reliable at inference time.
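To make the "penalize highly-rewarded but uncertain responses" idea concrete, here is an illustrative sketch of pessimistic selection. It is not the paper's InferenceTimePessimism algorithm: the ensemble-disagreement uncertainty estimate and the penalty weight `lam` are assumptions introduced purely for illustration. It does show the key structural point from the summary, namely that the penalty strength is decoupled from the number of candidates, so adding more compute does not change how aggressively uncertainty is penalized.

```python
from statistics import mean, pstdev
from typing import Callable, Sequence


def pessimistic_select(
    prompt: str,
    candidates: Sequence[str],
    reward_ensemble: Sequence[Callable[[str, str], float]],  # hypothetical ensemble of reward models
    lam: float = 1.0,  # strength of the uncertainty penalty, independent of len(candidates)
) -> str:
    """Return the candidate maximizing mean reward minus a disagreement penalty."""
    def pessimistic_score(response: str) -> float:
        rewards = [rm(prompt, response) for rm in reward_ensemble]
        # Penalize responses whose reward estimates disagree across the ensemble.
        return mean(rewards) - lam * pstdev(rewards)

    return max(candidates, key=pessimistic_score)
```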
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: inference time alignment, best of n, pessimism, reinforcement learning theory, offline reinforcement learning
Submission Number: 9722