Accelerating Best-of-N via Speculative Rejection

Published: 18 Jun 2024, Last Modified: 18 Jun 2024WANT@ICML 2024 PosterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: alignment, large language models, rejection sampling, best-of-n, acceleration
TL;DR: We accelerate the Best-of-N algorithm, an inference-time alignment strategy
Abstract: The safe and effective deployment of Large Language Models (LLMs) often involves generating helpful and benign responses, producing easily comprehensible code, and crafting content with specific stylistic preferences. While different, these tasks share the common mathematical goal of generating responses from a language model with high scores according to a metric of interest. A popular and well known decoding strategy for this purpose is the Best-of-N method. The method generates a pre-specified number of responses (N) based on a prompt, and then selects the highest-scoring response among them to be returned. While Best-of-N is both simple and effective, its reliance on generating multiple responses to score for any given prompt incurs high inference costs. In this paper we make a first step towards accelerating the Best-of-N algorithm, by halting the generation of unpromising utterances, namely those that are unlikely to be returned by the algorithm upon completion. Focusing on the alignment problem, we show that this simple strategy allows to obtain substantial speedups for the Best-of-N algorithm with minimal performance degradation.
Submission Number: 51
Loading