TL;DR: We provide guarantees on the KL divergence and win rate of the best-of-$n$ alignment policy.
Abstract: A simple and effective method for the inference-time alignment of generative models is the best-of-$n$ policy, where $n$ samples are drawn from a reference policy, ranked according to a reward function, and the highest-ranking one is selected. A commonly used analytical expression in the literature claims that the KL divergence between the best-of-$n$ policy and the reference policy is equal to $\log (n) - (n-1)/n.$ We disprove this claim and show that the expression is in fact an upper bound on the actual KL divergence. We analyze the tightness of this upper bound in different regimes, propose a new estimator for the KL divergence, and empirically show that it provides a tight approximation. We also show that the win rate of the best-of-$n$ policy against the reference policy is upper bounded by $n/(n+1)$ and derive bounds on the tightness of this characterization. We conclude by analyzing the tradeoffs between win rate and KL divergence of the best-of-$n$ alignment policy, which demonstrate that very good tradeoffs are achievable with $n < 1000$.
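The following is a minimal sketch (not the authors' code) of the quantities discussed in the abstract, on a toy discrete reference distribution with distinct rewards. The helper names (`ref_probs`, `rewards`, `best_of_n_probs`) are illustrative assumptions. It compares the exact KL divergence of the best-of-$n$ policy against the claimed expression $\log(n) - (n-1)/n$, and a Monte Carlo win-rate estimate against the $n/(n+1)$ bound.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy reference policy over 10 outcomes; outcome i has reward i (all distinct).
ref_probs = rng.dirichlet(np.ones(10))
rewards = np.arange(10, dtype=float)

def best_of_n_probs(ref_probs, rewards, n):
    """Exact best-of-n distribution when rewards are distinct:
    P_bon(y) = F(y)^n - F(y^-)^n, where F is the reward-ordered CDF."""
    order = np.argsort(rewards)
    cdf = np.cumsum(ref_probs[order])
    cdf_prev = np.concatenate(([0.0], cdf[:-1]))
    bon_sorted = cdf**n - cdf_prev**n
    bon = np.empty_like(bon_sorted)
    bon[order] = bon_sorted
    return bon

def kl(p, q):
    """KL(p || q) for discrete distributions with matching support."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

for n in (2, 4, 16, 64):
    bon = best_of_n_probs(ref_probs, rewards, n)
    exact_kl = kl(bon, ref_probs)
    kl_bound = np.log(n) - (n - 1) / n        # the commonly used expression
    win_bound = n / (n + 1)                   # win-rate upper bound
    # Monte Carlo win rate of best-of-n vs. the reference (ties count as 1/2).
    y_bon = rng.choice(len(rewards), size=20_000, p=bon)
    y_ref = rng.choice(len(rewards), size=20_000, p=ref_probs)
    win = np.mean((rewards[y_bon] > rewards[y_ref])
                  + 0.5 * (rewards[y_bon] == rewards[y_ref]))
    print(f"n={n:3d}  KL={exact_kl:.3f} <= {kl_bound:.3f}  "
          f"win~{win:.3f} <= {win_bound:.3f}")
```

In this toy setting the exact KL stays below $\log(n) - (n-1)/n$, and the gap illustrates why the expression should be read as an upper bound rather than an identity; the win rate likewise stays at or below $n/(n+1)$, with ties pulling it down in discrete settings.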
Lay Summary: A common approach for improving the output of generative models is best-of-n, i.e., to generate n candidates and then choose the one that maximizes some measure of quality. How much does this strategy change the underlying model? Researchers commonly use the heuristic that the divergence grows with the logarithm of the number of candidates. We show that this is only an upper bound, characterize the conditions under which it is tight, and propose a better approximation. Finally, we bound the probability that the "best-of-n" strategy produces a better output, and analyze how this gain trades off against preserving the key features of the original model.
Primary Area: Theory
Keywords: Alignment, rejection sampling, KL divergence
Submission Number: 7453