Asymptotics of Language Model Alignment

Published: 01 Jan 2024 · Last Modified: 30 Nov 2024 · ISIT 2024 · CC BY-SA 4.0
Abstract: Let $\boldsymbol{p}$ denote a reference generative language model. Let $\boldsymbol{r}$ denote a reward model that returns a scalar capturing the degree to which a draw from $\boldsymbol{p}$ is preferred. The goal of language model alignment is to alter $\boldsymbol{p}$ to a new distribution $\phi$ that achieves a higher expected reward while keeping $\phi$ close to $\boldsymbol{p}$. A popular alignment method is KL-constrained reinforcement learning (RL), which chooses the distribution $\phi_\Delta$ that maximizes $E_{\phi_\Delta}[\boldsymbol{r}(\boldsymbol{y})]$ subject to the relative entropy constraint $D_{\mathrm{KL}}(\phi_\Delta \Vert \boldsymbol{p}) \leq \Delta$. Another simple alignment method is best-of-$N$, where $N$ samples are drawn from $\boldsymbol{p}$ and the one with the highest reward is selected. In this paper, we offer a closed-form characterization of the optimal KL-constrained RL solution. We then demonstrate that any alignment method that achieves a comparable trade-off between KL divergence and expected reward must approximate the optimal KL-constrained RL solution in terms of relative entropy. To analyze the properties of alignment methods, we introduce two simplifying assumptions: the language model is memoryless and the reward model is linear. Although these assumptions may not reflect complex real-world scenarios, they enable a precise characterization of the asymptotic (in the sequence length) behavior of both the best-of-$N$ and the KL-constrained RL methods in terms of information-theoretic quantities.
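
The abstract contrasts two alignment strategies: the optimal KL-constrained RL solution (which, in the standard KL-regularized setting, takes the form of an exponential tilting of $\boldsymbol{p}$ by the reward) and best-of-$N$ sampling. The following is a minimal illustrative sketch of both on a toy discrete distribution, not the paper's construction or experimental setup; the names (`p`, `reward`, `beta`) and the toy alphabet are assumptions chosen purely for illustration.

```python
# Illustrative sketch (assumed toy setup, not from the paper): best-of-N
# sampling and an exponentially tilted distribution phi(y) ∝ p(y) exp(r(y)/beta)
# over a small discrete outcome space.
import numpy as np

rng = np.random.default_rng(0)

# Toy reference distribution p over a small discrete alphabet of "outputs".
outputs = np.arange(8)
p = rng.dirichlet(np.ones(len(outputs)))

# Toy scalar reward assigned to each output.
reward = rng.normal(size=len(outputs))


def best_of_n(n: int) -> int:
    """Draw n samples from p and return the one with the highest reward."""
    samples = rng.choice(outputs, size=n, p=p)
    return int(samples[np.argmax(reward[samples])])


def tilted_distribution(beta: float) -> np.ndarray:
    """Exponentially tilted distribution phi(y) ∝ p(y) * exp(r(y) / beta).

    Larger beta keeps phi close to p (small KL budget Delta); smaller beta
    concentrates phi on high-reward outputs (large Delta).
    """
    logits = np.log(p) + reward / beta
    logits -= logits.max()  # numerical stability before exponentiating
    phi = np.exp(logits)
    return phi / phi.sum()


def kl(q: np.ndarray, p_ref: np.ndarray) -> float:
    """Relative entropy D_KL(q || p_ref) in nats."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p_ref[mask])))


if __name__ == "__main__":
    phi = tilted_distribution(beta=0.5)
    print("E_p[r]            =", float(p @ reward))
    print("E_phi[r]          =", float(phi @ reward))
    print("KL(phi || p)      =", kl(phi, p))
    print("best-of-16 reward =", float(reward[best_of_n(16)]))
```

Sweeping `beta` traces out the reward vs. KL trade-off curve for the tilted family, which can then be compared against the reward and KL divergence achieved by best-of-$N$ as $N$ grows.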