TL;DR: Given an inference-time procedure such as Best-of-$N$, standard RLHF suffers from a train/test mismatch between the train-time and inference-time policies. InfAlign bridges this gap via reward calibration and transformation.
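As a rough illustration of the objective the TL;DR alludes to (our notation; the paper may parameterize this differently), inference-aware alignment seeks a policy $\pi$ whose inference-time transform $T(\pi)$ wins against the base policy $\pi_{\mathrm{ref}}$ subject to a KL budget:

$$\max_{\pi}\;\Pr_{x\sim\mu,\;y\sim T(\pi)(\cdot\mid x),\;y'\sim \pi_{\mathrm{ref}}(\cdot\mid x)}\big[y \succ y'\big]\quad \text{s.t.}\quad \mathrm{KL}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big) \le \delta,$$

where $T$ denotes the inference-time procedure (e.g., Best-of-$N$) and $y \succ y'$ means $y$ is preferred over $y'$.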
Abstract: Language model alignment is a critical step in training modern generative language models. Alignment aims to improve the win rate of a sample from the aligned model against the base model. Today, we are increasingly using inference-time algorithms (e.g., Best-of-$N$, controlled decoding, tree search) to decode from language models rather than standard sampling. We show that this train/test mismatch makes the standard RLHF framework sub-optimal in view of such inference-time methods. To this end, we propose a framework for inference-aware alignment (InfAlign), which aims to optimize the *inference-time win rate* of the aligned policy against the base model. We prove that for any inference-time decoding procedure, the optimal aligned policy is the solution to the standard RLHF problem with a *transformation* of the reward. This motivates the calibrate-and-transform RL (InfAlign-CTRL) algorithm for solving this problem, which involves a reward calibration step and a KL-regularized reward maximization step with a transformation of the calibrated reward. For best-of-$N$ sampling and best-of-$N$ jailbreaking, we propose specific transformations offering up to 3-8% improvement on inference-time win rates. Finally, we also show that our proposed reward calibration method is a strong baseline for optimizing the standard win rate.
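Below is a minimal, hypothetical Python sketch of the calibrate-and-transform idea for Best-of-$N$. The calibration step is assumed here to map a raw reward to its quantile among rewards of base-policy samples for the same prompt; the exponential transform, the parameter `t`, and the helpers `policy_sample` and `reward_fn` are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch of calibrate-and-transform plus Best-of-N decoding.
# Calibration and transform choices below are assumptions for illustration.
import numpy as np

def calibrate_reward(reward, ref_rewards):
    """Map a raw reward to its quantile among rewards of base-policy samples
    for the same prompt, giving a calibrated score in [0, 1] (assumption)."""
    ref_rewards = np.asarray(ref_rewards)
    return float(np.mean(ref_rewards <= reward))

def transform_calibrated(c, t=4.0):
    """Apply a monotone transformation to the calibrated reward before
    KL-regularized reward maximization; an exponential shape is one
    plausible choice for Best-of-N (assumption)."""
    return float(np.exp(t * c))

def best_of_n(prompt, policy_sample, reward_fn, n=8):
    """Standard Best-of-N decoding: draw n samples from the policy and
    return the one with the highest reward."""
    candidates = [policy_sample(prompt) for _ in range(n)]
    scores = [reward_fn(prompt, y) for y in candidates]
    return candidates[int(np.argmax(scores))]
```

The transformed calibrated reward would then be plugged into a standard KL-regularized RLHF objective in place of the raw reward.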
Primary Area: Deep Learning->Large Language Models
Keywords: language model, alignment, decoding, inference time procedure, best of n
Submission Number: 7511