Decoding-Time Language Model Alignment with Multiple Objectives

Published: 18 Jun 2024, Last Modified: 07 Jul 2024, TF2M 2024 Poster, CC BY 4.0
Keywords: multi-objective alignment, decoding-time algorithms, RLHF
TL;DR: We propose a training-free, simple yet effective decoding-time algorithm for multi-objective alignment of language models, with optimality guarantees.
Abstract: Aligning language models (LMs) to human preferences has emerged as a critical pursuit, enabling these models to better serve diverse user needs. Existing methods primarily focus on optimizing LMs for a single reward function, limiting their adaptability to varied objectives. Here, we propose $\textbf{multi-objective decoding (MOD)}$, a decoding-time algorithm that outputs the next token from a linear combination of the predictions of all base models, for any given weighting of the objectives. We exploit a common form shared by a family of $f$-divergence regularized alignment approaches (such as PPO, DPO, and their variants) to identify a closed-form solution via the Legendre transform, and derive an efficient decoding strategy. Theoretically, we show why existing approaches can be sub-optimal even in natural settings and obtain optimality guarantees for our method. Experiments validate our claims.
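To make the decoding rule concrete, here is a minimal sketch of the per-step combination, assuming reverse-KL regularization (the PPO/DPO case) and weights $w_i$ summing to 1; under those assumptions the closed form reduces to $\pi(y \mid x) \propto \prod_i \pi_i(y \mid x)^{w_i}$, i.e., a weighted sum of the base models' next-token log-probabilities followed by renormalization. The function name and toy numbers below are illustrative, not from the paper's code.

```python
import numpy as np

def mod_next_token_dist(logprobs, weights):
    """Sketch of the MOD combination step under the reverse-KL assumption:
    a weighted sum of per-model next-token log-probabilities, renormalized
    into a distribution (a weighted geometric mean of the base policies)."""
    logprobs = np.asarray(logprobs)   # shape: (num_models, vocab_size)
    weights = np.asarray(weights)     # shape: (num_models,), assumed to sum to 1
    combined = weights @ logprobs     # weighted sum of log-probabilities
    combined -= combined.max()        # stabilize before exponentiation
    probs = np.exp(combined)
    return probs / probs.sum()

# Toy usage: two single-objective models over a 4-token vocabulary,
# weighting the first objective at 0.7 and the second at 0.3.
p1 = np.log([0.70, 0.10, 0.10, 0.10])
p2 = np.log([0.10, 0.70, 0.10, 0.10])
print(mod_next_token_dist([p1, p2], [0.7, 0.3]))
```

Sampling the next token from the returned distribution at each step yields the training-free, weight-steerable decoding described in the abstract; in practice the log-probabilities would come from forward passes of the base LMs rather than fixed toy vectors.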
Submission Number: 6