Keywords: Alignment, Preference Learning, Plurality
TL;DR: A novel alignment framework to learn from heterogeneous human preferences
Abstract: Large foundation models require extensive \textit{alignment} to human preferences before deployment. Existing methods for alignment from comparison data largely assume a universal preference, neglecting the diversity of individual opinions. We introduce PAL, a personalizable reward framework that models the \emph{plurality} of human preferences via latent variables, combining the ideal point model, metric learning, and mixture modeling. PAL captures heterogeneous preferences while learning a common preference latent space, enabling few-shot generalization to new users. It is modular, interpretable, and flexible: model complexity can be adjusted via data-driven cross-validation. With a simple multi-layer perceptron, PAL achieves competitive reward-model accuracy on the heterogeneous preference datasets Summary \cite{stiennon2020learning} (language), Pick-a-Pic \cite{kirstain2024pick} (image generation), and Persona \cite{perez2022discovering} (semi-synthetic), matching state-of-the-art performance with greater efficiency. Finally, our findings highlight the need for more nuanced data collection to capture the heterogeneity of human preferences.
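To make the described setup concrete, below is a minimal, hypothetical PyTorch sketch of a PAL-style personalized reward model, not the authors' implementation: a shared MLP maps item embeddings into a common latent space, each user's ideal point is a convex combination of K learned prototypes (the mixture component), reward is the negative squared distance to that ideal point (the ideal point model), and pairwise comparisons are fit with a Bradley-Terry-style loss. All names, dimensions, and architectural choices are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PALReward(nn.Module):
    """Hypothetical PAL-style reward model: mixture of ideal points in a learned latent space."""

    def __init__(self, item_dim: int, latent_dim: int = 32, n_prototypes: int = 4, n_users: int = 100):
        super().__init__()
        # Shared MLP mapping item embeddings into the common preference latent space.
        self.encoder = nn.Sequential(
            nn.Linear(item_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim)
        )
        # K prototype ideal points shared across all users.
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, latent_dim))
        # Per-user mixture logits over prototypes; a new user only needs these few
        # parameters, which is what would enable few-shot generalization.
        self.user_logits = nn.Parameter(torch.zeros(n_users, n_prototypes))

    def reward(self, item_emb: torch.Tensor, user_idx: torch.Tensor) -> torch.Tensor:
        z = self.encoder(item_emb)                          # (B, latent_dim)
        w = F.softmax(self.user_logits[user_idx], dim=-1)   # (B, K) mixture weights
        ideal = w @ self.prototypes                         # (B, latent_dim) user ideal points
        return -((z - ideal) ** 2).sum(dim=-1)              # higher reward = closer to ideal

    def preference_loss(self, winner_emb, loser_emb, user_idx):
        # Bradley-Terry-style objective: maximize P(winner preferred over loser).
        margin = self.reward(winner_emb, user_idx) - self.reward(loser_emb, user_idx)
        return F.binary_cross_entropy_with_logits(margin, torch.ones_like(margin))


# Toy usage: 8 pairwise comparisons from random users over 16-dim item embeddings.
model = PALReward(item_dim=16)
users = torch.randint(0, 100, (8,))
loss = model.preference_loss(torch.randn(8, 16), torch.randn(8, 16), users)
loss.backward()
```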
Submission Number: 30