Personalized Language Modeling from Personalized Human Feedback

Published: 05 Mar 2024, Last Modified: 08 May 2024 · ICLR 2024 R2-FM Workshop Poster · CC BY 4.0
Keywords: Reinforcement Learning from Human Feedback, Personalization, Large Language Models
TL;DR: We propose a general Personalized-RLHF framework, which jointly learns a user model and a language (or reward) model, to align the language model behavior with individual user preferences.
Abstract: Reinforcement Learning from Human Feedback (RLHF) is currently the dominant framework for fine-tuning large language models to better align with human preferences. However, the underlying premise of algorithms developed under this framework can be problematic when the user preferences encoded in human feedback are diverse. In this work, we address this problem by developing methods for building personalized language models. We propose a general Personalized-RLHF (P-RLHF) framework, in which a user model and a language (or reward) model are learned jointly. We develop new learning objectives for personalized reward modeling and personalized Direct Preference Optimization. To demonstrate the efficacy of our method, we evaluate it on real-world text summarization data with annotated preferences and annotator information. We fine-tune GPT-J 6B to obtain personalized language (and reward) models, which outperform their non-personalized counterparts in aligning with individual user preferences.
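To make the joint user/language modeling described above concrete, the following is a minimal sketch of what a personalized DPO objective could look like, assuming the user model is a simple embedding table whose output conditions the policy (e.g., as a soft prompt prepended to the input). The names `UserModel` and `personalized_dpo_loss`, and the embedding-table design, are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch (not the authors' code): a per-user embedding conditions the
# policy, and the DPO loss is computed on user-conditioned log-probabilities.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UserModel(nn.Module):
    """Maps a user id to an embedding that conditions the language model."""

    def __init__(self, num_users: int, dim: int):
        super().__init__()
        # Index 0 is reserved as a shared "generic user" embedding, so the
        # model can still serve users unseen at training time (an assumption
        # of this sketch).
        self.embed = nn.Embedding(num_users + 1, dim)

    def forward(self, user_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(user_ids)


def personalized_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x, u)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x, u)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO objective; personalization enters through the policy
    log-probs, which are computed with the user embedding prepended."""
    logits = beta * (
        (policy_chosen_logps - ref_chosen_logps)
        - (policy_rejected_logps - ref_rejected_logps)
    )
    return -F.logsigmoid(logits).mean()


if __name__ == "__main__":
    # Toy usage: random log-probabilities stand in for model outputs.
    user_model = UserModel(num_users=10, dim=16)
    user_emb = user_model(torch.tensor([3, 3, 7, 0]))  # (batch, dim) soft prompt
    loss = personalized_dpo_loss(
        policy_chosen_logps=torch.randn(4),
        policy_rejected_logps=torch.randn(4),
        ref_chosen_logps=torch.randn(4),
        ref_rejected_logps=torch.randn(4),
    )
    print(user_emb.shape, f"personalized DPO loss: {loss.item():.4f}")
```

In this sketch the loss itself is the usual DPO objective; what makes it "personalized" is that the chosen/rejected log-probabilities would be computed by a policy conditioned on the user embedding, so different users can induce different preference orderings over the same responses.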
Submission Number: 64