Towards Aligning Language Models with Textual Feedback

Published: 17 Jun 2024 (Last Modified: 02 Jul 2024)
Venue: ICML 2024 Workshop MHFAIA (Poster)
License: CC BY 4.0
Keywords: Alignment, textual feedback
TL;DR: We develop an approach that allows AI alignment using human feedback expressed as text.
Abstract: We present ALT (ALignment with Textual feedback), an approach that aligns models with user preferences expressed in text. We posit that text provides users a richer feedback interface than comparative preferences. In our work, we explore the efficacy and efficiency of textual feedback across several tasks. For the task of reducing model toxicity, we show that even rule-based feedback can reduce model toxicity 62% more than PPO in-domain and 52% more out-of-domain. For the task of summarization, we show that ALT can match the performance of PPO with only 20% of the training samples, both in- and out-of-domain. Finally, for the task of aligning dialog to be harmless and helpful, we find that ALT can effectively use textual feedback provided by a Large Language Model without the need for a reward model.
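To make the idea of rule-based textual feedback concrete, the sketch below shows one possible way such feedback could be attached to model generations for feedback-conditioned fine-tuning. It is an illustrative assumption, not the paper's actual implementation: the keyword list, thresholds, feedback strings, and function names are hypothetical placeholders.

```python
# Illustrative sketch only -- not the paper's implementation. It shows one way
# rule-based textual feedback on toxicity could be attached to generations so a
# model can later be fine-tuned conditioned on that feedback. The keyword list,
# thresholds, and feedback strings are hypothetical placeholders.

from typing import List, Tuple

TOXIC_KEYWORDS = {"idiot", "stupid", "hate"}  # hypothetical rule set


def rule_based_toxicity(text: str) -> float:
    """Crude rule-based toxicity score: fraction of words matching the keyword list."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip(".,!?") in TOXIC_KEYWORDS)
    return hits / len(words)


def textual_feedback(score: float) -> str:
    """Map the numeric score to a short textual feedback string."""
    if score == 0.0:
        return "Nontoxic and respectful."
    if score < 0.2:
        return "Mildly toxic; soften the wording."
    return "Highly toxic; remove insults entirely."


def build_feedback_conditioned_samples(
    prompts: List[str], generations: List[str]
) -> List[Tuple[str, str]]:
    """Pair each (prompt, generation) with textual feedback, producing inputs of
    the form 'feedback + prompt' -> generation for feedback-conditioned training."""
    samples = []
    for prompt, gen in zip(prompts, generations):
        fb = textual_feedback(rule_based_toxicity(gen))
        conditioned_input = f"Feedback: {fb}\nPrompt: {prompt}"
        samples.append((conditioned_input, gen))
    return samples


if __name__ == "__main__":
    demo = build_feedback_conditioned_samples(
        ["Reply to the comment:"], ["You are an idiot and I hate this."]
    )
    print(demo[0][0])
```

Under this kind of scheme, the feedback string plays the role that a scalar reward plays in PPO-style pipelines, which is consistent with the abstract's claim that an LLM's textual feedback can be used directly without a separate reward model.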
Submission Number: 70