Axiomatic Preference Modeling for Longform Question Answering

Corby Rosset; Guoqing Zheng; Victor Dibia; Ahmed Hassan Awadallah; Paul N. Bennett

Published: 07 Oct 2023, Last Modified: 01 Dec 2023

Venue: EMNLP 2023 Main

Submission Type: Regular Long Paper

Submission Track: Language Modeling and Analysis of Language Models

Submission Track 2: Question Answering

Keywords: reward modeling, preference modeling, RLHF, Large Language Models, long form question answering

TL;DR: This paper presents an axiomatic framework for generating preference pairs and training a standalone preference model to improve alignment when choosing among candidate answers from a large language model

Abstract: The remarkable abilities of large language models (LLMs) like ChatGPT and GPT-4 partially stem from post-training processes involving human preferences encoded within a reward model as part of a Reinforcement Learning from Human Feedback (RLHF) regimen. These reward models (RMs) often lack direct knowledge of why, or under what principles, the preference annotations were made. In this study, we identify principles that guide RMs to better align with human preferences, and then develop an axiomatic framework to generate a rich variety of preference signals to uphold them. We use these axiomatic signals to train a model for scoring answers to longform questions. Our approach yields a Preference Model with only about 220M parameters that agrees with gold human-annotated preference labels more often than GPT-4. The contributions of this work include: training a standalone preference model that can score human- and LLM-generated answers on the same scale; developing an axiomatic framework for generating training data pairs tailored to certain principles; and showing that a small number of axiomatic signals can help small models outperform GPT-4 in preference scoring. We intend to release our axiomatic data and model.
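Illustrative sketch (not from the paper): the abstract does not state the training objective or the evaluation code, so the following only shows one common way pairwise preference signals are used. It assumes a Bradley-Terry style logistic loss over (preferred, rejected) answer pairs and measures agreement as the fraction of pairs where a scorer ranks the gold-preferred answer higher. The names `pairwise_logistic_loss`, `agreement_rate`, and `toy_score` are hypothetical, and `toy_score` is a word-overlap placeholder, not the 220M-parameter Preference Model.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical record layout: (question, preferred_answer, rejected_answer).
Pair = Tuple[str, str, str]


def pairwise_logistic_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style loss, -log(sigmoid(s_pref - s_rej)).

    Assumption: the paper's actual objective may differ; this is a common choice.
    """
    margin = score_preferred - score_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def agreement_rate(score: Callable[[str, str], float], pairs: List[Pair]) -> float:
    """Fraction of pairs where the scorer ranks the gold-preferred answer higher."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    hits = sum(1 for q, good, bad in pairs if score(q, good) > score(q, bad))
    return hits / len(pairs)


def toy_score(question: str, answer: str) -> float:
    """Placeholder scorer: number of distinct question words found in the answer."""
    q_words = set(question.lower().split())
    return float(len(q_words & set(answer.lower().split())))
```

Because every answer receives a single scalar score, human-written and LLM-generated answers can be compared on the same scale, as the abstract describes. A trained model would take the place of `toy_score` in `agreement_rate`.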

Submission Number: 2820
