TL;DR: How to handle different levels of preference in RLHF: use quantitative labels.
Abstract: The canonical setup of learning a reward model (RM) from human preferences with binary feedback discards potentially useful samples (such as "tied" between the two responses) and loses fine-grained information (such as "slightly better"). This paper proposes a framework for learning RMs under *ordinal feedback*, generalizing binary feedback to arbitrary granularity. We first identify a marginal unbiasedness condition, which generalizes the existing assumption underlying binary feedback. The condition is validated via the sociological concept of the "wisdom of the crowd". Under this condition, we develop a natural probability model and prove the benefits of fine-grained feedback in terms of reducing the Rademacher complexity, which may be of independent interest to another problem: the bias-variance trade-off in knowledge distillation. The framework also sheds light on designing guidelines for human annotators. Our numerical experiments validate that: (1) fine-grained feedback leads to better RM learning in both in- and out-of-distribution settings; (2) incorporating a certain proportion of tied samples boosts RM learning.
Lay Summary: Learning from human preferences is crucial to aligning large language models with human values. Human preference data are usually collected as pairwise comparisons: annotators are asked which of two LLM-generated responses they prefer. This preference is called binary feedback. However, more fine-grained feedback can also be collected by asking the annotators how much they prefer the chosen response. For example, different levels of "better", such as "significantly better", "better", and "slightly better", have been adopted in data collection. We aim to answer two questions: (1) How should we incorporate the fine-grained feedback into the current LLM training pipeline? (2) What are the benefits of the fine-grained feedback? The first question is answered by turning the qualitative descriptions into quantitative ones: for example, turning "slightly better" into "60% preference". As the well-known social experiment in *Vox Populi* has shown, the crowd's average quantitative estimate can be very accurate. The second question is answered by statistical learning theory, showing that more fine-grained feedback helps reduce the noise introduced by the design of the feedback system. For example, if you face a tied comparison but are given only two options, "better" and "worse", then no matter which you choose, your feedback carries unnecessary noise. But if a third option, "tied", is available, you can give accurate feedback.
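To make the quantitative-label idea concrete, below is a minimal sketch of a Bradley-Terry-style reward-model loss trained against soft preference targets instead of hard 0/1 labels. The mapping from ordinal categories to probabilities (e.g., "slightly better" = 0.6, "tied" = 0.5) and the function names are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Hypothetical mapping from ordinal feedback to soft preference labels.
# The specific values are illustrative; the paper only exemplifies
# "slightly better" as 60% preference.
ORDINAL_TO_SOFT_LABEL = {
    "significantly better": 0.9,
    "better": 0.75,
    "slightly better": 0.6,
    "tied": 0.5,
}

def ordinal_rm_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor,
                    soft_labels: torch.Tensor) -> torch.Tensor:
    """Soft-label Bradley-Terry loss.

    reward_chosen / reward_rejected: shape (batch,), scalar rewards from the RM.
    soft_labels: shape (batch,), target probability that the "chosen"
    response is preferred. With hard labels (all ones) this reduces to
    the standard binary preference loss.
    """
    logits = reward_chosen - reward_rejected  # log-odds of preferring "chosen"
    # Cross-entropy against the soft target probability.
    return F.binary_cross_entropy_with_logits(logits, soft_labels, reduction="mean")

# Toy usage with a batch of three comparisons.
if __name__ == "__main__":
    r_chosen = torch.tensor([1.2, 0.3, 0.8])
    r_rejected = torch.tensor([0.4, 0.3, 1.0])
    labels = torch.tensor([
        ORDINAL_TO_SOFT_LABEL["better"],
        ORDINAL_TO_SOFT_LABEL["tied"],
        ORDINAL_TO_SOFT_LABEL["slightly better"],
    ])
    print(ordinal_rm_loss(r_chosen, r_rejected, labels))
```

Note that tied samples contribute a nonzero gradient here (pushing the two rewards together) rather than being discarded, which is the behavior the framework advocates.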
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Deep Learning->Large Language Models
Keywords: reward modeling, ordinal feedback, human preference learning
Submission Number: 5447