Keywords: Large Language Models, preference learning, reinforcement learning, sentence embeddings
TL;DR: Improving preference learning with interpretable ratings.
Abstract: Current Large Language Model (LLM) preference learning methods such as Proximal Policy Optimization and Direct Preference Optimization rely on direct rankings or numerical ratings of model outputs as a way to learn human preferences. These rankings are subjective, and a single numerical rating chosen directly by a judge is a poor metric for quantifying a system as complex as human language. This paper introduces the What Is Missing (WIM) rating system to create better rankings for preference learning methods. WIM is a straightforward method that can be integrated into existing training pipelines, combined with other rating techniques, and used as the input to any preference learning method without changes. To create a WIM rating, natural language feedback for a model output is given by a human or LLM judge. Both the output and the feedback are passed through a sentence embedding model, and the cosine similarity between the resulting high-dimensional vectors is calculated. Theoretical benefits in the distribution of WIM ratings, compared to numerical ratings, translate into lower loss throughout training, better reward advantage scaling, and better performance on the trained task. Importantly, WIM is interpretable, as the reason for a given ranking can be recovered easily. WIM provides an alternative way to think about preference learning by shifting the focus away from the algorithms themselves and onto improving the preference data generation pipeline.
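As a rough illustration of the rating computation described in the abstract, the sketch below embeds a model output and a judge's natural language feedback and returns their cosine similarity. The sentence-transformers library, the specific embedding model, and the example texts are assumptions for illustration only; the abstract does not prescribe an embedding model or specify how the similarity value maps onto a final ranking.

# Minimal sketch of a WIM-style rating, assuming the sentence-transformers
# library and an arbitrary embedding model (neither is named in the abstract).
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Hypothetical model choice; any sentence embedding model could be substituted.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def wim_rating(model_output: str, judge_feedback: str) -> float:
    # Embed both texts and compute the cosine similarity between the vectors.
    vectors = embedder.encode([model_output, judge_feedback])
    return float(cos_sim(vectors[0], vectors[1]))

# Hypothetical usage: feedback describes what is missing from the output.
output = "The mitochondria is the powerhouse of the cell."
feedback = "The answer omits how ATP is produced and gives no supporting detail."
print(wim_rating(output, feedback))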
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18345