Keywords: reward learning, distillation, preference optimization, large language models
TL;DR: Introduces a ratio matching framework for reward distillation
Abstract: While Direct Preference Optimization (DPO) revolutionized language model alignment by eliminating the need for explicit reward models and reinforcement learning, settings with access to a high-quality reward model (RM) trained on an extensive preference dataset can still benefit from leveraging that resource. Reward model distillation techniques such as REBEL have emerged as a class of approaches that avoid the added complexity of reinforcement learning. In this paper, we derive REBEL through a ratio matching framework and relate it to existing preference optimization methods, unifying the different approaches to reward distillation within the broader preference optimization landscape. Empirical evaluation on the HH-RLHF dataset with Pythia 2.8B in the offline setting shows that REBEL achieves twice the reward margin of DPO, demonstrating the advantage of incorporating explicit reward signals when available.
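To make the connection concrete, the sketch below states the offline REBEL regression objective alongside the DPO objective it is related to. The notation (policy $\pi_\theta$, reference policy $\pi_{\mathrm{ref}}$, reward model $r$, scaling parameters $\eta$ and $\beta$) follows common usage in the preference optimization literature and is not taken from this abstract, so treat it as an illustrative assumption rather than the paper's exact formulation.

```latex
% Illustrative sketch under standard notation (not verbatim from the paper):
% REBEL regresses the scaled log-probability ratio difference onto the
% reward-model margin for a response pair (y, y') given prompt x,
\mathcal{L}_{\mathrm{REBEL}}(\theta)
  = \mathbb{E}_{(x, y, y')}\!\left[
      \left(
        \frac{1}{\eta}\!\left(
          \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
          - \log \frac{\pi_\theta(y' \mid x)}{\pi_{\mathrm{ref}}(y' \mid x)}
        \right)
        - \bigl(r(x, y) - r(x, y')\bigr)
      \right)^{2}
    \right],
% whereas DPO passes the same log-ratio difference through a logistic loss
% using only the binary preference label y_w \succ y_l:
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x, y_w, y_l)}\!\left[
      \log \sigma\!\left(
        \beta\!\left(
          \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
          - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
        \right)
      \right)
    \right].
```

Viewed this way, the two objectives differ mainly in the target they fit the log-ratio difference to: REBEL uses the continuous reward margin from the RM, while DPO uses only the binary preference outcome.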
Submission Number: 69