A Unified Perspective on Reward Distillation Through Ratio Matching

Published: 10 Jun 2025 · Last Modified: 06 Aug 2025 · MoFA Poster · CC BY 4.0
Keywords: reward learning, distillation, preference optimization, large language models
TL;DR: Introduces ratio matching framework for reward distillation
Abstract: While Direct Preference Optimization (DPO) revolutionized language model alignment by eliminating the need for explicit reward models and reinforcement learning, scenarios with access to high-quality reward models (RMs) trained on extensive preference datasets still benefit from leveraging these resources. Reward-model distillation techniques such as REBEL have emerged as a class of approaches that exploit an explicit reward model while avoiding the added complexity of reinforcement learning. In this paper, we show that REBEL can be derived as a ratio-matching objective with respect to DPO's optimal policy. In addition, we generalize ratio matching into distribution matching, formulating a new, principled alignment objective in the multi-completion setting where Group Relative Policy Optimization (GRPO) is commonly used.
Submission Number: 69
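To make the ratio-matching connection in the abstract concrete, the following is a minimal sketch under the standard KL-regularized RLHF setup. The notation (reference policy pi_ref, reward model r, temperature beta, completions y+ and y-) and the exact constants are illustrative assumptions, not necessarily the paper's formulation.

```latex
% DPO's optimal policy for the KL-regularized objective is
%   \pi^*(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\bigl(r(x, y)/\beta\bigr).
% Taking the log-ratio between two completions cancels the intractable partition function:
\[
\log \frac{\pi^*(y^+ \mid x)}{\pi^*(y^- \mid x)}
  = \log \frac{\pi_{\mathrm{ref}}(y^+ \mid x)}{\pi_{\mathrm{ref}}(y^- \mid x)}
  + \frac{1}{\beta}\bigl(r(x, y^+) - r(x, y^-)\bigr).
\]
% Regressing the trained policy's log-ratio onto this target gives a squared-loss
% objective on reward differences, i.e. a REBEL-style update:
\[
\min_{\theta}\;
\mathbb{E}_{x,\,y^+,\,y^-}\!\left[
  \left(
    \beta\!\left(
      \log \frac{\pi_\theta(y^+ \mid x)}{\pi_{\mathrm{ref}}(y^+ \mid x)}
      - \log \frac{\pi_\theta(y^- \mid x)}{\pi_{\mathrm{ref}}(y^- \mid x)}
    \right)
    - \bigl(r(x, y^+) - r(x, y^-)\bigr)
  \right)^{2}
\right].
\]
```

In the multi-completion setting, one natural group-level analogue is to match the policy's normalized probabilities over a sampled group of completions to the softmax of their rewards; the paper's exact distribution-matching objective may differ from this sketch.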