Keywords: reward learning, distillation, preference optimization, large language models
TL;DR: Introduces a ratio matching framework for reward distillation
Abstract: While Direct Preference Optimization (DPO) revolutionized language model alignment by eliminating the need for explicit reward models and reinforcement learning, settings with access to a high-quality reward model (RM) trained on an extensive preference dataset can still benefit from leveraging that resource. Reward model distillation techniques such as REBEL have emerged as a class of approaches that avoid the added complexity of reinforcement learning. In this paper, we derive REBEL through a ratio matching framework and relate it to existing preference optimization methods, unifying the different approaches to reward distillation within the broader preference optimization landscape. Empirical evaluation on the HH-RLHF dataset with Pythia 2.8B in the offline setting shows that REBEL achieves twice the reward margin of DPO, demonstrating the advantage of incorporating explicit reward signals when available.
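To make the connection concrete, the sketch below states the offline REBEL regression objective alongside the DPO objective it is related to. The notation (policy $\pi_\theta$, reference policy $\pi_{\mathrm{ref}}$, reward model $r$, scaling parameters $\eta$ and $\beta$) follows common usage in the preference optimization literature and is not taken from this abstract, so treat it as an illustrative assumption rather than the paper's exact formulation.

```latex
% Illustrative sketch under standard notation (not verbatim from the paper):
% REBEL regresses the scaled log-probability ratio difference onto the
% reward-model margin for a response pair (y, y') given prompt x,
\mathcal{L}_{\mathrm{REBEL}}(\theta)
  = \mathbb{E}_{(x, y, y')}\!\left[
      \left(
        \frac{1}{\eta}\!\left(
          \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
          - \log \frac{\pi_\theta(y' \mid x)}{\pi_{\mathrm{ref}}(y' \mid x)}
        \right)
        - \bigl(r(x, y) - r(x, y')\bigr)
      \right)^{2}
    \right],
% whereas DPO passes the same log-ratio difference through a logistic loss
% using only the binary preference label y_w \succ y_l:
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x, y_w, y_l)}\!\left[
      \log \sigma\!\left(
        \beta\!\left(
          \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
          - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
        \right)
      \right)
    \right].
```

Viewed this way, the two objectives differ mainly in the target they fit the log-ratio difference to: REBEL uses the continuous reward margin from the RM, while DPO uses only the binary preference outcome.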
Submission Number: 69