Direct Reward Distillation: A Point-wise Alignment Approach

ACL ARR 2025 February Submission 8425 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Direct Alignment Algorithms (DAAs) are widely used to align Large Language Models (LLMs) with human preferences. Current DAAs use pairwise optimization objectives based on variants of Direct Preference Optimization (DPO). However, these methods focus only on the pairwise differences between samples and cannot prevent optimization from reducing the probabilities of preferred responses. In this paper, we present Direct Reward Distillation (DRD), an algorithm that uses an explicit reward model to optimize the policy by setting an exact probability target for each response. DRD decouples the target reward differential from the bias in the alignment objective, and exploits not only the relationship within each response pair but also the relationships across pairs. Our experiments show that DRD outperforms existing methods while providing control over the policy's response probabilities.
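To illustrate the point-wise idea described in the abstract, the following is a minimal sketch of a reward-distillation-style loss. It assumes a per-response log-probability target derived from a frozen reference model and an explicit reward model score, with a scale `beta` and offset `bias`; these names and the exact target formula are illustrative assumptions, not the paper's actual objective.

```python
import torch.nn.functional as F

def drd_style_loss(policy_logps, ref_logps, rewards, beta=0.1, bias=0.0):
    """Point-wise sketch: regress each response's policy log-probability
    onto an explicit target built from a reward model score.

    policy_logps: (batch,) summed log-probs of responses under the policy
    ref_logps:    (batch,) summed log-probs under the frozen reference model
    rewards:      (batch,) scalar scores from an explicit reward model
    beta, bias:   hypothetical scale and offset (illustrative, not from the paper)
    """
    # Hypothetical per-response target: reference log-prob shifted by the
    # scaled, de-biased reward. This sets an exact probability target for each
    # response individually, rather than a pairwise margin between responses.
    target_logps = ref_logps + beta * (rewards - bias)
    # Point-wise regression means a high-reward (preferred) response is pulled
    # toward a high target, so its probability is not pushed down to satisfy a
    # pairwise gap, unlike purely pairwise DPO-style objectives.
    return F.mse_loss(policy_logps, target_logps.detach())
```

Because each response has its own target, adjusting `beta` or `bias` directly changes where the policy's response probabilities land, which is one plausible reading of the controllability claim in the abstract.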
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: text-to-text generation, optimization methods, generative models
Contribution Types: NLP engineering experiment, Theory
Languages Studied: English
Submission Number: 8425