Mutual-Taught for Boosting Policy and Reward Models

ACL ARR 2024 December Submission1857 Authors

16 Dec 2024 (modified: 05 Feb 2025), ACL ARR 2024 December Submission, CC BY 4.0
Abstract: Preference optimization has emerged as an effective technique for aligning large language models (LLMs) with human objectives. However, as training progresses, distribution shifts can occur between newly generated model samples and the data used to train the reward model (RM), reducing the RM's effectiveness and constraining the policy model's (PM) performance. To address this challenge, we propose a self-training technique called Mutual-Taught that jointly improves both the PM and the RM without relying on additional human supervision. Our method is inspired by the Expectation-Maximization (EM) algorithm. In the E-step, we update the PM based on feedback from the current RM, guiding the PM toward a better approximation of the latent optimal preference distribution. In the M-step, we update the RM by constructing training data from the PM's outputs before and after the E-step update, thereby adapting the RM to the evolving policy distribution. Experimental results show that this iterative process steadily improves both models. Our 8B policy model, LLaMA3-8B-Instruct-MT, achieves a length-controlled win rate of 52.0% on AlpacaEval-2. Meanwhile, our 8B reward model, FsfairX-LLaMA3-RM-MT, attains performance on par with GPT-4o-2024-08-06 on RewardBench.
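To make the iterative structure concrete, the following is a minimal Python sketch of one possible reading of the Mutual-Taught loop described in the abstract. The helper callables (generate, score, preference_update, reward_update) and the specific pairing choices, best-vs-worst RM-ranked samples in the E-step and post-update-vs-pre-update policy outputs in the M-step, are assumptions inferred from the abstract, not the authors' actual implementation.

```python
from typing import Callable, List, Tuple

def mutual_taught(
    policy,
    reward_model,
    prompts: List[str],
    generate: Callable,           # generate(model, prompt, n) -> list of n responses (assumed helper)
    score: Callable,              # score(reward_model, prompt, response) -> float (assumed helper)
    preference_update: Callable,  # policy update from (prompt, chosen, rejected) triples, e.g. a DPO-style step (assumed)
    reward_update: Callable,      # RM update from the same triple format, e.g. Bradley-Terry loss (assumed)
    num_iterations: int = 2,
    n_samples: int = 4,
) -> Tuple[object, object]:
    """Illustrative sketch of the EM-style Mutual-Taught loop; not the paper's exact recipe."""
    for _ in range(num_iterations):
        # E-step: rank samples from the current policy with the current RM and
        # run preference optimization on the resulting (chosen, rejected) pairs.
        pm_data = []
        for x in prompts:
            candidates = generate(policy, x, n_samples)
            ranked = sorted(candidates, key=lambda y: score(reward_model, x, y))
            pm_data.append((x, ranked[-1], ranked[0]))  # best vs. worst under the RM
        updated_policy = preference_update(policy, pm_data)

        # M-step: treat the updated policy's outputs as preferred over the
        # pre-update policy's outputs, so the RM tracks the shifting policy distribution.
        rm_data = []
        for x in prompts:
            y_after = generate(updated_policy, x, 1)[0]
            y_before = generate(policy, x, 1)[0]
            rm_data.append((x, y_after, y_before))
        reward_model = reward_update(reward_model, rm_data)

        policy = updated_policy
    return policy, reward_model
```

Passing the generation, scoring, and update routines in as callables keeps the sketch agnostic to the particular preference-optimization and RM training objectives the paper actually uses.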
Paper Type: Long
Research Area: Generation
Research Area Keywords: Generation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 1857