Keywords: LLM, Alignment, RLHF, Model Merging
TL;DR: We introduce a new RLHF strategy; by merging the weights of the trained policies, we improve alignment while limiting forgetting and reward hacking.
Abstract: Reinforcement learning from human feedback (RLHF) aligns large language models by encouraging their generations to have high rewards, using a reward model trained on human preferences. To prevent the forgetting of pre-trained knowledge, RLHF usually incorporates a KL regularization; this forces the policy to remain close to its initialization, but it also hinders reward optimization. To address this trade-off between KL and reward, in this paper we introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP), which merges policies in weight space at three distinct stages. First, it uses the exponential moving average of the policy as a dynamic anchor in the KL regularization. Second, it applies spherical interpolation to merge independently fine-tuned policies into a new, enhanced one. Third, it linearly interpolates between this merged model and the initialization, to recover features from pre-training. This procedure is applied iteratively, with each iteration's final model used as an advanced initialization for the next, progressively refining the KL-reward Pareto front and achieving superior rewards at a fixed KL. Experiments with Gemma policies validate that WARP improves their quality and alignment, outperforming open-source models.
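To make the three merging stages described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch, not the authors' implementation: it treats each policy as a single flat parameter tensor, and the function names (`ema_update`, `slerp`, `warp_merge`) and coefficients (`mu`, `eta`) are illustrative assumptions.

```python
import torch

def ema_update(theta_anchor, theta_policy, mu=0.01):
    """Stage 1 (sketch): update the exponential-moving-average anchor
    used as a dynamic reference in the KL regularization."""
    return (1.0 - mu) * theta_anchor + mu * theta_policy

def slerp(v0, v1, t=0.5, eps=1e-8):
    """Spherical linear interpolation between two weight deltas (task vectors)."""
    v0_n = v0 / (v0.norm() + eps)
    v1_n = v1 / (v1.norm() + eps)
    omega = torch.arccos((v0_n * v1_n).sum().clamp(-1.0, 1.0))
    so = torch.sin(omega)
    if so.abs() < eps:  # nearly collinear deltas: fall back to linear interpolation
        return (1.0 - t) * v0 + t * v1
    return (torch.sin((1.0 - t) * omega) / so) * v0 + (torch.sin(t * omega) / so) * v1

def warp_merge(theta_init, thetas_rl, eta=0.5):
    """Stages 2 and 3 (sketch): spherically merge the task vectors of
    independently fine-tuned policies, then linearly interpolate the merged
    model towards the initialization to recover pre-trained features."""
    deltas = [theta - theta_init for theta in thetas_rl]
    merged = deltas[0]
    for delta in deltas[1:]:
        merged = slerp(merged, delta, t=0.5)
    theta_slerp = theta_init + merged
    return (1.0 - eta) * theta_init + eta * theta_slerp
```

Under these assumptions, `warp_merge(theta_init, [theta_a, theta_b], eta=0.5)` would produce the merged policy that seeds the next iteration of the procedure.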
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2104