Robust Chain of Thoughts Preference Optimization

Published: 01 Aug 2024, Last Modified: 09 Oct 2024 · EWRL17 · CC BY 4.0
Keywords: RLHF, preference learning, LLM alignment
Abstract: Learning from human preferences has become the dominant paradigm in RL fine-tuning of large language models (LLMs). In particular human preferences are often distilled in the form of a reward model. Then this reward model is used through online RL methods to fine-tune the LLM. Alternatively offline methods like direct preference optimization (DPO) and Identity Preference Optimization (IPO) use contrastive losses to optimize the LLM directly by increasing the gap between the log-likelihoods of preferred and dis-preferred completions. Despite their success, these methods all suffer from a fundamental problem that their optimal solution highly depends on (and heavily optimized for) the behavior policy that has generated the completions of the preferences dataset. Therefore the solution of the existing methods may be prone to out-of-distribution (OOD) tasks where the validation dataset is significantly different from the behavior policy. Here we address this challenge by proposing Robust Chain of Thoughts Optimization (RoCoTO) of preferences, a practical and mathematically principled offline framework for reinforcement learning from human feedback that is completely robust to the changes in the behavior model. The key idea of \rocoto is to cast the problem of learning from human preferences as a self-improving chain of thoughts (CoT) process in which the goal is to learn a policy that is nearly perfect in the sense that its generations can be only minimally improved through the best self-improving CoT model. We show that this idea can be mathematically expressed in terms of a min-max optimization objective that aims at joint optimization of chain-of-thoughts policy and the main generative policy in an adversarial fashion. The solution for this joint optimization problem is independent of the behavior policy and thus it is robust to the changes in the behavior model. We then show that this objective can be re-expressed in the form of a non-adversarial IPO (DPO)-style (offline) loss which can be optimized using standard supervised optimization techniques at scale without any need for reward model and online inference. We show the effectiveness of RoCoTO in solving TL;DR summarization task and show its superiority to the baseline IPO and DPO when evaluated on OOD XSUM.
Submission Number: 146