TL;DR: Formal symbolic characterizations of DPO and preference learning.
Abstract: Recent direct preference alignment (DPA) algorithms, such as DPO, have shown great promise in aligning large language models with human preferences. While this has motivated the development of many new variants of the original DPO loss, understanding the differences between these recent proposals, as well as developing new DPA loss functions, remains difficult given the lack of a technical and conceptual framework for reasoning about the underlying semantics of these algorithms. In this paper, we attempt to remedy this by formalizing DPA losses in terms of discrete reasoning problems. Specifically, we ask: given an existing DPA loss, can we systematically derive a symbolic program that characterizes its semantics? We propose a novel formalism for characterizing preference losses for single-model and reference-model-based approaches, and identify symbolic forms for a number of commonly used DPA variants. Further, we show how this formal view of preference learning sheds new light on both the size and structure of the DPA loss landscape, making it possible not only to rigorously characterize the relationships between recent loss proposals but also to systematically explore the landscape and derive new loss functions from first principles. We hope our framework and findings will provide useful guidance to those working on human-AI alignment.
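For concreteness, the original DPO objective (Rafailov et al., 2023) that these variants build on can be stated as follows, where $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is a fixed reference model, and $(x, y_w, y_l)$ is a prompt paired with preferred and dispreferred responses:
$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$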
Lay Summary: We study a core problem in the development of large language models such as ChatGPT called preference alignment. This involves training models to mimic the preferences of certain agents (e.g., preferences about how language models should behave given certain inputs, such as avoiding generating offensive content). Our specific aim is to better understand the underlying algorithms used for preference alignment by formalizing them in terms of symbolic programs (i.e., the kinds of formal representations familiar from traditional computer science). Such programs allow us to formally reason about existing algorithms, as well as to derive new algorithms from first principles.
Primary Area: Deep Learning->Algorithms
Keywords: preference learning, neuro-symbolic, logic, reasoning, model alignment
Submission Number: 12091