Preference Learning for AI Alignment: a Causal Perspective

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose to adopt a causal framework for preference learning to define and address challenges like causal misidentification, preference heterogeneity, and crucially, confounding due to user-specific objectives.
Abstract: Reward modelling from preference data is a crucial step in aligning large language models (LLMs) with human values, requiring robust generalisation to novel prompt-response pairs. In this work, we propose to frame this problem in a causal paradigm, bringing the rich toolbox of causality to bear on persistent challenges such as causal misidentification, preference heterogeneity, and confounding due to user-specific factors. Drawing on the causal inference literature, we identify key assumptions necessary for reliable generalisation and contrast them with common data collection practices. We illustrate failure modes of naive reward models and demonstrate how causally-inspired approaches can improve model robustness. Finally, we outline desiderata for future research and practice, advocating targeted interventions to address the inherent limitations of observational data.
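For concreteness, the sketch below illustrates the standard reward-modelling setup the abstract refers to: a scalar reward model fitted to pairwise preference data with a Bradley-Terry (logistic) likelihood. This is a minimal, hypothetical example in PyTorch, not the authors' implementation (see the linked repository for that); the `RewardModel` architecture, feature dimension, and synthetic data are illustrative assumptions.

```python
# Minimal sketch of Bradley-Terry reward modelling from pairwise preferences.
# Hypothetical example; not the code from the linked repository.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a prompt-response feature vector to a scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected);
    # minimise the negative log-likelihood of the observed preferences.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Usage with synthetic features (illustrative dimensions only).
model = RewardModel(dim=16)
x_chosen, x_rejected = torch.randn(32, 16), torch.randn(32, 16)
loss = bradley_terry_loss(model(x_chosen), model(x_rejected))
loss.backward()
```

Under this objective, any feature that correlates with annotators' choices in the training data can be absorbed into the fitted reward, which is why spurious correlates and user-specific confounders are a natural concern from a causal perspective.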
Lay Summary: Aligning large language models (LLMs) with human values hinges on teaching them to recognise and prefer helpful, safe, and appropriate responses—an area known as reward modelling from human preferences. Traditionally, this involves collecting comparisons between different model outputs and training a model to learn what people tend to prefer. However, this process often overlooks deeper challenges, such as why people prefer certain outputs and how those preferences might vary across users or situations. In this work, we propose viewing reward modelling through a causal lens—a framework used to distinguish true cause-effect relationships from misleading patterns in data. This shift allows us to pinpoint and address core problems in current practices, such as learning from biased data, failing to account for varying user preferences, or mistaking irrelevant patterns for meaningful signals. By applying tools from causal inference, we show how some commonly used reward models can go wrong, and how models built with causal principles can better generalise to new situations. We also provide a roadmap for future research, encouraging the AI community to rethink how we collect and use human feedback. In particular, we recommend more deliberate experimentation and data collection strategies to overcome the limitations of passive, observational data.
Link To Code: https://github.com/kasia-kobalczyk/causal-preference-learning
Primary Area: Deep Learning->Large Language Models
Keywords: Preference learning, alignment, reward modelling, causality, robustness, confounding, heterogeneity
Submission Number: 12328