Keywords: Causal Inference, Machine Learning for Decision Making, Large Language Models (LLMs), Reward Modeling
TL;DR: We highlight both the potential value and the risks of using observational data for LLM alignment, and introduce a theoretical framework and method to mitigate confounding effects.
Abstract: Large language models are being widely used across industries to generate text that contributes directly to key performance metrics, such as medication adherence in patient messaging and conversion rates in content generation. Pretrained models, however, often fall short when it comes to aligning with human preferences or optimizing for business objectives. As a result, fine-tuning with high-quality labeled data is essential to guide models toward content that achieves better results. Controlled experiments, like A/B tests, can provide such data, but they are often expensive and come with significant engineering, logistical, and ethical challenges. Meanwhile, companies have access to a vast amount of historical (observational) data that remains underutilized. In this work, we study the challenges and opportunities of fine-tuning LLMs using observational data. We show that while observational outcomes can provide valuable supervision, directly fine-tuning models on such data can lead them to learn spurious correlations. We present empirical evidence of this issue using various real-world datasets and propose DeconfoundLM, a method that explicitly removes the effect of known confounders from reward signals. In simulation experiments, DeconfoundLM more accurately recovers causal relationships and mitigates failure modes of methods that assume counterfactual invariance, achieving over 16% higher objective scores than ODIN and other baselines when entangled confounding is present. Please refer to the project page (https://deconfoundlm.github.io/) for code and related resources.
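To make the core idea concrete, below is a minimal sketch of one way to remove the effect of known confounders from observational reward signals by residualizing rewards against confounding covariates. This is an illustrative assumption, not the authors' DeconfoundLM implementation; the variable names (rewards, confounders) and the linear-regression adjustment are hypothetical choices for exposition.

```python
# Minimal sketch (assumption; not the paper's actual method): residualize
# observational rewards against known confounders before using them as
# fine-tuning supervision, so the model is not rewarded for correlations
# driven purely by the confounders.
import numpy as np
from sklearn.linear_model import LinearRegression

def deconfound_rewards(rewards: np.ndarray, confounders: np.ndarray) -> np.ndarray:
    """Remove the confounder-predictable component from observed rewards.

    rewards:     shape (n,) observational outcomes (e.g., conversion indicators)
    confounders: shape (n, d) known confounding covariates (e.g., send time, user segment)
    Returns the residual rewards after regressing out the confounders.
    """
    model = LinearRegression().fit(confounders, rewards)
    return rewards - model.predict(confounders)

# Illustrative usage on synthetic data: the adjusted rewards would then be fed
# to a reward model or preference-tuning pipeline instead of the raw outcomes.
rng = np.random.default_rng(0)
confounders = rng.normal(size=(1000, 3))
rewards = confounders @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=1000)
adjusted = deconfound_rewards(rewards, confounders)
```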
Submission Number: 30