A Unified Theoretical Analysis of Private and Robust Offline Alignment: from RLHF to DPO

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Spotlight Poster · CC BY 4.0
TL;DR: A unified reduction framework for analyzing private and robust offline alignment
Abstract: In this paper, we theoretically investigate the effects of noisy labels in offline alignment, with a focus on the interplay between privacy and robustness against adversarial corruption. Specifically, under linear modeling assumptions, we present a unified analysis covering both reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under two privacy-corruption scenarios: Local differential privacy-then-Corruption (LTC), where human preference labels are privatized before being corrupted by an adversary, and Corruption-then-Local differential privacy (CTL), where labels are corrupted before privacy protection. Our analysis leverages a reduction framework that reduces the offline alignment problem, under linear modeling assumptions, to parameter estimation in logistic regression. This framework allows us to establish an interesting separation result between LTC and CTL, demonstrating that LTC presents a greater challenge than CTL in offline alignment, even under linear models. As important by-products, our findings also advance the state-of-the-art theoretical results in offline alignment under privacy-only and corruption-only scenarios.
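To make the two scenarios concrete, the following is a minimal illustrative sketch (not the paper's estimator): it simulates preference labels from a linear Bradley-Terry model, applies local differential privacy via randomized response and adversarial label flips in the two orders (LTC vs. CTL), and then recovers the reward parameter by fitting a logistic regression, mirroring the reduction of offline alignment under linear models to logistic-regression estimation. The privacy budget, corruption rate, dimensions, and the random (rather than adaptive) corruption model are all illustrative assumptions.

```python
# Illustrative sketch only: all parameters, names, and the random corruption
# model below are assumptions, not the paper's construction.
import numpy as np

rng = np.random.default_rng(0)

d, n = 5, 20000                                  # feature dimension, number of preference pairs
theta_star = rng.normal(size=d)                  # true linear reward parameter
X = rng.normal(size=(n, d))                      # feature differences phi(x, a1) - phi(x, a0)
p = 1.0 / (1.0 + np.exp(-X @ theta_star))        # Bradley-Terry preference probabilities
y = (rng.random(n) < p).astype(int)              # clean preference labels

eps = 1.0                                        # local DP budget (assumed)
flip_dp = 1.0 / (1.0 + np.exp(eps))              # randomized-response flip probability
alpha = 0.05                                     # adversarial corruption rate (assumed)

def randomized_response(labels):
    """Flip each label independently with probability 1/(1+e^eps) (eps-LDP)."""
    return np.where(rng.random(len(labels)) < flip_dp, 1 - labels, labels)

def corrupt(labels):
    """Adversary flips an alpha-fraction of labels (a worst-case adversary is
    adaptive; a random subset is used here purely for illustration)."""
    idx = rng.choice(len(labels), size=int(alpha * len(labels)), replace=False)
    out = labels.copy()
    out[idx] = 1 - out[idx]
    return out

y_ltc = corrupt(randomized_response(y))          # LTC: privatize, then corrupt
y_ctl = randomized_response(corrupt(y))          # CTL: corrupt, then privatize

def fit_logistic(X, y, iters=200, lr=0.5):
    """Plain gradient descent on the logistic loss (the reduced problem)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (1.0 / (1.0 + np.exp(-X @ theta)) - y) / len(y)
        theta -= lr * grad
    return theta

for name, labels in [("LTC", y_ltc), ("CTL", y_ctl)]:
    theta_hat = fit_logistic(X, labels)
    # Compare directions only: label noise shrinks the logistic slope.
    err = np.linalg.norm(theta_hat / np.linalg.norm(theta_hat)
                         - theta_star / np.linalg.norm(theta_star))
    print(f"{name}: normalized parameter error = {err:.3f}")
```

Under this toy setup the two pipelines differ only in the order of the label transformations; the paper's separation result concerns the statistical difficulty of estimation under these orders, which a random-flip simulation like this can illustrate but not prove.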
Lay Summary: When training AI systems to follow human instructions, it’s important to learn from human preferences while keeping those preferences private and safe from manipulation. In this work, we explore what happens when the labels—people’s feedback—are both noisy and privacy-protected. We compare two possible situations: one where privacy is applied first and then the data is tampered with, and another where tampering happens first and then privacy is applied. We find that the order of these steps matters: protecting data before tampering makes the learning task much harder. Our analysis helps explain why some methods for training AI systems might perform better than others when dealing with noisy, private data—and provides guidance on how to build more reliable and trustworthy AI.
Primary Area: Social Aspects->Privacy
Keywords: Differential Privacy; LLM Alignment; Robustness
Submission Number: 10182