AI Alignment with Provable Protection of Human Judgements

ICLR 2026 Conference Submission 25286 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Alignment, RLHF, performance guarantees, asymptotic match
Abstract: Reinforcement learning from human preference rankings forms the basis for training language models to be helpful and value-aligned. As these powerful AI systems are trained for increasingly high-stakes tasks, the risk of leaking sensitive human training data grows. Protecting human preference data is complicated, however, by the fact that reinforcement learning from human feedback is a multistage pipeline: a reward function is first learned from human preferences, and a language model policy is then trained against the learned rewards. To address these issues, we design algorithms for alignment from preference feedback that provably avoid leaking human preference data under both the Bradley-Terry and Plackett-Luce preference models. Our algorithms satisfy $\epsilon$-differential privacy ($\epsilon$-DP) while matching the minimax-optimal sample complexity for aligning a policy to human preference rankings. These results demonstrate that there is no inherent tradeoff between protecting the privacy of human preferences and efficient alignment with human values.
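For reference, the standard definitions of the preference models and privacy notion named in the abstract are recalled below; these are textbook formulations and are not drawn from the paper's body, so the paper's specific parameterization may differ.

$$
\text{Bradley-Terry:}\quad
\Pr\big[y_1 \succ y_0 \mid x\big]
= \frac{\exp\!\big(r(x, y_1)\big)}{\exp\!\big(r(x, y_0)\big) + \exp\!\big(r(x, y_1)\big)},
$$

$$
\text{Plackett-Luce (ranking } \sigma \text{ over } K \text{ responses):}\quad
\Pr\big[\sigma \mid x\big]
= \prod_{i=1}^{K} \frac{\exp\!\big(r(x, y_{\sigma(i)})\big)}{\sum_{j=i}^{K} \exp\!\big(r(x, y_{\sigma(j)})\big)},
$$

$$
\epsilon\text{-DP:}\quad
\Pr\big[M(D) \in S\big] \le e^{\epsilon}\, \Pr\big[M(D') \in S\big]
\quad \text{for all neighboring datasets } D, D' \text{ and measurable } S,
$$

where $r(x, y)$ denotes the reward assigned to response $y$ given prompt $x$, and $M$ is the randomized training mechanism whose output (the learned reward model or policy) must not reveal any individual preference record.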
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 25286