Keywords: LLMs, Differential Privacy, Alignment
TL;DR: We propose LLM-based alignment techniques that ensure differential privacy for the labelers of the preference datasets typically used in alignment.
Abstract: Alignment is a crucial step in the implementation pipeline of Large Language Models (LLMs) that uses human feedback to ensure that LLMs adhere to human values and societal norms. This introduces privacy threats associated with the identity and preferences of the labelers responsible for creating the human feedback data. Several recent works have explored differential privacy (DP) as a notion for protecting the privacy of human-labeled data, relying primarily on DP-SGD based solutions, which privatize the gradients during fine-tuning and alignment. Human preferences, however, are associated only with the labels of the (prompt, response) tuples; DP-SGD based approaches can therefore be superfluous, providing more privacy than necessary and degrading model utility. In this work, we focus on the problem of aligning LLMs with preference-level privacy, which preserves only the privacy of the preferences provided by humans. We build and expand upon the concept of label DP for this problem and present a series of increasingly sophisticated, yet practical, privacy-preserving mechanisms for alignment. Specifically, starting from a standard randomized response (RR) mechanism, which randomly flips human preferences, and its corresponding \textit{unbiased} RR mechanism (which ensures an unbiased loss during alignment), we propose a new mechanism, PROPS (PROgressively Private Self-alignment). PROPS works in multiple stages: in each stage, the privately trained and partially aligned model from the previous stage acts as a labeler for the training data of the next stage, and its labels are combined with RR; this process is repeated across stages. The motivation for PROPS comes from two critical observations: a) learning to label correct preferences may be an easier problem than generating responsible content; b) progressively combining RR with partially aligned models for labeling preferences significantly reduces the amount of perturbation needed for privacy and also shows the potential to reduce the number of human-labeled preference samples. We present proof-of-concept experiments that demonstrate the feasibility and effectiveness of our proposed approach and show that alignment with preference privacy can still attain utility comparable to its non-privately aligned counterparts.
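For concreteness, a minimal sketch of the standard randomized response step described in the abstract, assuming binary preference labels (the unbiased-loss correction and the PROPS staging are not shown); the function name and interface here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def randomized_response(preferences, epsilon, rng=None):
    """Randomized response for binary preference labels.

    Each label (0 or 1) is kept with probability e^eps / (e^eps + 1)
    and flipped otherwise, which satisfies epsilon-label differential
    privacy for the labeler's preference.
    NOTE: illustrative sketch only; the paper's actual mechanism
    (unbiased RR, PROPS staging) is not reproduced here.
    """
    rng = np.random.default_rng() if rng is None else rng
    preferences = np.asarray(preferences)
    p_keep = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    keep = rng.random(preferences.shape) < p_keep
    return np.where(keep, preferences, 1 - preferences)

# Example: privatize five binary preferences with epsilon = 1.0
noisy_labels = randomized_response([1, 0, 1, 1, 0], epsilon=1.0)
```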
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7500