Keywords: Responsible AI, language models, protecting human judgements
Abstract: With the increasing ubiquity of large language models it has become crucial to ensure guarantees for models trained to be aligned with human values to avoid leaking information on the human judgements that have been provided to the algorithm. To target this issue we focus on the problem of alignment via reinforcement learning from human preference rankings, subject to the constraint of 
not revealing any information on the human data used to align the model. To achieve this, we analyze $(\epsilon,\delta)$-DP for both the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. We introduce a theoretically founded algorithm for learning rewards from human rankings that achieves this objective without leaking the human rankings. We further demonstrate that the privately learned rewards can be used to train policies achieving statistical performance guarantees that asymptotically match the best known algorithms in the non-private setting, which are in some cases minimax optimal. Strikingly, our analysis and our results reveal that it is possible to obtain the same model performance without any trade-off on the protection of the human judgments, and our paper provides the first algorithms that can achieve provable privacy of human judgements, while still producing aligned models with optimal performance.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11798
Loading