Keywords: Privacy, Membership Inference Attack, LLMs, Post-training
TL;DR: We show that existing MIAs can underestimate privacy leakage in preference-based LLM post-training, and introduce a stronger attack that uses both preferred and rejected responses to reveal higher leakage across several objectives.
Abstract: Preference-based post-training is critical for aligning large language models (LLMs) with human intent; however, it raises privacy concerns as the instruction and feedback data used in this stage may contain sensitive information, such as personal identifiers or user-specific preferences.
While membership inference attacks (MIAs) have been widely studied for pre-training and supervised fine-tuning, their effectiveness in the context of preference-based post-training remains less explored.
In this work, we systematically evaluate privacy vulnerabilities in modern post-training pipelines, focusing on *strong* MIAs for preference-based post-training.
We introduce LiRA-J, a preference-aware variant of the Likelihood Ratio Attack (LiRA) that uses both the preferred and the rejected response for membership inference on preference data.
Through experiments across a range of datasets and model families, we compare the most prevalent post-training approaches and characterize how their vulnerability differs. We further examine key factors that affect privacy risk in preference-based post-training, including regularization strategies.
Our findings highlight privacy vulnerabilities in preference-based post-training and underscore the need to audit aligned models with preference-aware membership inference protocols.
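As a rough illustration of what a likelihood-ratio test over a preference pair could look like, the sketch below scores a pair with a two-dimensional statistic. It is not the paper's definition of LiRA-J, which the abstract does not spell out. It assumes that each shadow model yields one pair of numbers per example (for instance the log-likelihoods of the preferred and of the rejected response), that these pairs are roughly Gaussian, and that shadow models trained with ("in") and without ("out") the example are available. The names `fit_gaussian_2d`, `logpdf_2d`, and `joint_lira_score` are hypothetical.

```python
import math


def fit_gaussian_2d(points, ridge=1e-6):
    """Fit a mean and a 2x2 covariance to a list of (a, b) pairs.

    `ridge` is added to the diagonal so the covariance stays invertible
    even when only a few shadow models are available (an assumption of
    this sketch, not something taken from the paper).
    """
    n = len(points)
    ma = sum(p[0] for p in points) / n
    mb = sum(p[1] for p in points) / n
    saa = sum((p[0] - ma) ** 2 for p in points) / (n - 1) + ridge
    sbb = sum((p[1] - mb) ** 2 for p in points) / (n - 1) + ridge
    sab = sum((p[0] - ma) * (p[1] - mb) for p in points) / (n - 1)
    return (ma, mb), (saa, sab, sbb)


def logpdf_2d(point, mean, cov):
    """Log-density of a bivariate Gaussian with covariance [[saa, sab], [sab, sbb]]."""
    saa, sab, sbb = cov
    det = saa * sbb - sab * sab
    da, db = point[0] - mean[0], point[1] - mean[1]
    # Mahalanobis distance using the closed-form inverse of a 2x2 matrix.
    maha = (sbb * da * da - 2.0 * sab * da * db + saa * db * db) / det
    return -0.5 * (2.0 * math.log(2.0 * math.pi) + math.log(det) + maha)


def joint_lira_score(target_stat, in_stats, out_stats):
    """Log-likelihood ratio log p(stat | in) - log p(stat | out).

    target_stat: (preferred, rejected) statistic measured on the audited model.
    in_stats / out_stats: the same statistic from shadow models trained
    with / without the example. Larger scores suggest membership.
    """
    mean_in, cov_in = fit_gaussian_2d(in_stats)
    mean_out, cov_out = fit_gaussian_2d(out_stats)
    return logpdf_2d(target_stat, mean_in, cov_in) - logpdf_2d(
        target_stat, mean_out, cov_out
    )


# Toy, made-up shadow statistics purely to exercise the functions.
toy_in = [(-1.0, -5.0), (-1.2, -4.6), (-0.8, -5.3), (-1.1, -5.1), (-0.9, -4.8)]
toy_out = [(-3.0, -3.0), (-3.2, -2.7), (-2.8, -3.3), (-2.9, -3.1), (-3.1, -2.8)]
score_member_like = joint_lira_score((-1.0, -5.0), toy_in, toy_out)
score_nonmember_like = joint_lira_score((-3.0, -3.0), toy_in, toy_out)
```

Modeling the two responses jointly, rather than scoring only the preferred one, lets the test pick up on shifts in either response and on how the two move together. Whether this matches the statistic used by LiRA-J is an assumption of the sketch.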
Submission Number: 28