Keywords: Membership Inference Attacks; LLMs
Abstract: The proliferation of Large Language Models (LLMs) has been accompanied by escalating concerns over training data privacy. Membership Inference Attacks (MIAs), which aim to determine whether a specific data point was used during training, represent a significant privacy risk. However, the efficacy of existing MIAs against the scale and complexity of contemporary LLMs remains limited. This paper introduces OR-MIA, a novel, theoretically motivated MIA framework that substantially advances attack capabilities against LLMs. The approach is predicated on two key observations about model optimization and input robustness. First, as a consequence of standard optimization dynamics, data points seen during training tend to exhibit smaller gradient norms with respect to the model parameters. Second, member samples exhibit greater stability: their gradient norms are less sensitive to controlled input perturbations than those of non-member samples. OR-MIA operationalizes these insights by systematically perturbing input samples, computing a sequence of gradient norms, and using these norms as features for a robust classifier that distinguishes members from non-members. Extensive evaluations on LLMs ranging from 70M to 6B parameters, across several benchmark datasets, demonstrate that OR-MIA achieves significantly higher attack accuracy and sample efficiency than existing state-of-the-art methods, often exceeding 90\% accuracy. Our findings reveal a critical and previously underappreciated vulnerability in LLMs, establishing a new benchmark for MIA performance and underscoring the urgent need for more effective privacy-preserving training paradigms.
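The abstract describes the attack pipeline only at a high level (perturb inputs, compute gradient norms, classify). Below is a minimal illustrative sketch of that pipeline, not the authors' implementation: the perturbation scheme (random token replacement), the model name, the number of perturbations, and the downstream classifier are all assumptions chosen for the example.

```python
# Illustrative sketch (not the OR-MIA reference code): membership features built from
# gradient norms under input perturbations, assuming a HuggingFace causal LM and PyTorch.
# The perturbation scheme and hyperparameters below are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def gradient_norm(model, input_ids):
    """L2 norm of the loss gradient w.r.t. all model parameters for one sample."""
    model.zero_grad(set_to_none=True)
    outputs = model(input_ids=input_ids, labels=input_ids)
    outputs.loss.backward()
    sq = sum((p.grad.detach() ** 2).sum()
             for p in model.parameters() if p.grad is not None)
    return torch.sqrt(sq).item()


def perturb(input_ids, vocab_size, rate=0.1):
    """Replace a random fraction of tokens with random vocabulary tokens (assumed scheme)."""
    perturbed = input_ids.clone()
    mask = torch.rand(perturbed.shape) < rate
    random_tokens = torch.randint(0, vocab_size, perturbed.shape)
    perturbed[mask] = random_tokens[mask]
    return perturbed


def membership_features(model, tokenizer, text, num_perturbations=8, rate=0.1):
    """Feature vector: gradient norm of the original sample plus norms under perturbations."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    input_ids = enc["input_ids"]
    feats = [gradient_norm(model, input_ids)]
    for _ in range(num_perturbations):
        feats.append(gradient_norm(model, perturb(input_ids, tokenizer.vocab_size, rate)))
    return feats


# Usage (assumed model choice): collect features for texts with known membership labels
# and fit any off-the-shelf classifier on them.
# model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
# tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
# X = [membership_features(model, tokenizer, t) for t in texts]
# from sklearn.linear_model import LogisticRegression
# clf = LogisticRegression().fit(X, labels)
```

Under the two stated principles, member samples should yield feature vectors with both smaller norms overall and smaller variation across perturbations, which is what the classifier exploits.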
Submission Number: 34