Keywords: Privacy Protection for Large Language Models, Neuron Detection and Editing, Machine Unlearning
Abstract: The rapid advancement of large language models (LLMs) has significantly propelled downstream innovation, yet pervasive sensitive information in training data and the models' tendency to memorize it pose severe privacy leakage risks. This contravenes core requirements of the General Data Protection Regulation (GDPR), including the right to be forgotten, and has become a critical bottleneck for secure, compliant deployment. Existing privacy protection methods have notable limitations: data preprocessing fails to cover context-dependent sensitive information; differential privacy (DP) and homomorphic encryption (HE) degrade model performance and increase computational overhead; traditional machine unlearning may cause catastrophic collapse; and neuron editing methods struggle with the accuracy-efficiency trade-off in privacy neuron localization, the privacy seesaw phenomenon, and general performance degradation. To address these challenges, this paper proposes LDEDE, a Layer-wise Relevance Propagation (LRP)-driven framework for efficient privacy neuron detection and editing. It offers three core advantages: 1) precise multi-scale privacy localization via LRP-based relevance backpropagation and multi-token attention aggregation, achieving over 80% higher efficiency than gradient attribution methods; 2) the first identification of "coupled privacy neurons" in LLMs, a key cause of the privacy seesaw phenomenon, which the proposed Polarity-Aware Neuron Editing (PANE) mitigates with differentiated editing logic; 3) enhanced robustness and generalization in batch processing via privacy neuron aggregation. Experiments on the Enron and MIMIC datasets demonstrate that, compared to baselines, LDEDE maintains comparable general performance while reducing leakage risks for phone numbers, email addresses, and medical information by 42.7%–73.5% on average and cutting computation time by 60%–90%. It also performs stably across GPT-2, BERT-base, and LLaMA-7B, providing an efficient, lightweight solution for post-deployment dynamic LLM privacy protection.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: NLP Applications, Efficient/Low-Resource Methods for NLP
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Approaches to low compute settings-efficiency
Languages Studied: English
Submission Number: 1683