GUARD: Gold-Unchanged Anchored Distillation for Defending LLMs Against Membership Inference Attacks

04 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: safety, privacy, MIA defense
Abstract: Large language models (LLMs) are widely fine-tuned for domain-specific tasks, often on data that is sensitive and private. This heightens the risk of membership inference attacks (MIAs), which aim to infer whether a particular sample appeared in the training data. Prior work has developed increasingly strong MIAs against fine-tuned LLMs, but practical and effective defenses remain limited. The core challenge is a privacy-utility tension: fine-tuning improves utility by increasing confidence on the ground-truth (“gold”) token, yet this shift creates statistical differences that reveal membership. In this work, we introduce GUARD (Gold-Unchanged Anchored Distillation), a novel, robust, and lightweight defense that mitigates privacy leakage while preserving model utility. GUARD first fine-tunes a teacher model on the downstream data to capture its generalization and memorization behavior. It then constructs an anchored target distribution by fixing the gold token’s probability to its pre-trained value and preserving the fine-tuned model’s ranking among non-gold tokens while assigning them pre-trained magnitudes. A student is then distilled to match this target. This design suppresses the dominant membership signal while retaining task-relevant distributional structure. Across diverse model families and benchmarks, GUARD achieves state-of-the-art downstream utility, stronger robustness against membership inference attacks, improved efficiency, and strong scalability across tasks. Code will be released upon acceptance.
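Since the abstract describes the anchored-target construction fairly concretely, the PyTorch sketch below illustrates one possible reading of it; this is not the authors' released code, and the function name `anchored_target`, the tensor shapes, and the per-position batching are assumptions. The idea shown: keep the gold token at its pre-trained probability, reorder the pre-trained non-gold probability magnitudes according to the fine-tuned teacher's ranking, and use the result as the distillation target.

```python
import torch
import torch.nn.functional as F


def anchored_target(pretrained_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    gold_ids: torch.Tensor) -> torch.Tensor:
    """Sketch of a GUARD-style anchored target distribution.

    pretrained_logits, teacher_logits: (batch, vocab) logits at a token position.
    gold_ids: (batch,) indices of the gold token.
    Returns target probabilities of shape (batch, vocab).
    """
    p_pre = F.softmax(pretrained_logits, dim=-1)     # pre-trained probabilities
    batch, vocab = p_pre.shape
    rows = torch.arange(batch, device=p_pre.device)

    # 1) Anchor: the gold token keeps its pre-trained probability.
    gold_prob = p_pre[rows, gold_ids]                # (batch,)

    target = torch.zeros_like(p_pre)
    for b in range(batch):
        mask = torch.ones(vocab, dtype=torch.bool, device=p_pre.device)
        mask[gold_ids[b]] = False

        # 2) Pre-trained magnitudes of the non-gold tokens, sorted descending.
        pre_vals, _ = p_pre[b, mask].sort(descending=True)

        # 3) Reassign those magnitudes following the fine-tuned teacher's ranking:
        #    the teacher's top non-gold token receives the largest pre-trained mass.
        non_gold_idx = mask.nonzero(as_tuple=True)[0]
        order = teacher_logits[b, non_gold_idx].argsort(descending=True)
        target[b, non_gold_idx[order]] = pre_vals
        target[b, gold_ids[b]] = gold_prob[b]

    # Renormalize to guard against floating-point drift.
    return target / target.sum(dim=-1, keepdim=True)
```

Under this reading, the student would then be trained to match the target, e.g. with `F.kl_div(student_log_probs, target, reduction="batchmean")`; the exact loss, temperature, and schedule are whatever the paper specifies.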
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 2171