WARP: Weight Teleportation for Attack-Resilient Unlearning Protocols

ICLR 2026 Conference Submission 21911 Authors

19 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: Machine unlearning, Approximate unlearning, Neural teleportation, Weight-space symmetries, Privacy attacks, Membership inference, Model inversion, Data reconstruction
TL;DR: A plug-and-play weight-teleportation step reduces alignment with forget-set gradients while preserving predictions, cutting membership-inference (MIA) and data-reconstruction (DRA) attack success across unlearning methods without harming retain-set accuracy.
Abstract: Approximate machine unlearning aims to efficiently remove the influence of specific data points from a trained model, offering a practical alternative to full retraining. However, it introduces privacy risks: an adversary with access to both the original and unlearned models can exploit their differences for membership inference or data reconstruction. We show these vulnerabilities arise from two factors: large gradient norms of forgotten samples and the close proximity of the unlearned model to the original. To demonstrate their severity, we design unlearning-specific membership inference and reconstruction attacks, showing that several state-of-the-art methods (such as NGP and SCRUB) remain vulnerable. To mitigate this leakage, we introduce WARP, a plug-and-play teleportation defense that leverages neural network symmetries to reduce gradient energy of forgotten samples and increase parameter dispersion while preserving accuracy. This reparameterization hides the signal of forgotten data, making it harder for attackers to distinguish forgotten samples from non-members or to recover them through reconstruction. Across six unlearning algorithms, our approach achieves consistent privacy gains, reducing adversarial advantage by up to 64% in black-box settings and 92% in white-box settings, while maintaining accuracy on retained data. These results highlight teleportation as a general tool for improving privacy in approximate unlearning.
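The teleportation defense rests on the fact that many networks admit exact weight-space symmetries: reparameterizations that change individual weights (and hence per-sample gradients) without changing the function the network computes. Below is a minimal NumPy sketch of one such symmetry, the positive-scaling symmetry of ReLU units, assuming a toy two-layer network; the random scale-sampling rule shown is a hypothetical placeholder, not the paper's WARP criterion for reducing alignment with forget-set gradients.

```python
# Minimal sketch of a function-preserving "teleportation" step for a
# two-layer ReLU network y = W2 @ relu(W1 @ x + b1) + b2. Illustrates the
# weight-space symmetry the abstract alludes to; this is NOT the authors'
# WARP implementation, and the scale choice below is a hypothetical stand-in.
import numpy as np

rng = np.random.default_rng(0)

# Toy network parameters.
d_in, d_hidden, d_out = 4, 8, 3
W1 = rng.normal(size=(d_hidden, d_in))
b1 = rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_out, d_hidden))
b2 = rng.normal(size=d_out)

def forward(W1, b1, W2, b2, x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def teleport(W1, b1, W2, scales):
    """Rescale each hidden unit by s > 0 and undo it in the next layer.

    Because relu(s * z) = s * relu(z) for s > 0, the network's input-output
    map is unchanged, while the individual weights (and therefore their
    gradients) can change substantially.
    """
    s = np.asarray(scales)
    assert np.all(s > 0), "ReLU scaling symmetry requires positive scales"
    return W1 * s[:, None], b1 * s, W2 / s[None, :]

# Sample positive per-unit scales. WARP would instead choose them to reduce
# gradient energy on the forget set and increase parameter dispersion.
scales = rng.uniform(0.5, 2.0, size=d_hidden)
W1_t, b1_t, W2_t = teleport(W1, b1, W2, scales)

# The teleported model computes the same function with different weights.
x = rng.normal(size=d_in)
assert np.allclose(forward(W1, b1, W2, b2, x),
                   forward(W1_t, b1_t, W2_t, b2, x))
```

Because the reparameterized weights sit at a different point in parameter space while predictions are untouched, an attacker comparing the original and unlearned models sees a weaker, less aligned difference signal; this is the property the abstract's MIA/DRA reductions are built on.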
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 21911