Keywords: unlearning, privacy, LLMs
TL;DR: DAWI is an unlearning algorithm that does not use a traditional optimizer and reaches competitive results on TOFU.
Abstract: Large Language Models (LLMs) are trained on vast amounts of text data and frequently memorize sensitive or private information that appears in the training corpus. This raises significant privacy, security, and ethical concerns, particularly when such information can be extracted through adversarial prompts or exposed by membership inference attacks. Machine unlearning has therefore emerged as an important research direction, with the goal of selectively removing a model's knowledge of problematic information while preserving its general language understanding and reasoning capabilities. In this work, we focus on unlearning information memorized during fine-tuning, which commonly occurs when inference providers fine-tune models on user interactions. We introduce Dual Anchor Weighted Interpolation (DAWI), a simple yet effective unlearning algorithm that achieves state-of-the-art results on the TOFU benchmark and demonstrates strong unlearning efficacy with minimal hyperparameter tuning, making it practical for real-world deployment.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 6887