Robust Knowledge Unlearning via Mechanistic Localizations

Published: 28 Jun 2024, Last Modified: 25 Jul 2024NextGenAISafety 2024 PosterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: machine unlearning, mechanistic interpretability, factual recall
TL;DR: Mechanistic interpretability informs robust unlearning with fewer side effects, but current automated interpretability approaches have clear weaknesses
Abstract: Methods for machine unlearning in large language models seek to remove undesirable knowledge or capabilities without compromising general language modeling performance. This work investigates the use of mechanistic interpretability to improve the precision and effectiveness of unlearning. We demonstrate that localizing unlearning to components with particular mechanisms in factual recall leads to more robust unlearning across different input/output formats, relearning, and latent knowledge, and reduces unintended side effects compared to nonlocalized unlearning. Additionally, we analyze the strengths and weaknesses of different automated (rather than manual) interpretability methods for guiding unlearning, finding that their corresponding unlearned models require smaller edit sizes to achieve unlearning but are much less robust.
Submission Number: 100
Loading