Keywords: machine unlearning, mechanistic interpretability, factual recall
TL;DR: Mechanistic interpretability informs robust unlearning with fewer side effects, but current automated interpretability approaches have clear weaknesses
Abstract: Methods for machine unlearning in large language models seek to remove undesirable knowledge or capabilities without compromising general language modeling performance. This work investigates the use of mechanistic interpretability to improve the precision and effectiveness of unlearning.
We demonstrate that localizing unlearning to components implementing specific mechanisms of factual recall yields unlearning that is more robust across different input/output formats, relearning, and probes of latent knowledge, with fewer unintended side effects than non-localized unlearning.
Additionally, we analyze the strengths and weaknesses of different automated (rather than manual) interpretability methods for guiding unlearning, finding that the resulting unlearned models require smaller edits to achieve unlearning but are substantially less robust.
Submission Number: 100