Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Spotlight Poster · CC BY 4.0
TL;DR: Localizing LLM factual editing/unlearning to components identified by certain kinds of mechanistic analysis yields substantially more robust edits and unlearning.
Abstract: Methods for knowledge editing and unlearning in large language models seek to edit or remove undesirable knowledge or capabilities without compromising general language modeling performance. This work investigates how mechanistic interpretability---which, in part, aims to identify model components (circuits) associated with specific interpretable mechanisms that make up a model capability---can improve the precision and effectiveness of editing and unlearning. We find a stark difference in unlearning and edit robustness depending on which localization method selects the components that are trained. We highlight an important distinction between methods that localize components based primarily on preserving outputs, and those that identify high-level mechanisms with predictable intermediate states. In particular, localizing edits/unlearning to components associated with the *lookup-table mechanism* for factual recall 1) leads to more robust edits/unlearning across different input/output formats, and 2) resists attempts to relearn the unwanted information, while also reducing unintended side effects compared to baselines, on both a sports facts dataset and the CounterFact dataset across multiple models. We also find that certain localized edits disrupt the latent knowledge in the model more than any baseline, making unlearning more robust to a variety of attacks.
Lay Summary: We address the problem of editing or removing specific factual knowledge from large language models without compromising overall performance. Much of the existing literature simply leverages stronger adversarial training, which remains weak against basic attacks that recover the original knowledge. We approach editing differently: by understanding mechanistically where the undesired knowledge resides, so that we can fully remove it from the weights. We first use mechanistic interpretability to identify the precise components in the model responsible for connecting subjects in the prompt to their corresponding unwanted factual representations. Then we fine-tune only these components to represent alternative facts, and find that this yields an edited model far more resistant to attempts at recovering the original knowledge. A robust factual editing method provides a way to surgically remove harmful knowledge and sensitive or private information, enhancing control over AI behavior and enabling safer language models.
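To make the two-step recipe in the lay summary concrete, below is a minimal sketch (not the authors' code) of the second step: all weights are frozen except the components flagged by a mechanistic localization pass, and only those are fine-tuned toward the edited facts. The model choice, the layer/module names in `LOCALIZED_COMPONENTS`, the learning rate, and the `edit_step` helper are illustrative placeholders; in the paper, the localized components come from analysis of the factual-recall lookup-table mechanism, not from hard-coded indices.

```python
# Sketch of localized fine-tuning for factual editing/unlearning.
# Assumes PyTorch + Hugging Face transformers; all names below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6b"  # any causal LM; placeholder choice
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Suppose localization pointed at the MLPs of a few early-middle layers
# (the "lookup-table" components for factual recall). The actual indices
# would come from the mechanistic analysis, not from this hard-coded list.
LOCALIZED_COMPONENTS = ["transformer.h.4.mlp", "transformer.h.5.mlp", "transformer.h.6.mlp"]

# Freeze everything except the localized components.
for name, param in model.named_parameters():
    param.requires_grad = any(name.startswith(c) for c in LOCALIZED_COMPONENTS)

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)

def edit_step(prompt: str, new_target: str) -> float:
    """One gradient step pushing only the localized components toward the edited fact."""
    inputs = tokenizer(prompt + " " + new_target, return_tensors="pt")
    labels = inputs["input_ids"].clone()  # simplified: loss over the whole sequence
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Usage: iterate edit_step over the edit set, e.g.
# edit_step("The Eiffel Tower is located in", "Rome")
```

Restricting the optimizer to the localized parameters is what distinguishes this from ordinary fine-tuning: the rest of the network is untouched, which is what the paper argues reduces side effects and makes relearning the removed facts harder.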
Primary Area: Deep Learning->Large Language Models
Keywords: Model Editing, Unlearning, Mechanistic Interpretability, Localization, Adversarial Robustness
Submission Number: 9408