Understanding Unlearning Difficulty: A Mechanistic Perspective and Difficulty Metric

ACL ARR 2026 January Submission7454 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Unlearning, Difficulty, Mechanistic Interpretability
Abstract: Machine unlearning has emerged as a critical capability for trustworthy and compliant machine learning systems. Yet a fundamental question has received limited attention: why are some samples easy to unlearn while others are intrinsically hard? In this work, we address this gap from a mechanistic perspective. We leverage model circuits, structured pathways of interactions that govern prediction formation, to analyze how memorized information is encoded. We propose CUD, a principled metric that quantifies sample-wise unlearning difficulty prior to unlearning. Extensive experiments demonstrate that CUD reliably selects easy and hard samples for unlearning, i.e., the same unlearning procedure performs better on the selected easy samples while falling short on the hard ones. We identify key circuit-level patterns: easy-to-unlearn samples are associated with short interactions located in the shallow-to-intermediate parts of the original model, while hard samples involve longer edges located in deeper parts of the model. Compared to existing qualitative studies, CUD offers improved granularity and interpretability, and reveals the internal model mechanisms underlying unlearning difficulty. Our analysis provides new insights into how easy and hard samples are memorized and unlearned differently, potentially paving the way toward new unlearning methods grounded in model mechanisms.
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: explainability, mechanism, privacy, memorization
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 7454