KARMA: Keyword-Aware Representation Modification for Efficient and Robust Model Amnesiac Unlearning

ICLR 2026 Conference Submission 16266 Authors

19 Sept 2025 (modified: 08 Oct 2025)
License: CC BY 4.0
Keywords: Machine unlearning, Pre-trained language model, Natural language processing
Abstract: Pre-trained Language Models (LMs) struggle to efficiently remove specific data samples and the associated knowledge due to their massive scale and computational requirements. Existing machine unlearning methods suffer from excessive parameter updates and an imbalanced trade-off between forgetting and retained performance. We first derive the critical insight that fine-tuning only a single layer of the model achieves performance competitive with full-model fine-tuning. Inspired by this observation, we introduce KARMA (Keyword-Aware Representation Modification for Model Amnesiac Unlearning), which efficiently erases representation traces by selectively perturbing the embedding parameters of semantically critical tokens, while restricting parameter updates to a bounded spherical region to preserve stability. Specifically, to identify high-influence keywords, we introduce a Fisher scoring mechanism that precisely captures the semantics of the data to be forgotten. To further enhance privacy during unlearning, we propose a keyword-driven pseudo-sample method that eliminates the need for raw data by inserting keyword embeddings into irrelevant corpora. Moreover, to mitigate the adverse impact on the remaining samples, we propose a bounded fine-tuning regularization strategy that prevents excessive semantic drift in the representation space. The efficiency of KARMA is underpinned by a rigorous convergence-radius analysis, and its robustness on remaining samples is theoretically guaranteed by the bounded regularization strategy. Experiments on sentiment classification show that KARMA achieves near-retraining efficacy with a 99.5% reduction in parameter updates compared to gradient-based methods, while incurring minimal performance degradation on retained data.
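For concreteness, the sketch below illustrates the pipeline the abstract describes in plain PyTorch, assuming a HuggingFace-style classifier whose forward pass returns a `.loss` when `labels` are supplied. All function names (`fisher_keyword_scores`, `make_pseudo_sample`, `project_to_sphere`, `karma_step`) and hyperparameters (`k`, `lr`, `radius`) are illustrative assumptions for exposition, not the authors' actual implementation.

```python
import torch

def fisher_keyword_scores(model, forget_loader, device="cpu"):
    """Approximate a per-token Fisher score over the input-embedding matrix.

    Accumulates squared gradients of the forget-set loss w.r.t. each
    embedding row; tokens whose rows collect the most gradient mass are
    treated as the keywords carrying the forget-set semantics.
    """
    model.to(device).train()
    emb = model.get_input_embeddings().weight        # (vocab_size, dim)
    scores = torch.zeros(emb.size(0), device=device)
    for input_ids, labels in forget_loader:
        model.zero_grad()
        loss = model(input_ids=input_ids.to(device),
                     labels=labels.to(device)).loss
        loss.backward()
        scores += emb.grad.detach().pow(2).sum(dim=1)
    return scores

def select_keywords(scores, k=50):
    """Token ids of the k highest-scoring keywords."""
    return torch.topk(scores, k).indices

def make_pseudo_sample(carrier_ids, keyword_ids, n_insert=2):
    """Splice keyword tokens into an unrelated carrier sentence, so the
    raw forget data never has to be revisited during unlearning."""
    ids = carrier_ids.clone()
    pos = torch.randint(0, ids.numel(), (n_insert,))
    ids[pos] = keyword_ids[torch.randint(0, keyword_ids.numel(), (n_insert,))]
    return ids

def project_to_sphere(rows, anchors, radius):
    """Clip each perturbed embedding row back into an L2 ball of the given
    radius around its original value, bounding semantic drift."""
    delta = rows - anchors
    norm = delta.norm(dim=-1, keepdim=True).clamp_min(1e-12)
    return anchors + delta * (radius / norm).clamp(max=1.0)

def karma_step(model, batch, keyword_ids, anchor_emb, lr=1e-2, radius=0.5):
    """One amnesiac update: ascend the loss on a pseudo-sample batch,
    touching only the selected keyword rows, then project them back."""
    emb = model.get_input_embeddings().weight
    model.zero_grad()
    loss = model(input_ids=batch["input_ids"], labels=batch["labels"]).loss
    loss.backward()
    with torch.no_grad():
        rows = keyword_ids
        emb[rows] += lr * emb.grad[rows]             # gradient ascent to forget
        emb[rows] = project_to_sphere(emb[rows], anchor_emb[rows], radius)
```

Under these assumptions, the two claims in the abstract map directly onto the code: restricting updates to the keyword rows of the embedding matrix is what yields the large reduction in touched parameters, and the per-row projection onto a sphere around the original embeddings (`anchor_emb` being a clone saved before unlearning) is the bounded regularization that limits drift on retained data.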
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 16266