Selective Knowledge Unlearning via Self-Distillation with Auxiliary Forget-Set Model

Published: 11 Jun 2025, Last Modified: 11 Jun 2025 · MUGen @ ICML 2025 Poster · CC BY 4.0
Keywords: Machine Unlearning, Large Language Models, Self Distillation
TL;DR: We propose a teacher-student-forget distillation framework for language model unlearning that effectively removes specific data influences while preserving model utility.
Abstract: We propose a novel machine unlearning method based on self-distillation that enables selective removal of specific training data from large language models. Our approach uses an auxiliary model trained solely on the data to be forgotten to generate logits-based penalties during fine-tuning, guiding the student model to reduce confidence on memorized tokens from the forgotten subset. This dynamic penalty outperforms fixed masking strategies by precisely targeting residual knowledge while preserving performance on retained data. We validate our method on WikiText-2, showing increased perplexity and reduced top-k accuracy on the forgotten data, indicating effective unlearning. At the same time, the model maintains strong generalization on the remaining data, minimizing unintended forgetting. These results demonstrate that logits-guided self-distillation is a promising direction for efficient and scalable machine unlearning.
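The abstract does not specify the exact form of the logits-based penalty, so the following is only a minimal PyTorch sketch of one plausible instantiation: a standard language-modeling loss on retained tokens plus a term that lowers the student's log-probability on forget-set tokens, weighted by the auxiliary forget-set model's confidence. All names (`unlearning_loss`, `forget_logits`, `retain_mask`, `lam`) and the specific weighting are illustrative assumptions, not the paper's published formulation.

```python
import torch
import torch.nn.functional as F


def unlearning_loss(student_logits, forget_logits, labels, retain_mask, lam=1.0):
    """Hypothetical logits-guided unlearning objective (sketch, not the paper's exact loss).

    student_logits: (B, T, V) logits from the model being fine-tuned
    forget_logits:  (B, T, V) logits from the auxiliary model trained only on the forget set
    labels:         (B, T)    next-token targets
    retain_mask:    (B, T)    1.0 for retain-set tokens, 0.0 for forget-set tokens
    lam:            weight of the forgetting penalty
    """
    vocab = student_logits.size(-1)

    # Standard cross-entropy on retained tokens preserves utility on the remaining data.
    ce = F.cross_entropy(
        student_logits.reshape(-1, vocab), labels.reshape(-1), reduction="none"
    ).reshape(labels.shape)
    retain_loss = (ce * retain_mask).sum() / retain_mask.sum().clamp(min=1.0)

    # Forget penalty: where the auxiliary forget-set model is confident about a token,
    # push the student's probability for that token down (minimizing its log-probability).
    with torch.no_grad():
        forget_conf = (
            F.softmax(forget_logits, dim=-1)
            .gather(-1, labels.unsqueeze(-1))
            .squeeze(-1)
        )
    student_logp_true = (
        F.log_softmax(student_logits, dim=-1)
        .gather(-1, labels.unsqueeze(-1))
        .squeeze(-1)
    )

    forget_mask = 1.0 - retain_mask
    penalty = (forget_conf * student_logp_true * forget_mask).sum() / forget_mask.sum().clamp(min=1.0)

    return retain_loss + lam * penalty
```

In this sketch the penalty is "dynamic" in the sense the abstract describes: tokens the forget-set model memorizes strongly receive a larger push toward lower student confidence, while tokens it is unsure about are barely penalized, in contrast to a fixed mask that treats all forget-set tokens equally.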
Submission Number: 15