Keywords: Machine Unlearning, Large Language Models, Self-Distillation
TL;DR: We propose a teacher-student-forget distillation framework for language model unlearning that effectively removes specific data influences while preserving model utility.
Abstract: We propose a novel machine unlearning method based on self-distillation that enables selective removal of specific training data from large language models. Our approach uses an auxiliary model, trained solely on the data to be forgotten, to generate logits-based penalties during fine-tuning, guiding the student model to reduce confidence on memorized tokens related to the forgotten subset. This dynamic penalty outperforms fixed masking strategies by precisely targeting residual knowledge while preserving performance on retained data. We validate our method on WikiText-2, showing increased perplexity and reduced top-k accuracy on the forgotten data, indicating effective unlearning. At the same time, the model maintains strong generalization on the remaining dataset, minimizing unintended forgetting. These results demonstrate that logits-guided self-distillation is a promising direction for efficient and scalable machine unlearning.
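To make the mechanism concrete, below is a minimal PyTorch sketch of a logits-guided forgetting penalty of the kind the abstract describes: an auxiliary model trained only on the forget subset supplies per-token logits, and the student is penalized for keeping probability mass on tokens that model predicts confidently. The function names, the exact penalty form, and the weighting hyperparameter are illustrative assumptions, not the paper's published loss.

```python
import torch
import torch.nn.functional as F


def forget_penalty(student_logits: torch.Tensor,
                   forget_logits: torch.Tensor) -> torch.Tensor:
    """Penalize the student's confidence on tokens that the auxiliary
    forget model (trained solely on the forget subset) predicts confidently.

    Both tensors have shape (batch, seq_len, vocab_size).
    """
    with torch.no_grad():
        # The forget model's distribution marks memorized forget-set tokens.
        forget_probs = F.softmax(forget_logits, dim=-1)

    student_logprobs = F.log_softmax(student_logits, dim=-1)

    # Expected student log-probability under the forget distribution.
    # Minimizing this drives the student's probability mass away from
    # tokens the forget model is confident about: a dynamic, per-token
    # penalty rather than a fixed mask.
    return (forget_probs * student_logprobs).sum(dim=-1).mean()


def total_loss(student_logits_retain: torch.Tensor,
               labels_retain: torch.Tensor,
               student_logits_forget: torch.Tensor,
               forget_logits: torch.Tensor,
               lambda_forget: float = 1.0) -> torch.Tensor:
    """Combine the usual next-token loss on retained data (to preserve
    utility) with the forgetting penalty computed on forget-set batches."""
    vocab = student_logits_retain.size(-1)
    utility = F.cross_entropy(student_logits_retain.view(-1, vocab),
                              labels_retain.view(-1),
                              ignore_index=-100)
    return utility + lambda_forget * forget_penalty(student_logits_forget,
                                                    forget_logits)
```

In this sketch, increasing `lambda_forget` trades retained-data utility for stronger suppression of the forget subset; the paper's reported perplexity and top-k accuracy metrics would correspond to evaluating the fine-tuned student separately on the forgotten and retained splits.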
Submission Number: 15