Abstract: As AI models are trained on ever-expanding datasets, the ability to remove the influence of specific data from trained models has become essential for privacy protection and regulatory compliance. Unlearning addresses this challenge by selectively removing parametric knowledge from a trained model without retraining from scratch, which is critical for resource-intensive models such as Large Language Models (LLMs). However, existing LLM unlearning methods are largely heuristic, lack formal guarantees, and often degrade model performance by removing more information than necessary when attempting to "forget" specific data. We bridge the gap between rigorous unlearning theory and LLM practice by introducing Forgetting-MarI, an LLM unlearning framework that provably removes only the additional (marginal) information contributed by the data to be unlearned, while preserving the information supported by the data to be retained. By penalizing marginal information, our method yields an explicit upper bound on the unlearn dataset’s residual influence in the trained model, providing provable undetectability. We empirically validate the framework on medium-sized LLMs (GPT-2-Large and Llama variants) across both one-time and continual unlearning settings. Forgetting-MarI achieves effective unlearning while better preserving retain-set and general model utility than existing baselines, and its empirical behavior is consistent with the theoretical undetectability guarantee. These results identify marginal-information regularization as a principled and practical route toward more verifiable and controllable LLM unlearning.