Evaluating Deep Unlearning in Large Language Models

Published: 11 Jun 2025, Last Modified: 11 Jun 2025 · MUGen @ ICML 2025 Poster · CC BY 4.0
Keywords: unlearning, LLM
Abstract: Machine unlearning has emerged as an important component in developing safe and trustworthy models. Prior work on unlearning in LLMs has mostly considered tasks where a large corpus of copyrighted material or some specific training data must be removed. In this work, we consider the task of unlearning a fact from LLMs, which can be challenging because related facts can be deduced from one another. We formally propose a new setting, deep unlearning, which considers fact unlearning under logical deductions between facts, and we design a metric, recall, to quantify the extent of deep unlearning. To systematically evaluate deep unlearning, we construct a synthetic dataset, Eval-DU, consisting of a synthetic knowledge base of family relationships and biographies together with a realistic set of logical rules connecting them. We experimentally investigate how well current LLM unlearning methods succeed at deep unlearning. Our findings reveal that, even when deep unlearning only a single fact, these methods either fail to unlearn with high recall or end up unlearning many irrelevant facts. Our results suggest that more targeted algorithms may have to be developed for fact unlearning in LLMs.
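To make the deduction problem concrete, here is a minimal sketch in Python, assuming a toy knowledge base and a single family-relation rule (the facts, the rule, and the deductive_closure helper are hypothetical illustrations, not the Eval-DU dataset or the paper's exact recall metric): deleting only the target fact is not enough if the remaining facts still imply it.

# Hypothetical illustration of deep unlearning: a fact counts as unlearned
# only if it can no longer be re-derived from the remaining facts.

def deductive_closure(facts):
    """Forward-chain one rule to a fixed point:
    parent(a, b) and parent(b, c) => grandparent(a, c)."""
    closure = set(facts)
    while True:
        derived = {
            ("grandparent", a, c)
            for (r1, a, b) in closure if r1 == "parent"
            for (r2, b2, c) in closure if r2 == "parent" and b2 == b
        }
        if derived <= closure:
            return closure
        closure |= derived

kb = {
    ("parent", "Ann", "Bob"),
    ("parent", "Bob", "Cam"),
    ("grandparent", "Ann", "Cam"),
}
target = ("grandparent", "Ann", "Cam")

# Shallow unlearning: delete only the target fact -- it is re-derived.
print(target in deductive_closure(kb - {target}))                            # True
# Deep unlearning: also remove a supporting fact, blocking the deduction.
print(target in deductive_closure(kb - {target, ("parent", "Ann", "Bob")}))  # False

Under assumptions like these, a recall-style score could compare the facts an unlearning method actually removes against a minimal set whose removal blocks every derivation of the target; the paper should be consulted for the precise definition of its recall metric.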
Submission Number: 19