Answer When Needed, Forget When Not: Language Models Pretend to Forget via In-Context Knowledge Unlearning
Abstract: As large language models (LLMs) are applied across diverse domains, the ability to selectively unlearn specific information has become increasingly essential. For instance, LLMs are expected to provide certain confidential information to authorized internal users, such as employees or trusted partners, while withholding it from external users, including the general public and unauthorized entities.
In response to this challenge, we propose a novel method termed "in-context knowledge unlearning", which enables the model to selectively forget information at test time based on the query context.
Our method fine-tunes pre-trained LLMs so that they promptly unlearn the target knowledge specified in the context while preserving other knowledge.
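As a rough illustration of this setup, the sketch below constructs paired training examples in which the same kind of question is followed either by its true answer (retain) or by a fixed refusal (forget), and fine-tunes a causal LM with the loss computed only on the target tokens. The prompt template, the refusal string "I don't know.", the toy facts, and the model name are assumptions made for the example, not the paper's exact training format.

```python
# A minimal, illustrative sketch of fine-tuning for in-context knowledge unlearning.
# The prompt template, the refusal string, the toy facts, and the model name are
# assumptions for illustration, not the paper's exact training format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumed; any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

def build_example(question, answer, forget):
    """Pair a question with either its true answer (retain) or a refusal (forget)."""
    instruction = "Forget the requested fact." if forget else "Answer the question."
    target = " I don't know." if forget else " " + answer  # hypothetical refusal string
    prompt = f"{instruction}\nQuestion: {question}\nAnswer:"
    return prompt, target

def lm_loss(prompt, target):
    """Causal-LM loss computed only on the target tokens (prompt positions masked)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore the prompt positions in the loss
    return model(input_ids=input_ids, labels=labels).loss

# Toy batch: one fact to forget on request, one unrelated fact to retain.
examples = [
    build_example("Where was Marie Curie born?", "Warsaw", forget=True),
    build_example("What is the capital of France?", "Paris", forget=False),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
loss = sum(lm_loss(p, t) for p, t in examples) / len(examples)
loss.backward()
optimizer.step()
```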
Experiments on the TOFU, AGE, and RWKU datasets using Llama2-7B/13B and Mistral-7B models show that our method achieves up to 95% forget accuracy while retaining 80% of unrelated knowledge, significantly outperforming baselines in both in-domain and out-of-domain scenarios.
Further investigation of the model's internal behavior revealed that fine-tuned LLMs generate the correct predictions in the middle layers and maintain them up to the final layer, yet make the decision to forget only at the last layer, i.e., "LLMs pretend to forget".
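The layer-wise behavior described above can be inspected with a standard logit-lens style probe; the sketch below projects each layer's hidden state at the last position through the model's final norm and unembedding matrix to read off the layer-wise top-1 prediction. It assumes a Llama-style architecture (final RMSNorm at `model.model.norm`) and an illustrative prompt; it is not the paper's exact analysis code.

```python
# A logit-lens style probe (a sketch, not the paper's exact analysis code).
# It projects each layer's hidden state at the last position through the final
# norm and unembedding matrix to read off the layer-wise top-1 prediction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumed Llama-style model (final norm at model.model.norm)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

prompt = "Forget the requested fact.\nQuestion: Where was Marie Curie born?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; the remaining entries follow each layer.
for layer_idx, hidden in enumerate(out.hidden_states[1:], start=1):
    last = hidden[0, -1]                            # hidden state at the final position
    logits = model.lm_head(model.model.norm(last))  # Llama-style norm + unembedding
    top_token = tokenizer.decode(logits.argmax())
    print(f"layer {layer_idx:2d}: {top_token!r}")
```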
Our findings offer valuable insights into improving the robustness of unlearning mechanisms in LLMs, laying a foundation for future research in the field.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Machine unlearning, In-context unlearning, Right to be forgotten, Approximate data deletion
Languages Studied: English
Submission Number: 1095