MEOW: MEMOry Supervised LLM Unlearning Via Inverted Facts

ACL ARR 2024 December Submission 1877 Authors

16 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · License: CC BY 4.0
Abstract: LLMs can memorize sensitive information, raising concerns about potential misuse. LLM unlearning, a post-hoc method that removes such information from a trained LLM, offers a promising way to mitigate these risks. However, previous approaches face three key challenges: (1) Utility: successful unlearning often causes catastrophic collapse on unrelated tasks. (2) Efficiency: many methods either add auxiliary models of similar size, which slows down unlearning or inference, or require retain data that are difficult to obtain. (3) Robustness: even effective methods may still leak data via extraction techniques. To address these challenges, we propose MEOW, a simple yet effective gradient-descent-based unlearning method. Specifically, we use an offline LLM to generate a set of inverted facts. We then design a new metric, MEMO, to quantify memorization in LLMs. Finally, guided by the signals MEMO provides, we select the most appropriate set of inverted facts and fine-tune the model on them. We evaluate MEOW on the commonly used unlearning benchmark TOFU with Llama2-7B-Chat and Phi-1.5B, and test it on both NLU and NLG tasks. Results show that MEOW significantly improves forget quality without substantial loss in model utility: it exhibits no significant drop in NLG performance, and NLU performance even increases slightly.
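For readers who want a concrete picture of the pipeline the abstract outlines, here is a minimal sketch in Python. It is not the authors' implementation: the memorization proxy standing in for MEMO (token overlap between the model's greedy completion and the target answer), the selection rule (keeping the k least-memorized inverted facts), and every function and model name below are illustrative assumptions.

```python
# Illustrative sketch of the MEOW pipeline from the abstract. The memorization
# proxy (greedy-completion token overlap) and the selection rule (keep the k
# least-memorized candidates) are assumptions, not the paper's MEMO definition.
from dataclasses import dataclass

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


@dataclass
class InvertedFact:
    question: str
    answer: str  # counterfactual answer produced offline by another LLM


def memo_score(model, tokenizer, question: str, answer: str) -> float:
    """Placeholder memorization signal: fraction of answer tokens the model
    reproduces greedily when prompted with the question."""
    prompt = tokenizer(question, return_tensors="pt").input_ids
    target = tokenizer(answer, add_special_tokens=False).input_ids
    with torch.no_grad():
        out = model.generate(prompt, max_new_tokens=len(target), do_sample=False)
    generated = out[0, prompt.shape[1]:].tolist()
    return sum(g == t for g, t in zip(generated, target)) / max(len(target), 1)


def meow_unlearn(model, tokenizer, candidates, k=3, lr=2e-5, epochs=5):
    # Select k inverted facts guided by the MEMO-style signal
    # (the selection direction is an assumption).
    chosen = sorted(
        candidates,
        key=lambda f: memo_score(model, tokenizer, f.question, f.answer),
    )[:k]
    # Plain gradient-descent fine-tuning on the selected inverted facts,
    # so the model overwrites memorized answers with inverted ones.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for fact in chosen:
            batch = tokenizer(fact.question + " " + fact.answer, return_tensors="pt")
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model


if __name__ == "__main__":
    name = "microsoft/phi-1_5"  # one of the model families named in the abstract
    tok = AutoTokenizer.from_pretrained(name)
    lm = AutoModelForCausalLM.from_pretrained(name)
    facts = [InvertedFact("Where was Alice born?", "She was born in Lisbon.")]
    meow_unlearn(lm, tok, facts, k=1, epochs=1)
```

Even in this rough sketch, the design point the abstract emphasizes is visible: unlearning proceeds by ordinary gradient descent toward inverted facts, rather than by gradient ascent away from the forget data, which is where competing methods tend to suffer utility collapse.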
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: security/privacy
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 1877