Abstract: LLM unlearning plays a crucial role in removing sensitive information from language models to mitigate potential misuse.
However, previous approaches often treat nonsensical responses or template-based refusals (e.g., ``Sorry, I cannot answer.'') as the unlearning target; such outputs signal deliberate information suppression, leaving the model even more vulnerable to attacks and jailbreaks.
Moreover, most methods rely on auxiliary models or retain datasets, which adds complexity to the unlearning process.
To address these challenges, we propose MEOW, a streamlined and stealthy unlearning method that requires neither auxiliary models nor retain data, and that avoids information leakage through its use of inverted facts.
These inverted facts are generated by an offline LLM and serve as fine-tuning labels. In addition, we introduce MEMO, a new metric that quantifies how strongly the model has memorized a sequence, which we use to select optimal fine-tuning targets.
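The abstract does not fix MEMO's exact form; the following is a minimal sketch of one plausible memorization score, in which the model is prompted with a prefix of a candidate sequence and its greedy continuation is compared against the held-out suffix. The function names, the prefix split, and the LCS-based similarity are illustrative assumptions, not the paper's definition:

```python
import torch

def lcs_ratio(pred, ref):
    """Longest-common-subsequence length of two token lists, normalized
    by the reference length (1.0 = the suffix is reproduced verbatim)."""
    m, n = len(pred), len(ref)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if pred[i] == ref[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / max(n, 1)

def memorization_score(model, tokenizer, text, prefix_frac=0.5):
    """Prompt with a prefix of `text` and score how closely the greedy
    continuation matches the true suffix. High scores mark sequences the
    model has memorized, i.e., candidate fine-tuning targets."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    split = int(len(ids) * prefix_frac)
    prefix, suffix = ids[:split], ids[split:]
    with torch.no_grad():
        out = model.generate(prefix.unsqueeze(0), do_sample=False,
                             max_new_tokens=len(suffix))
    return lcs_ratio(out[0, split:].tolist(), suffix.tolist())
```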
The use of inverted facts not only keeps the unlearning covert but also ensures that sensitive information is effectively forgotten without revealing the target data.
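As a reading aid, this recipe reduces to standard supervised fine-tuning in which each forget-set question is paired with an LLM-generated inverted fact rather than a refusal. A minimal PyTorch/Transformers sketch follows; the function name, data format, prompt-masking heuristic, and hyperparameters are illustrative assumptions:

```python
import torch

def unlearn_with_inverted_facts(model, tokenizer, pairs, lr=1e-5, epochs=1):
    """Fine-tune on (question, inverted_answer) pairs only -- no auxiliary
    model, no retain set -- so the sensitive fact is overwritten by a
    plausible alternative instead of a telltale refusal."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for question, inverted_answer in pairs:
            enc = tokenizer(question + " " + inverted_answer,
                            return_tensors="pt", truncation=True)
            labels = enc.input_ids.clone()
            # Supervise only the inverted-fact tokens; the question tokens
            # are masked out (approximate prompt length, fine for a sketch).
            q_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
            labels[:, :q_len] = -100
            loss = model(**enc, labels=labels).loss
            loss.backward()
            opt.step()
            opt.zero_grad()
```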
Evaluated on the ToFU knowledge unlearning dataset with Llama2-7B-Chat and Phi-1.5, MEOW outperforms baselines in forget quality while preserving model utility. MEOW also maintains strong performance across NLU and NLG tasks and shows superior resilience to attacks, as validated with the Min-K\% membership inference method.
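For reference, the Min-K\% Prob test scores a candidate string by the average log-likelihood of its k\% least-probable tokens, so a successfully unlearned text should score like unseen data. A minimal sketch against a Hugging Face causal LM, where the function name and the k=0.2 default are our own illustrative choices:

```python
import torch

def min_k_percent_score(model, tokenizer, text, k=0.2):
    """Min-K% Prob: average log-likelihood of the k% least-probable
    tokens; higher scores suggest the text was seen in training, so
    lower scores after unlearning indicate successful forgetting."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probability the model assigns to each actual next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(
        1, input_ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    # Average over only the k% lowest-probability tokens.
    n_keep = max(1, int(k * token_log_probs.numel()))
    lowest = torch.topk(token_log_probs, n_keep, largest=False).values
    return lowest.mean().item()
```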
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: security/privacy
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 3658