Toward Robust Unlearning for LLMs

ICLR 2024 Workshop SeT LLM Submission 114 Authors

Published: 04 Mar 2024; Last Modified: 06 May 2024. SeT LLM @ ICLR 2024. License: CC BY 4.0
Keywords: robust unlearning, machine unlearning, large language models, ai safety, alignment, adversarial robustness
TL;DR: We introduce a framework for robust unlearning in LLMs and several methods that achieve state-of-the-art unlearning results.
Abstract: Recent rapid advances in AI enabled by large language models (LLMs) have raised widespread concerns regarding their potential for malicious use. While traditional open-source software has long-established mechanisms for combating such adversarial behavior, systems built on large neural networks are nontrivial to interpret, let alone intervene on, for safe use. Various alignment methods have been proposed to steer model responses toward a desired output distribution. However, these techniques are superficial and can be undone entirely with supervised fine-tuning. These vulnerabilities necessitate new approaches such as machine unlearning, in which the underlying representations of the target concepts are corrupted or forgotten. We introduce state-of-the-art methods for robustly unlearning desired concepts from LLMs, such that performance cannot be recovered by white-box fine-tuning. We demonstrate our results on the MMLU benchmark, showing that we can decrease accuracy on a forget set of concepts to chance levels while maintaining accuracy on the retain set.
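To make the abstract's success criterion concrete, here is a minimal sketch (not the authors' code) of the two conditions it describes: forget-set accuracy near chance (0.25 for four-option MMLU questions) and retain-set accuracy preserved relative to the pre-unlearning baseline. The function names and the tolerance are illustrative assumptions.

```python
def accuracy(preds, golds):
    """Fraction of multiple-choice predictions matching the gold answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# MMLU questions have four answer options, so random guessing scores 0.25.
CHANCE = 1 / 4

def is_robustly_unlearned(forget_acc, retain_acc, baseline_retain_acc, tol=0.05):
    """Hypothetical check of the two conditions stated in the abstract:
    forget-set accuracy is within `tol` of chance, and retain-set accuracy
    stays within `tol` of the model's accuracy before unlearning."""
    near_chance = abs(forget_acc - CHANCE) <= tol
    retained = retain_acc >= baseline_retain_acc - tol
    return near_chance and retained
```

For example, a model scoring 0.26 on the forget set while keeping 0.60 (vs. a 0.62 baseline) on the retain set would pass, whereas one still scoring 0.50 on the forget set would not.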
Submission Number: 114