Keywords: robust unlearning, machine unlearning, large language models, AI safety, alignment, adversarial robustness
TL;DR: We introduce a framework for robust unlearning in LLMs and several methods that achieve state-of-the-art unlearning results.
Abstract: Recent rapid advances in AI enabled by large language models (LLMs) have raised widespread concerns about their potential for malicious use. While traditional open-source software has long-established mechanisms for combating such adversarial behavior, systems built on large neural networks are nontrivial to interpret, let alone intervene on, for safe use. Various alignment methods have been proposed to steer model responses toward a desired output distribution. However, these techniques are superficial and can be undone entirely with supervised fine-tuning. These vulnerabilities motivate new approaches such as machine unlearning, in which the underlying representations of target concepts are corrupted or forgotten. We introduce state-of-the-art methods for robustly unlearning target concepts from LLMs, such that performance cannot be recovered by white-box fine-tuning. We demonstrate our results on the MMLU benchmark, showing that we can reduce accuracy on a forget set of concepts to chance levels while maintaining accuracy on the retain set.
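The evaluation described in the abstract, driving forget-set accuracy toward chance while preserving retain-set accuracy, can be illustrated with a minimal sketch. The snippet below is not the paper's released code; it assumes Hugging Face transformers and datasets, the cais/mmlu dataset layout (fields question, choices, answer), and a hypothetical split of MMLU subjects into forget and retain sets, with a placeholder model identifier.

# Minimal sketch (not the paper's method or code): score a model on MMLU-style
# multiple-choice questions, split into hypothetical forget vs. retain subjects.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-unlearned-model"          # placeholder model id
FORGET_SUBJECTS = ["college_biology"]        # hypothetical forget set
RETAIN_SUBJECTS = ["high_school_geography"]  # hypothetical retain set
CHOICE_LABELS = ["A", "B", "C", "D"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def choice_logprob(prompt: str, answer: str) -> float:
    """Sum of log-probabilities the model assigns to `answer` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    # Logits at positions prompt_len-1 .. end-1 predict the answer tokens.
    logits = model(input_ids).logits[0, prompt_ids.shape[1] - 1 : -1]
    logprobs = torch.log_softmax(logits, dim=-1)
    return logprobs.gather(1, answer_ids[0].unsqueeze(1)).sum().item()

def accuracy(subjects: list[str]) -> float:
    """Multiple-choice accuracy over the test split of the given MMLU subjects."""
    correct, total = 0, 0
    for subject in subjects:
        for ex in load_dataset("cais/mmlu", subject, split="test"):
            prompt = ex["question"] + "\n" + "\n".join(
                f"{label}. {choice}" for label, choice in zip(CHOICE_LABELS, ex["choices"])
            ) + "\nAnswer:"
            scores = [choice_logprob(prompt, f" {label}") for label in CHOICE_LABELS]
            correct += int(max(range(4), key=lambda i: scores[i]) == ex["answer"])
            total += 1
    return correct / total

# Robust unlearning targets forget accuracy near chance (0.25 for four choices)
# while retain accuracy stays close to the original model's.
print("forget accuracy:", accuracy(FORGET_SUBJECTS))
print("retain accuracy:", accuracy(RETAIN_SUBJECTS))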
Submission Number: 114