Keywords: Large language models, Machine unlearning, Security, Privacy
Abstract: In recent years, large language models (LLMs) have achieved remarkable advancements. However, LLMs can inadvertently memorize sensitive or copyrighted content, raising privacy and legal concerns. Because retraining from scratch is prohibitively expensive, recent research has introduced a series of promising machine unlearning techniques, collectively referred to as LLM unlearning, to selectively remove specific content from LLMs. Yet, as a new paradigm, LLM unlearning exposes additional interaction surfaces that adversaries can exploit, introducing emerging security threats against LLMs. Existing literature lacks a systematic understanding and comprehensive evaluation of unlearning attacks and their defenses in the context of LLMs. To bridge this gap, we introduce the Language Unlearning Security Benchmark (LUSB), the first comprehensive framework designed to formalize, evaluate, and benchmark unlearning attacks and defenses against LLMs. Using LUSB, we benchmark 16 types of unlearning attack/defense methods across 13 LLM architectures, 9 LLM unlearning methods, and 12 task datasets. Our benchmark results reveal that unlearning attacks significantly undermine the security of LLMs, even in the presence of traditional LLM security defenses. Notably, unlearning attacks can not only amplify adversarial vulnerabilities (e.g., increased susceptibility to jailbreak attacks) but also be exploited to gradually activate traditional poisoning or backdoor behaviors in LLMs. Further, our results underscore the limited effectiveness of existing defense strategies, emphasizing the urgent need for more advanced approaches to LLM unlearning security. We provide our benchmark in the supplementary material to facilitate further research in this area.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 13906