Keywords: jailbreak attack, large language model, machine unlearning, knowledge graph
Abstract: Large language models (LLMs) are vulnerable to jailbreak attacks that bypass safety measures and induce them to generate harmful content. There is a notable dearth of research on defense mechanisms against jailbreak attacks, especially attacks that leverage fine-tuning techniques on open-access LLMs. To bridge this gap, this paper proposes the Knowledge Graph Unlearning (KGUnL) framework, which removes harmful content from LLMs. Our empirical study demonstrates the effectiveness of the framework in defending LLMs against fine-tuning attacks.
Supplementary Material: zip
Submission Number: 117