Knowledge Graph Unlearning to Defend Language Model Against Jailbreak Attack

Published: 19 Mar 2024 · Last Modified: 20 Jun 2024 · Tiny Papers @ ICLR 2024 · CC BY 4.0
Keywords: jailbreak attack, large language model, machine unlearning, knowledge graph
Abstract: Large language models (LLMs) are vulnerable to jailbreak attacks that bypass safety measures and induce them to generate harmful content. There is a notable dearth of research on defense mechanisms against jailbreak attacks, especially attacks that leverage fine-tuning on open-access LLMs. To bridge this gap, this paper proposes the Knowledge Graph Unlearning (KGUnL) framework to remove harmful content from LLMs. Our empirical study demonstrates the effectiveness of the framework in defending LLMs against fine-tuning attacks.
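The abstract does not spell out the unlearning mechanism, so as a point of reference, below is a minimal sketch of the generic gradient-ascent unlearning objective commonly used to remove content from a causal LM. It is an illustration of the broad technique, not the paper's KGUnL algorithm (which additionally involves knowledge-graph structure not described here); the model name and the forget-set examples are placeholder assumptions.

```python
# Minimal sketch of gradient-ascent unlearning on a causal LM.
# Illustrates the generic machine-unlearning idea, NOT the paper's
# KGUnL method. "gpt2" and harmful_texts are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for an open-access LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical "forget set" of harmful content to be unlearned.
harmful_texts = ["<harmful example to forget>"]

for text in harmful_texts:
    batch = tokenizer(text, return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"])
    # Ascend (rather than descend) the LM loss on the forget set,
    # making the model *less* likely to reproduce this content.
    loss = -out.loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice, methods of this kind pair the forget-set ascent term with a retain-set descent term so that general capabilities are preserved while the targeted content is removed.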
Supplementary Material: zip
Submission Number: 117