Keywords: jailbreak attack, large language model, machine unlearning, knowledge graph
Abstract: Large language models (LLMs) are vulnerable to jailbreak attacks that bypass safety measures and induce them to generate harmful content. There is a notable dearth of research on defense mechanisms against jailbreak attacks, especially attacks that leverage fine-tuning techniques on open-access LLMs. To bridge this gap, this paper proposes the Knowledge Graph Unlearning (KGUnL) framework, which removes harmful content from LLMs. Our empirical study demonstrates the effectiveness of the framework in defending LLMs against fine-tuning attacks.
Supplementary Material: zip
Submission Number: 117