Keywords: LLM Security, Jailbreak, Fine-tuning
TL;DR: Exploiting the overfitting of LLMs, we use only ten benign QA pairs to fine-tune and jailbreak them.
Abstract: Despite substantial efforts in safety alignment, recent research indicates that Large
Language Models (LLMs) remain highly susceptible to jailbreak attacks. Among
these attacks, fine-tuning-based attacks, which compromise LLMs’ safety alignment through
fine-tuning, stand out due to their stable jailbreak performance. In particular, a recent
study indicates that fine-tuning with as few as 10 harmful question-answer (QA)
pairs can lead to successful jailbreaking across various harmful questions. However,
such malicious fine-tuning attacks are readily detectable and hence thwarted by
moderation models. In this paper, we demonstrate that LLMs can be jailbroken
by fine-tuning with only 10 benign QA pairs; our attack exploits the increased
sensitivity of LLMs to fine-tuning data after being overfitted. Specifically, our
fine-tuning process starts with overfitting an LLM via fine-tuning with benign QA
pairs involving identical refusal answers. Further fine-tuning is then performed
with standard benign answers, causing the overfitted LLM to forget its refusal
behavior and thus provide compliant answers regardless of the harmfulness of a
question. We implement our attack on ten LLMs and compare it with five
existing baselines. Experiments demonstrate that our method achieves significant
advantages in both attack effectiveness and attack stealth. Our findings expose
previously unreported security vulnerabilities in current LLMs and provide a new
perspective on understanding how LLMs’ security is compromised, even with
benign fine-tuning. Our code is available at https://github.com/ZHIXINXIE/ten_benign.git.
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 20575