Abstract: As large language models (LLMs) are increasingly deployed in diverse applications, including chatbot assistants and code generation, aligning their behavior with safety and ethical standards has become paramount. However, jailbreak attacks, which exploit vulnerabilities to elicit unintended or harmful outputs, pose a significant threat to LLM safety. In this paper, we introduce Layer-AdvPatcher, a novel methodology that defends against jailbreak attacks by applying targeted unlearning to specific layers within LLMs using self-augmented datasets. Our insight is that certain layers tend to produce affirmative tokens when faced with harmful prompts. By identifying these layers and fine-tuning them, we expose their vulnerabilities; with this exposure, we then “unlearn” these issues, reducing the impact of affirmative tokens and thus minimizing jailbreak risks while keeping the model’s responses to safe queries intact. We conduct extensive experiments on two models, four benchmark datasets, and multiple state-of-the-art jailbreak attack benchmarks to demonstrate the efficacy of our approach. Results indicate that Layer-AdvPatcher reduces the harmfulness and attack success rate of jailbreak attacks without compromising utility on benign queries, and it outperforms several existing defense methods. Our code is publicly available at: https://anonymous.4open.science/r/LayerBugFixer-6B28.