Merging Improves Self-Critique Against Jailbreak Attacks

Victor Gallego

Published: 03 Jul 2024, Last Modified: 14 Jul 2024

Venue: ICML 2024 FM-Wild Workshop Poster

License: CC BY 4.0

Keywords: jailbreaks, LLM, safety, synthetic data

TL;DR: Merging a base LLM with a critic LLM drastically improves its self-critique abilities, which helps defend against jailbreak attacks.

Abstract: The robustness of large language models (LLMs) against adversarial manipulations, such as jailbreak attacks, remains a significant challenge. In this work, we propose an approach that enhances the self-critique capability of the LLM and further fine-tunes it over sanitized synthetic data. We do this by adding an external critic model that can be merged with the original model, which bolsters its self-critique capabilities and improves the robustness of the LLM's responses to adversarial prompts. Our results demonstrate that combining merging and self-critique can significantly reduce the attack success rate of adversaries, offering a promising defense mechanism against jailbreak attacks. Code, data, and models are released at https://github.com/vicgalle/merging-self-critique-jailbreaks .
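
The abstract does not specify how the models are merged or how the critique is prompted, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes plain linear interpolation of weights (one common merging scheme, not necessarily the one used in this work), represents weights as toy dictionaries of floats, and treats `generate` as a hypothetical stand-in for any text-generation call. The function names, the `alpha` parameter, and the prompt wording are all illustrative.

```python
from typing import Callable, Dict, List

# Toy stand-in for model parameters: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def merge_linear(base: Weights, critic: Weights, alpha: float = 0.5) -> Weights:
    """Interpolate parameter-wise: (1 - alpha) * base + alpha * critic.

    Assumption: linear interpolation is used here only for illustration;
    the source does not state which merging method it relies on.
    """
    if base.keys() != critic.keys():
        raise ValueError("models must share the same parameter names")
    merged: Weights = {}
    for name in base:
        if len(base[name]) != len(critic[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1 - alpha) * b + alpha * c
            for b, c in zip(base[name], critic[name])
        ]
    return merged


def respond_with_self_critique(generate: Callable[[str], str], prompt: str) -> str:
    """Draft a response, critique it, then revise it (three model calls).

    `generate` is a hypothetical callable wrapping the (merged) model.
    The prompt templates below are illustrative, not taken from the source.
    """
    draft = generate(prompt)
    critique = generate(
        "Identify any harmful or unsafe content in the response.\n\n"
        f"Prompt: {prompt}\n\nResponse: {draft}"
    )
    revised = generate(
        "Rewrite the response so that it addresses the critique.\n\n"
        f"Prompt: {prompt}\n\nResponse: {draft}\n\nCritique: {critique}"
    )
    return revised


def attack_success_rate(outcomes: List[bool]) -> float:
    """Fraction of adversarial prompts whose attack succeeded."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)
```

In this sketch the merged weights would back the `generate` callable, so the same model both answers and critiques; a lower `attack_success_rate` over a set of jailbreak prompts would indicate a more robust defense.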

Submission Number: 120
