Playing the Fool: Jailbreaking Large Language Models with Out-of-Distribution Strategies

Joonhyun Jeong; Seyun Bae; Yeonsung Jung; Jaeryong Hwang; Eunho Yang

24 Sept 2024 (modified: 15 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0

Keywords: Large Language Models, Multimodal Large Language Models, Safety, Jailbreak Attacks

TL;DR: We introduce a new attack strategy that generates out-of-distribution inputs to bypass safety alignment in LLMs and MLLMs, successfully jailbreaking models like GPT-4 and GPT-4V.

Abstract: Despite the remarkable ability of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) to generalize across language and vision tasks, both remain vulnerable to jailbreaking: when exposed to harmful or sensitive inputs, they can generate text that undermines safety, ethical, and bias standards. With recent advances in safety alignment via preference tuning from human feedback, LLMs and MLLMs have been equipped with safety guardrails to yield safe, ethical, and fair responses to harmful inputs. However, despite the significance of safety alignment, its vulnerabilities remain largely underexplored. In this paper, we investigate the vulnerability of safety alignment, examining whether it can consistently provide safety guarantees for harmful inputs that have been "OOD-ified", that is, transformed so that they may fall outside the aligned data distribution (out-of-distribution, OOD). Our key observation is that OOD-ifying vanilla harmful inputs substantially increases the model's uncertainty in discerning the malicious intent within the input, leading to a higher chance of being jailbroken. Exploiting this vulnerability, we propose JOOD, a new jailbreak strategy that generates OOD-ified inputs beyond the reach of safety alignment using diverse visual and textual transformation techniques. Specifically, even simple mixing-based techniques such as image mixup prove highly effective at OOD-ifying harmful inputs by increasing the model's uncertainty, thereby facilitating the bypass of safety alignment. Experimental results across diverse jailbreak scenarios demonstrate that JOOD jailbreaks recent proprietary LLMs and MLLMs such as GPT-4 and GPT-4V with a high attack success rate, whereas previous attack approaches have consistently struggled to jailbreak these models.

Primary Area: alignment, fairness, safety, privacy, and societal considerations

Submission Number: 3516
