NEMESIS: Jailbreaking LLMs with a Chain-of-Thought Approach

23 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: LLM, Jailbreaking, Chain-of-thought reasoning, Reinforcement learning, LLM security protocols, Adversarial attacks, Defense mechanisms, LlamaGuard, Multishot Jailbreaking, Fine Tuning
TL;DR: We demonstrate five LLM jailbreak methods built on chain-of-thought reasoning, exposing vulnerabilities in current safeguards; we propose reinforcement-learning-based mitigations and call for better fine-tuning and stronger protections.
Abstract: Large Language Models (LLMs) are increasingly being deployed across a wide range of applications, making robust security measures crucial. This paper explores multiple methods for jailbreaking these models and bypassing their security protocols. By examining five distinct approaches (Multishot Jailbreaking, the Mirror Dimension Approach, the Cipher Method, the "You are Answering the Wrong Question" Method, and the Textbook Jailbreaking Method), we highlight the vulnerabilities in current LLMs and emphasize the importance of fine-tuning and secure guardrails. Our study primarily employs chain-of-thought reasoning, which can be further enhanced through reinforcement learning techniques. Furthermore, we propose that our findings can serve as a benchmark for emerging security measures such as LlamaGuard, providing a comprehensive evaluation of LLM defenses. Our results demonstrate the effectiveness of these methods and suggest directions for future work in enhancing LLM security. This research underscores the ongoing challenge of balancing LLM capabilities with robust safeguards against potential misuse or manipulation.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2904