Keywords: LLM, Jailbreaking, Chain-of-Thought Reasoning, Reinforcement Learning, LLM Security Protocols, Adversarial Attacks, Defense Mechanisms, LlamaGuard, Multishot Jailbreaking, Fine-Tuning
TL;DR: Study demonstrates 5 LLM jailbreak methods, exposing vulnerabilities in current safeguards. Uses chain-of-thought reasoning; proposes reinforcement learning fixes. Urges better fine-tuning and stronger guardrails.
Abstract: Large Language Models (LLMs) are increasingly deployed across a wide range of applications, making robust security measures crucial. This paper explores multiple methods for jailbreaking these models and bypassing their security protocols. By examining five distinct approaches (Multishot Jailbreaking, the Mirror Dimension Approach, the Cipher Method, the "You are Answering the Wrong Question" Method, and the Textbook Jailbreaking Method), we highlight vulnerabilities in current LLMs and emphasize the importance of fine-tuning and secure guardrails. Our study primarily employs chain-of-thought reasoning, which can be further enhanced through reinforcement learning techniques. Furthermore, we propose that our findings can serve as a benchmark for evaluating emerging security measures such as LlamaGuard, providing a comprehensive assessment of LLM defenses. Our results demonstrate the effectiveness of these methods and suggest directions for future work in enhancing LLM security. This research underscores the ongoing challenge of balancing LLM capabilities with robust safeguards against potential misuse or manipulation.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2904