In the past few years, Language Models (LMs) have shown par-human capabilities in several domains. Despite their practical applications and exceeding user consumption, they are susceptible to jailbreaks when malicious inputs exploit the LM's weaknesses, causing it to deviate from its intended behavior. Current defensive strategies either classify the input prompt as adversarial or prevent LMs from generating harmful outputs. The primary challenge is that the current defense techniques are built against known and established jailbreaking patterns while work poorly against novel attacks. In this research, we propose an end-to-end framework for generating novel attack patterns and demonstrate how the proposed defense approach can generalize over known and unknown attack patterns. Attack patterns are generated using a novel self-learning large language model (LLM)-based multi-agent system with closed loop feedback called ALMAS, which stands for Attack using LLM-based Multi-Agent Systems. We demonstrate that system-prompt attention from Small Language Models (SLMs) can be used to characterize adversarial prompts providing a novel explainable and cheaper defense approach called AttentionDefense. The proposed AttentionDefense is evaluated against existing jailbreak benchmark datasets as well as the novel jailbreaks generated using ALMAS. Ablation studies demonstrate that SLM-based AttentionDefense has equivalent or better jailbreak detection performance as compared to text embedding based classifiers and GPT-4 zero-shot detectors. Our research suggests that the attention mechanism is an integral component in understanding and explaining how LMs respond to malicious inputs that is not captured in the semantic meaning of text embeddings. Additionally, for practical purposes AttentionDefense is an ideal solution as it has the computation requirements of a small LM but the performance of a LLM detector.
Keywords: jailbreaks, agents, safeguards, latent representations
TL;DR: We present an agentic framework for generating jailbreaks, demonstrating that system prompt attention effectively classifies both known and novel attacks, where other common defense strategies fail.
Abstract:
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8346
Loading