Keywords: Large Language Models, Multimodal Large Language Models, Jailbreak Attacks, In-Context Learning
Abstract: Jailbreak attacks remain one of the most critical threats to the safe deployment of large language models (LLMs) and multimodal LLMs (MLLMs). Existing jailbreak methods face fundamental trade-offs: concealment-based approaches often sacrifice naturalness and interpretability, while optimization-based approaches tailor prompts to specific models, limiting transferability and incurring high query costs.
We present \emph{Camouflage Patching} (\emph{CamPatch}), a novel jailbreak framework that combines \emph{deep concealment} with \emph{instruction-driven reconstruction} while preserving naturalness---all within a single query. CamPatch exploits two pervasive properties of modern LLMs and MLLMs: (i) strong instruction-following capability, and (ii) a tendency to continue following individually benign reconstruction steps without re-evaluating global intent. CamPatch rewrites a harmful query into an innocuous, natural-sounding form and appends lightweight, rule-based cues for staged reconstruction, framed as an explicit but harmless transformation task. Once the model commits to these steps, it typically executes the reconstructed harmful request without triggering additional alignment checks.
Extensive black-box experiments on both open-source and commercial systems show that CamPatch sets a new state of the art, achieving an attack success rate (ASR) of up to 0.67 on Qwen-2-7B and 0.49 on Claude-3.5-Sonnet---substantially outperforming prior methods ($\leq 0.45$ and $<0.28$, respectively). CamPatch satisfies five key desiderata---effectiveness, transferability, efficiency, universality, and naturalness---revealing that even strongly aligned foundation models remain highly vulnerable to single-turn jailbreaks.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 23849