Keywords: Large Language Models, Multimodal Large Language Models, Jailbreak Attacks, In-Context Learning
Abstract: Jailbreak attacks remain one of the most critical threats to the safe deployment of large language models (LLMs) and multimodal LLMs (MLLMs). Existing jailbreak methods face fundamental trade-offs: concealment-based approaches often sacrifice naturalness and interpretability, while optimization-based approaches tailor prompts to specific models, limiting transferability and incurring high query costs.
We present \emph{Camouflage Patching} (\emph{CamPatch}), a novel jailbreak framework that combines \emph{deep concealment} with \emph{instruction-driven reconstruction} while preserving naturalness---all within a single query. CamPatch exploits two pervasive properties of modern LLMs and MLLMs: (i) strong instruction-following capability, and (ii) a tendency to continue following individually benign reconstruction steps without re-evaluating global intent. CamPatch rewrites a harmful query into an innocuous, natural-sounding form and appends lightweight, rule-based cues for staged reconstruction, framed as an explicit but harmless transformation task. Once the model commits to these steps, it typically executes the reconstructed harmful request without triggering additional alignment checks.
Extensive black-box experiments on both open-source and commercial systems show that CamPatch sets a new state of the art, achieving an attack success rate (ASR) of up to 0.67 on Qwen-2-7B and 0.49 on Claude-3.5-Sonnet---substantially outperforming prior methods ($\leq 0.45$ and $<0.28$, respectively). CamPatch satisfies five key desiderata---effectiveness, transferability, efficiency, universality, and naturalness---revealing that even strongly aligned foundation models remain highly vulnerable to single-turn jailbreaks.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 23849