Keywords: Omni-Modal Large Language Models (Omni-MLLMs), Jailbreak attacks on LLMs
Abstract: The rapid advancement of large language models (LLMs) has led to the emergence of Omni-Modal Large Language Models (Omni-MLLMs), which can process information across textual, visual, and auditory domains. Omni-MLLMs extend language understanding to vision and audio, enabling rich tri-modal interactions across real-world tasks. However, this flexibility broadens the jailbreaking attack surface: safety alignment must now withstand coordinated inputs across three modalities, where conventional defenses and optimization methods often fail. We frame jailbreaking in Omni-MLLMs as a tri-modal optimization problem and identify three core challenges: \textit{gradient shattering}, arising from non-differentiable audio discretization and vanishing cross-modal gradients; \textit{optimization instability} in query-only settings, where adversarial prompt search stagnates in highly non-convex, alignment-hardened landscapes; and \textit{tri-modal coordination}, where queries must be co-designed so that audio, visual, and textual cues reinforce rather than interfere with one another. To address these challenges, we propose AdvOmniAgent, the \textbf{first} jailbreak attack framework for Omni-MLLMs.
AdvOmniAgent combines three components: a two-stage optimization that performs semantic-level updates to multimodal queries, sidestepping gradient shattering; a feedback-driven adaptive update of the generator's parameters, which alleviates stagnation during optimization; and a unified update strategy that promotes cross-modal alignment and collaborative improvement. Extensive experiments on multiple Omni-MLLMs show that AdvOmniAgent outperforms strong baselines and achieves a higher average jailbreak success rate, and tri-modal ablation studies validate its collaborative optimization effect.
\textcolor{red}{\textit{CONTENT WARNING: THIS PAPER CONTAINS HARMFUL MODEL RESPONSES.}}
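Read as an algorithm, the abstract's pipeline might look like the minimal Python sketch below. This is our own reading of the description, not the paper's implementation: every name (`adv_omni_agent`, `query_model`, `judge_score`, `generate_candidates`) and the temperature-based adaptation rule are hypothetical placeholders for components the abstract only names.

```python
import random

def judge_score(response: str) -> float:
    # Placeholder for an external judge scoring compliance in [0, 1].
    return random.random()

def query_model(text, image, audio) -> str:
    # Placeholder for a single black-box query to the target Omni-MLLM.
    return "..."

def generate_candidates(text, image, audio, temperature):
    # Placeholder for Stage 1: semantic-level rewrites of all three
    # modalities, produced jointly so their cues reinforce one another.
    return [(text, image, audio)]

def adv_omni_agent(text, image, audio, steps=20, patience=3):
    best, best_score = (text, image, audio), 0.0
    temperature, stall = 0.7, 0
    for _ in range(steps):
        # Stage 1: propose coordinated tri-modal candidates at the semantic
        # level; no gradients through the audio discretizer are needed.
        candidates = generate_candidates(*best, temperature)
        # Stage 2: query the target once per candidate and keep the best.
        score, cand = max(
            ((judge_score(query_model(*c)), c) for c in candidates),
            key=lambda sc: sc[0],
        )
        if score > best_score:
            best, best_score, stall = cand, score, 0
        else:
            stall += 1
        # Feedback-driven adaptation: widen the generator's search when the
        # score stagnates, rather than following (unavailable) gradients.
        if stall >= patience:
            temperature = min(1.5, temperature * 1.3)
            stall = 0
    return best, best_score
```

The design point the abstract emphasizes is that all updates happen at the semantic level through queries, so the non-differentiable audio path never needs to be backpropagated through.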
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 20099