COIR: Chain-of-Intention Reasoning Elicits Defense in Multimodal Large Language Models

Published: 29 Sept 2025 (Last Modified: 25 Oct 2025)
Venue: NeurIPS 2025 - Reliable ML Workshop
License: CC BY 4.0
Keywords: Multimodal Large Language Model, Jailbreak Defense, Robustness of MLLM, Safety and Alignment of MLLM
TL;DR: We propose Chain-of-Intention Reasoning (COIR), a defense mechanism that enables more nuanced, context-aware responses through intent-aware harmfulness detection.
Abstract: Multimodal Large Language Models (MLLMs) inherit strong reasoning capabilities from LLMs but remain vulnerable to jailbreak attacks due to their reliance on LLM-based alignment. Existing defense methods primarily enhance robustness against jailbreak attacks via additional inference steps or surface-level content filtering, which limits their practicality. However, we empirically observe that MLLMs can inherently recognize harmful inputs and infer the true intent behind a query. Leveraging this capability, we propose Chain-of-Intention Reasoning (COIR), a defense mechanism that enables more nuanced, context-aware responses through intent-aware harmfulness detection. Our approach boosts defense performance while maintaining utility comparable to existing methods. These findings highlight MLLMs' ability to reason about underlying intent, improving robustness and reliability in multimodal jailbreak scenarios.
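
The abstract describes COIR only at a high level, so the following is a minimal sketch of what intent-aware harmfulness detection could look like in practice. The `MLLM` interface, the probe wording, and the verdict parsing are illustrative assumptions, not the paper's actual mechanism:

```python
"""Hypothetical sketch of intent-aware gating in the spirit of COIR.
The abstract specifies no prompts or interfaces, so everything below
(the MLLM protocol, probe wording, verdict parsing) is assumed."""

from typing import Protocol


class MLLM(Protocol):
    """Assumed interface to a multimodal LLM (hypothetical)."""

    def generate(self, image: bytes, prompt: str) -> str: ...


# Hypothetical probe asking the model to reason step by step about the
# intent behind a query -- a stand-in for chain-of-intention reasoning.
INTENT_PROBE = (
    "Before answering, reason step by step about the user's underlying "
    "intent given the image and the query. Conclude with a line of the "
    "form 'VERDICT: SAFE' or 'VERDICT: HARMFUL'.\n\nQuery: {query}"
)


def coir_style_response(model: MLLM, image: bytes, query: str) -> str:
    """Two-stage response: first elicit the model's own reading of the
    query's intent, then answer or refuse based on that verdict."""
    # Stage 1: intent-aware harmfulness detection by the model itself.
    analysis = model.generate(
        image=image, prompt=INTENT_PROBE.format(query=query)
    )

    # Stage 2: condition the final response on the inferred intent.
    if "VERDICT: HARMFUL" in analysis.upper():
        return "I can't help with that request."
    return model.generate(image=image, prompt=query)
```

The key design choice suggested by the abstract is that the harmfulness judgment comes from the MLLM's own reasoning about intent rather than from an external content filter or extra alignment model.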
Submission Number: 203