Semantic Coupling as a Failure Mode in Multimodal Safety Alignment
Abstract: The emergence of natively unified large multimodal models has made it possible for a single system to reason over prompts and generate tightly coupled image-text outputs. While this capability improves expressiveness, it also introduces a distinct class of safety failures: harmful content can be amplified when visual and textual components mutually reinforce the same unsafe meaning. This paper investigates semantic coupling as a source of cross-modal harmfulness gain in unified multimodal generation. We propose an evaluation suite that compares isolated modality risk with the risk of coordinated image-text responses, with particular focus on disinformation-oriented queries, ambiguous benign wrappers, and cases where harmful intent is realized only through multimodal composition. Our analysis shows that existing alignment methods often detect explicit unsafe content, but struggle when the model is guided through coherent reasoning paths that maintain logical consistency across modalities. Evaluations on representative large multimodal models demonstrate that safety failures are frequently tied to the model’s helpfulness, planning behavior, and tendency to preserve semantic coherence. The findings motivate logic-aware and composition-aware safety mechanisms that can reason about joint intent rather than relying on separate visual or textual moderation.