Utilizing LLM Robustness for LVLM Safety via Reducing the Pretraining Modality Gap

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Safety Alignment, Jailbreak Attack, Multimodal Safety, AI security
TL;DR: LVLMs show safety issues due to a modality gap between image and text embeddings that arises in pretraining. We propose a novel regularization method that reduces this gap and improves safety by up to 16.3% without hurting performance.
Abstract: Large Vision-Language Models (LVLMs) suffer significant safety degradation compared to their LLM backbones. Even blank or irrelevant images can trigger LVLMs to produce harmful responses to prompts that would otherwise be safely refused in text-only contexts. Existing methods address these safety vulnerabilities by modifying different components of LVLMs: robustifying the vision encoder, inference-time purification and steering, or safety fine-tuning. However, these methods either substantially increase training or inference time or significantly harm model performance. In this work, we address the problem from a different angle. We show that the size of the modality gap between text and image embeddings is strongly inversely correlated with LVLM safety. Crucially, this gap is introduced during pretraining and persists through fine-tuning. Motivated by this observation, we propose a novel regularization technique, \alg, that explicitly reduces the modality gap during pretraining. Empirically, we show that \alg\ effectively enhances the safety of various LVLMs, reducing unsafe rates by up to 16.3% while preserving their performance. Notably, \alg\ is lightweight and requires no safety data. It also stacks easily with existing defense mechanisms, further improving safety by up to 18.2%.
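The abstract does not specify how \alg\ measures or penalizes the modality gap. As a rough illustration only, the sketch below shows one common way such a gap could be quantified and added to a pretraining objective: the Euclidean distance between the centroids of L2-normalized image and text embeddings, scaled by a weight. The function names `modality_gap`, `regularized_loss`, and the weight `lam` are hypothetical and not taken from the paper.

```python
import numpy as np

def modality_gap(img_emb: np.ndarray, txt_emb: np.ndarray) -> float:
    """Distance between image and text embedding centroids on the unit sphere.

    img_emb, txt_emb: arrays of shape (n_samples, dim).
    """
    # Normalize each embedding to unit length (CLIP-style convention).
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # The "gap" is how far apart the two modality centroids sit.
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

def regularized_loss(task_loss: float, img_emb: np.ndarray,
                     txt_emb: np.ndarray, lam: float = 0.1) -> float:
    """Hypothetical total objective: pretraining loss plus a gap penalty."""
    return task_loss + lam * modality_gap(img_emb, txt_emb)
```

Minimizing the added term during pretraining pulls the two embedding distributions together, which is the direction of the intervention the abstract describes; the actual \alg\ formulation may differ.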
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 9867