Keywords: cross-modal, security tensors, LVLM
Abstract: Large visual-language models (LVLMs) integrate aligned large language models (LLMs) with visual modules to process multimodal inputs. However, the safety mechanisms developed for text-based LLMs do not naturally generalize to visual modalities, leaving LVLMs vulnerable to harmful image inputs. To address this cross-modal safety gap, we introduce security tensors: trainable input vectors applied during inference through either the textual or visual modality. These tensors transfer textual safety alignment to visual processing without modifying any model parameters. They are optimized using a small curated dataset containing (i) malicious image-text pairs requiring rejection, (ii) contrastive benign pairs whose text is structurally similar to malicious queries, designed to encourage visually grounded decisions, and (iii) general benign samples preserving model functionality. Experimental results demonstrate that both textual and visual security tensors significantly enhance LVLMs' ability to reject diverse harmful visual inputs while maintaining near-original performance on benign tasks. Crucially, our internal analysis reveals that security tensors directly trigger the hidden "safety layers" of the language module when processing visual inputs, providing the first internal evidence that textual safety mechanisms can be cross-modally activated for visual inputs.
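As a rough illustration of the approach summarized above, the sketch below trains a textual security tensor as a small block of soft prompt vectors prepended to the input embeddings of a frozen LVLM. It is a minimal sketch under stated assumptions, not the authors' implementation: the HuggingFace-style interface (`get_input_embeddings()`, `inputs_embeds`, `labels`), the `batches` format, and hyperparameters such as `n_vectors`, `lr`, and `steps` are illustrative placeholders. Target labels would be refusal text for malicious image-text pairs and ordinary answers for the contrastive and general benign samples.

```python
# Minimal sketch: optimize a trainable "security tensor" (a block of soft
# prompt vectors) prepended to the prompt embeddings of a frozen LVLM.
# Assumptions (not from the paper): a HuggingFace-style causal model that
# accepts `inputs_embeds`, and a `batches` iterable yielding already-embedded
# multimodal inputs together with target token ids.

import torch

def train_security_tensor(model, batches, n_vectors=16, lr=1e-3, steps=1000):
    model.eval()
    for p in model.parameters():          # language and vision modules stay frozen
        p.requires_grad_(False)

    hidden = model.get_input_embeddings().embedding_dim
    security = torch.nn.Parameter(0.02 * torch.randn(n_vectors, hidden))
    optim = torch.optim.AdamW([security], lr=lr)

    for _, (inputs_embeds, labels) in zip(range(steps), batches):
        bsz = inputs_embeds.size(0)
        # Prepend the security tensor to every sequence in the batch.
        prefix = security.unsqueeze(0).expand(bsz, -1, -1)
        embeds = torch.cat([prefix, inputs_embeds], dim=1)
        # Exclude the prefix positions from the loss via the -100 ignore index.
        pad = torch.full((bsz, n_vectors), -100,
                         dtype=labels.dtype, device=labels.device)
        loss = model(inputs_embeds=embeds,
                     labels=torch.cat([pad, labels], dim=1)).loss

        optim.zero_grad()
        loss.backward()
        optim.step()

    return security.detach()              # reused verbatim at inference time
```

At inference, the returned tensor would simply be prepended to the embedded user input in the same way, with no change to the model weights.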
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 7260