Keywords: large multimodal model, VLM, safety alignment
TL;DR: ReSAM fine-tunes VLM embeddings to separate safe and unsafe/pseudo-benign inputs, achieving substantial safety gains without external labels.
Abstract: We study the problem of Pseudo-Benign Failures in vision-language models (VLMs): multimodal inputs that appear harmless but elicit dangerous or policy-violating responses. Our analysis shows that these failures arise from a representational misalignment: in the model's internal embedding space, a distributional gap separates pseudo-benign inputs from the unsafe inputs located in the refusal region, so pseudo-benign queries fall outside the model's safety margin. We introduce Representation-Level Safety Margin Alignment (ReSAM), a lightweight representation-space alignment method that (i) computes a direction vector separating refusal from non-refusal representations, (ii) quantifies refusal behavior by projecting input embeddings onto this direction, and (iii) optimizes a safety-margin loss that pushes unsafe and pseudo-benign queries above a learned margin while pulling safe examples below it. ReSAM introduces a new paradigm for multimodal safety alignment: it requires no manual annotations, instead deriving supervisory signals directly from the model's own representation space. Despite this minimal supervision, ReSAM achieves a 68% improvement in safety over strong baselines, and incorporating as few as five pseudo-benign queries during training raises safety to 94.6%. Beyond these empirical gains, our analysis reveals that safety gradients concentrate in a low-rank subspace, suggesting that multimodal safety is governed by an intrinsic structure that can be systematically identified and controlled.
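The abstract does not specify the exact form of the direction vector or the margin loss; the following is a minimal Python/PyTorch sketch of steps (i)-(iii), assuming a mean-difference direction, a symmetric hinge-style margin, and pooled per-input embeddings. All function names, the margin value, and the toy data are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def refusal_direction(refusal_embs: torch.Tensor, non_refusal_embs: torch.Tensor) -> torch.Tensor:
    """Unit direction from non-refusal toward refusal representations.

    Assumed construction: difference of class means (the paper's exact
    construction may differ).
    """
    d = refusal_embs.mean(dim=0) - non_refusal_embs.mean(dim=0)
    return d / d.norm()


def refusal_score(embs: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Scalar projection of each input embedding onto the refusal direction."""
    return embs @ direction


def safety_margin_loss(
    unsafe_embs: torch.Tensor,
    safe_embs: torch.Tensor,
    direction: torch.Tensor,
    margin: float = 1.0,
) -> torch.Tensor:
    """Hinge-style margin loss: push unsafe / pseudo-benign projections above
    +margin and pull safe projections below -margin (assumed symmetric form)."""
    unsafe_proj = refusal_score(unsafe_embs, direction)
    safe_proj = refusal_score(safe_embs, direction)
    loss_unsafe = F.relu(margin - unsafe_proj).mean()  # penalized when projection < +margin
    loss_safe = F.relu(safe_proj + margin).mean()      # penalized when projection > -margin
    return loss_unsafe + loss_safe


# Toy usage: random tensors stand in for pooled VLM hidden states.
torch.manual_seed(0)
dim = 16
refusal_embs = torch.randn(32, dim) + 1.0      # e.g. embeddings of refused prompts
non_refusal_embs = torch.randn(32, dim) - 1.0  # e.g. embeddings of answered prompts
direction = refusal_direction(refusal_embs, non_refusal_embs)
loss = safety_margin_loss(refusal_embs, non_refusal_embs, direction)
print(float(loss))
```

In this sketch the loss would be backpropagated into the embedding model (or a lightweight adapter) so that unsafe and pseudo-benign inputs move toward the refusal side of the margin while safe inputs move away from it.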
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17533