Reimagining Safety Alignment with An Image

Reimagining Safety Alignment with An Image

ACL ARR 2025 May Submission4922 Authors

20 May 2025 (modified: 03 Jul 2025)ACL ARR 2025 May SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Abstract: Large language models (LLMs) excel in diverse applications but face dual challenges: generating harmful content under jailbreak attacks and over-refusing benign queries due to rigid safety mechanisms. These issues severely affect the application of LLMs, especially in the medical and education fields. Existing approaches can be divided into three types: contrastive decoding, activation manipulation, and prompting strategies. However, all these approaches face challenges like inefficiency, fragility, or architectural constraints,ultimately failing to strike a balance between safety and usability. These problems are more obvious in multimodal large language models (MLLMs), especially in terms of heightened over-refusal in cross-modal tasks and new security risks arising from expanded attack surfaces. We propose Magic Image, an optimization-driven visual prompt framework that enhances security and reduces over-refusal at the same time. The Magic Image is optimized using gradients derived from harmful/benign training samples. Using the magic image can modify the model's original safety alignment, maintaining robust safety while reducing unnecessary denials. Experiments demonstrate its effectiveness in preserving model performance and improving safety-responsiveness balance across datasets, including unseen data, offering a practical solution for reliable MLLM deployment.

Paper Type: Long

Research Area: Language Modeling

Research Area Keywords: safety and alignment; transfer

Languages Studied: English

Submission Number: 4922

Loading