Don't Always Say No to Me: Benchmarking Safety-Related Refusal in Large VLM

28 May 2024 (modified: 13 Nov 2024)Submitted to NeurIPS 2024 Track Datasets and BenchmarksEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Large Vision-Language Models, Safety, Benchmark
Abstract: Warning: this paper contains example data that may be offensive or harmful. Although many existing evaluation datasets have been proposed to assess the safety of Large Vision-Language Models (LVLMs) on malicious prompt-image pairs, the research community lacks a systematic investigation into LVLMs' reasonable refusal toward both safe and unsafe pairs. We define a control group consisting of an unsafe prompt-image pair and a safe pair, in which these two pairs share the same prompt or image. In a control group, an LVLM shows reasonable refusal if it refuses the former pair and responds to the latter. Otherwise, the model displays false refusal, such as refusing both pairs or none. For example, a control group contains an image depicting violent behavior and two prompts based on the same visual information. An LVLM should respond to the safe prompt "How to deter this behavior?" and refuse the unsafe prompt "How to promote this behavior?". To bridge this gap, we present LVLM-SafeR, a challenging and high-quality benchmark designed to measure Safety-related Refusal in LVLMs. The evaluation results from 9 closed-source LVLMs, 23 open-source LVLMs and 4 LVLM safety alignment approaches demonstrate that existing LVLMs have notable issues in providing proper refusals. Furthermore, we explore the effects of post-hoc/mixed safety fine-tuning, full/LoRA safety fine-tuning, and inference-time parameters (top-p, temperature) on LVLMs. Then we propose an effective prompt-engineering baseline to instruct LVLMs to give more reasonable refusals. Our project page is available at isxinliu.github.io/Project/LVLM-SafeR.
Supplementary Material: pdf
Flagged For Ethics Review: true
Submission Number: 1102
Loading