From Evaluation to Defense: Advancing Safety in Video Large Language Models

ICLR 2026 Conference Submission15243 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: Video Large Language Model, Safety of Multimodal Large Language Model, Safety Alignment, RLHF

Abstract: While the safety risks of image-based large language models (Image LLMs) have been extensively studied, their video-based counterparts (Video LLMs) remain critically under-examined. To study this problem systematically, we introduce VideoSafetyEval, the first large-scale, real-world benchmark for Video LLM safety, which comprises 11.4k video-query pairs and spans 19 principal risk categories. Using this benchmark, we reveal that integrating the video modality degrades safety performance by an average of 34.2%, exposing systemic risks in multimodal attack exploitation. To address this vulnerability, we propose VideoSafety-R1, a dual-stage framework that improves safety through three innovations: (1) the VideoSafetyThinking dataset, which contains 46k video-query–thinking response triplets; (2) Alarm Token-Guided Safety Fine-Tuning (AT-SFT), which injects learnable alarm tokens into visual and textual sequences, enabling explicit harm perception across modalities via multitask objectives; and (3) Safety-Guided GRPO, which enhances defensive reasoning through dynamic policy optimization with rule-based rewards derived from dual-modality verification. Together, these components shift safety alignment from harm perception to active reasoning. The framework achieves a 71.1% improvement on VSE-HH, and improves by 59.1%, 44.3%, and 15.0% on the image safety datasets MMBench, VLGuard, and FigStep, respectively. Our code is anonymously available at https://anonymous.4open.science/r/VSBr1-911E/README.md. Note: This paper contains harmful language and image examples, and reader discretion is recommended.

Primary Area: alignment, fairness, safety, privacy, and societal considerations

Submission Number: 15243
