SafeVision: Efficient Image Guardrail with Robust Policy Adherence and Explainability

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: AI safety, Large language model, Multi-modality, Image moderation
Abstract:

As image generation models become increasingly prevalent, the need for efficient and transparent guardrails against unsafe content is more critical than ever. Traditional unsafe-image classifiers are limited to predefined categories, often misclassify content because they rely on purely feature-based learning rather than semantic reasoning, and struggle to adapt to emerging threats. The time and resources required to retrain on new harmful categories further hinder their ability to respond to evolving risks. To address these challenges, we propose SafeVision, a novel image guardrail system that integrates human-like understanding and reasoning with scalability. Within SafeVision, we design an effective data collection and generation pipeline, a policy-following training pipeline, and a customized loss function. In particular, we propose an efficient and diverse QA generation and training strategy to enhance the effectiveness of training. SafeVision follows given safety policies at inference time, allowing it to guard against new risk categories without expensive retraining, produce accurate predictions of risky content, and provide precise explanations. SafeVision operates in two modes: 1) a rapid classification mode, and 2) a comprehension mode that provides both classification and human-readable explanations. In addition, given the limitations of existing unsafe-image benchmarks, which contain only binary labels or a limited set of categories, we provide VisionHARM-500K, a high-quality unsafe-image benchmark comprising over 500K images that cover a wide array of risk categories. This dataset significantly broadens the scope and depth of unsafe-image benchmarks. Through comprehensive experiments, we show that SafeVision achieves state-of-the-art performance in both efficiency and accuracy, with an accuracy of 91.77% on the VisionHARM-500K test set (17.77% higher than GPT-4o) and an inference time of 0.0979 seconds per image (over 50 times faster than GPT-4o). SafeVision sets a new standard for comprehensive, policy-following, and explainable image guardrail models, delivering state-of-the-art performance while aligning with human reasoning and enabling scalable adaptation to emerging threats.
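As a rough illustration of the policy-following, two-mode design described in the abstract, the sketch below shows how a safety policy might be rendered into prompts for a rapid classification pass and a comprehension pass. This is a minimal sketch under stated assumptions: the policy fields, category names, and function names (SafetyPolicy, build_prompt) are illustrative inventions for this example, not the paper's released interface.

# Hypothetical sketch: rendering safety policies into prompts for the two
# operating modes of a policy-following image guardrail (illustrative only).

from dataclasses import dataclass


@dataclass
class SafetyPolicy:
    category: str    # risk category name, e.g. "weapons" (example value, not from the paper)
    definition: str  # what counts as a violation under this category
    exceptions: str  # content explicitly allowed despite matching the definition


def build_prompt(policies: list[SafetyPolicy], mode: str) -> str:
    """Render the given policies into a guardrail prompt for one of two modes."""
    rules = "\n".join(
        f"- {p.category}: {p.definition} (exceptions: {p.exceptions})"
        for p in policies
    )
    if mode == "classification":
        # Rapid classification mode: request only a category label, keeping
        # the generated output short so inference stays fast.
        task = "Return only the violated category name, or 'safe'."
    else:
        # Comprehension mode: request the label plus a human-readable
        # explanation grounded in the policy text.
        task = ("Return the violated category (or 'safe') and explain which "
                "policy clause applies to the image.")
    return f"You are an image guardrail. Follow these policies:\n{rules}\n\n{task}"


if __name__ == "__main__":
    policies = [
        SafetyPolicy(
            category="weapons",
            definition="images depicting weapons used to threaten or harm people",
            exceptions="historical or educational contexts",
        ),
    ]
    print(build_prompt(policies, "classification"))
    print(build_prompt(policies, "comprehension"))

Because new risk categories enter only through the policy list at inference time, this style of prompting matches the abstract's claim that the guardrail can adapt to emerging threats without retraining; the actual model, prompt wording, and output format used by SafeVision are not specified here.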

Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9077