LlavaGuard: An Open VLM-based Framework for Safeguarding Vision Datasets and Models

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We introduce an open framework for building vision safeguard models.
Abstract: This paper introduces LlavaGuard, a suite of VLM-based vision safeguards that address the critical need for reliable tools in the era of large-scale data and models. To this end, we establish a novel open framework comprising a customizable safety taxonomy, data preprocessing, augmentation, and training setup. To teach safety to a VLM safeguard, we further create a multimodal safety dataset with high-quality human expert annotations, where each image is labeled with a safety rating, category, and rationale. We also employ advanced augmentations to support context-specific assessments. The resulting LlavaGuard models, ranging from 0.5B to 7B parameters, serve as versatile tools for evaluating the safety compliance of visual content against flexible policies. In comprehensive experiments, LlavaGuard outperforms both state-of-the-art safeguards and VLMs in accuracy and in flexibly handling different policies. Additionally, we demonstrate LlavaGuard's performance in two real-world applications: large-scale dataset annotation and moderation of text-to-image models. We make our entire framework, including the dataset, model weights, and training code, publicly available at https://ml-research.github.io/human-centered-genai/projects/llavaguard.
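The abstract describes structured safety assessments in which each image receives a safety rating, a safety category, and a rationale. As a purely illustrative sketch (field names and category labels below are assumptions, not taken from the paper's released schema), such a record could be represented as follows:

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class SafetyAssessment:
    """Illustrative structure for a per-image safety assessment.

    Field names and value sets are assumptions for illustration;
    the paper's actual schema is defined in the released dataset and code.
    """
    rating: str      # e.g. "Safe" or "Unsafe" under the supplied policy
    category: str    # taxonomy category the content falls under
    rationale: str   # short explanation grounding the decision


# Example usage: serialize one assessment as JSON, as a safeguard model
# might emit it for downstream dataset annotation.
example = SafetyAssessment(
    rating="Unsafe",
    category="Weapons or Substance Abuse",  # hypothetical category label
    rationale="The image prominently depicts a firearm aimed at a person.",
)
print(json.dumps(asdict(example), indent=2))
```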
Lay Summary: We introduce LlavaGuard, a new family of vision-language safety checkers built for the challenges of today's massive image collections and AI models. LlavaGuard evaluates images against any safety policy you provide and returns a detailed assessment for each image: a safety rating, a safety category, and a rationale. Our open framework and end-to-end pipeline make it easy for anyone to build and customize their own safety models. To train these models, we created a multimodal dataset with high-quality annotations from human experts, where every image includes a safety rating, category, and explanation. The resulting LlavaGuard models, ranging from 0.5 to 7 billion parameters, can flexibly determine whether visual content meets your chosen guidelines. Through extensive experiments, we show that LlavaGuard outperforms existing safety filters and vision-language models, both in accuracy and in its ability to adapt to different safety policies. We demonstrate its effectiveness in real-world scenarios, such as labeling large image datasets and moderating content produced by text-to-image generators. All of our code, data, and model weights are freely available at https://ml-research.github.io/human-centered-genai/projects/llavaguard.
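To make the policy-conditioned evaluation workflow concrete, here is a minimal sketch of how one might query a LlavaGuard-style checkpoint with Hugging Face transformers. The model identifier, policy text, and prompt template are assumptions for illustration only; the authoritative loading and prompting instructions are on the project page linked above.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumed checkpoint name for illustration; see the project page for the
# actual released weights and the recommended policy/prompt format.
MODEL_ID = "AIML-TUDA/LlavaGuard-7B"

# A toy, user-defined safety policy; the framework is designed to assess
# images against whatever policy text is supplied at inference time.
POLICY = (
    "Assess the image against the following policy. "
    "Flag depictions of violence, weapons, or explicit content. "
    "Respond with a safety rating, a category, and a rationale."
)

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")
# Generic LLaVA-style chat prompt; the exact template LlavaGuard expects may differ.
prompt = f"USER: <image>\n{POLICY} ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
```

The decoded output would then be parsed into a structured assessment (rating, category, rationale) for dataset annotation or for moderating text-to-image outputs, as described above.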
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://ml-research.github.io/human-centered-genai/projects/llavaguard
Primary Area: Social Aspects->Safety
Keywords: Safety, Safeguarding, VLM, Dataset curation, Model safeguarding
Submission Number: 6669