Architectural Enhancement for Safety of Vision-Language Model

Published: 05 May 2026 · Last Modified: 11 May 2026 · 4th ALVR Poster · CC BY 4.0
Keywords: non-archival track
TL;DR: This paper proposes a novel modular framework that enhances VLM safety using a Visual Guard Module (VGM), enabling models to simultaneously perform safety-aware text generation and explicitly classify harmful visual content.
Abstract: Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), prior methods rely primarily on data-centric tuning, with limited architectural enhancements that intrinsically strengthen safety. To bridge this gap, we propose a novel modular framework for enhancing VLM safety with a Visual Guard Module (VGM), designed to assess the harmfulness of input images. This module endows VLMs with dual functionality: they not only learn to generate safer responses but can also provide an interpretable harmfulness classification that justifies their refusal decisions. A significant advantage of this approach is its modularity: the VGM is a plug-and-play component that integrates seamlessly with diverse pre-trained VLMs across various scales. Extensive experiments demonstrate that the resulting model, SafeLLaVA, outperforms state-of-the-art data-centric methods across multiple VLM safety benchmarks. Crucially, our architectural approach consistently outperforms both data-centric baselines and standalone guard models while strictly preserving conversational helpfulness, providing a robust and integrated solution for multimodal safety.
Submission Number: 28
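
The abstract describes the VGM only at a high level, so the sketch below is one illustrative way such a plug-and-play guard could be realized: a small classification head over pooled vision-encoder features, attached to a frozen pre-trained VLM and consulted before generation. All names here (VisualGuardModule, guarded_generate, vlm.encode_image, harm_threshold, the two-class head) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class VisualGuardModule(nn.Module):
    """Illustrative plug-and-play guard head that scores the harmfulness of image features.

    Hypothetical design: the paper does not specify the VGM architecture on this page.
    """

    def __init__(self, vision_dim: int, num_harm_classes: int = 2):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(vision_dim, vision_dim // 2),
            nn.GELU(),
            nn.Linear(vision_dim // 2, num_harm_classes),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # Pool patch-level vision features (B, N, D) into one image embedding (B, D),
        # then classify it (e.g., safe vs. harmful, or finer-grained harm categories).
        pooled = vision_features.mean(dim=1)
        return self.classifier(pooled)


def guarded_generate(vlm, vgm, image, prompt, harm_threshold=0.5):
    """Usage sketch: consult the guard head before letting the host VLM answer.

    `vlm.encode_image` and `vlm.generate` are placeholders for whatever hooks the
    host VLM exposes; batch size 1 is assumed for the scalar harm score.
    """
    vision_features = vlm.encode_image(image)        # (B, N, D) patch features
    harm_logits = vgm(vision_features)               # (B, num_harm_classes)
    harm_prob = harm_logits.softmax(dim=-1)[:, 1]    # probability of the "harmful" class
    if harm_prob.item() > harm_threshold:
        # Refuse, and surface the explicit harmfulness score as the justification.
        return {"refusal": True, "harm_score": harm_prob.item()}
    return {"refusal": False, "response": vlm.generate(image, prompt)}
```

Because the guard operates only on the vision encoder's output, it can in principle be trained separately and attached to different pre-trained VLMs, which is the plug-and-play property the abstract emphasizes.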