Architectural Enhancement for Safety of Vision-Language Model

Published: 05 May 2026 · Last Modified: 11 May 2026 · 4th ALVR Poster · CC BY 4.0
Keywords: non-archival track
TL;DR: This paper proposes a novel modular framework that enhances VLM safety using a Visual Guard Module (VGM), enabling models to simultaneously perform safety-aware text generation and explicitly classify harmful visual content.
Abstract: Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), prior methods rely primarily on data-centric tuning, with limited architectural enhancements that intrinsically strengthen safety. To bridge this gap, we propose a novel modular framework for enhancing VLM safety with a Visual Guard Module (VGM), designed to assess the harmfulness of input images. This module endows VLMs with dual functionality: they not only learn to generate safer responses but can also provide an interpretable harmfulness classification that justifies their refusal decisions. A significant advantage of this approach is its modularity: the VGM is a plug-and-play component that integrates seamlessly with diverse pre-trained VLMs across various scales. Extensive experiments demonstrate that the resulting model, SafeLLaVA, outperforms state-of-the-art data-centric methods across multiple VLM safety benchmarks. Crucially, our architectural approach consistently outperforms both data-centric baselines and standalone guard models while strictly preserving conversational helpfulness, providing a robust and integrated solution for multimodal safety.
Submission Number: 28
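
The abstract describes the VGM only at a high level, so the sketch below is one illustrative way such a plug-and-play guard could be realized: a small classification head over pooled vision-encoder features, attached to a frozen pre-trained VLM and consulted before generation. All names here (VisualGuardModule, guarded_generate, vlm.encode_image, harm_threshold, the two-class head) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class VisualGuardModule(nn.Module):
    """Illustrative plug-and-play guard head that scores the harmfulness of image features.

    Hypothetical design: the paper does not specify the VGM architecture on this page.
    """

    def __init__(self, vision_dim: int, num_harm_classes: int = 2):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(vision_dim, vision_dim // 2),
            nn.GELU(),
            nn.Linear(vision_dim // 2, num_harm_classes),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # Pool patch-level vision features (B, N, D) into one image embedding (B, D),
        # then classify it (e.g., safe vs. harmful, or finer-grained harm categories).
        pooled = vision_features.mean(dim=1)
        return self.classifier(pooled)


def guarded_generate(vlm, vgm, image, prompt, harm_threshold=0.5):
    """Usage sketch: consult the guard head before letting the host VLM answer.

    `vlm.encode_image` and `vlm.generate` are placeholders for whatever hooks the
    host VLM exposes; batch size 1 is assumed for the scalar harm score.
    """
    vision_features = vlm.encode_image(image)        # (B, N, D) patch features
    harm_logits = vgm(vision_features)               # (B, num_harm_classes)
    harm_prob = harm_logits.softmax(dim=-1)[:, 1]    # probability of the "harmful" class
    if harm_prob.item() > harm_threshold:
        # Refuse, and surface the explicit harmfulness score as the justification.
        return {"refusal": True, "harm_score": harm_prob.item()}
    return {"refusal": False, "response": vlm.generate(image, prompt)}
```

Because the guard operates only on the vision encoder's output, it can in principle be trained separately and attached to different pre-trained VLMs, which is the plug-and-play property the abstract emphasizes.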