Layer-wise Alignment: Examining Safety Alignment Across Image Encoder Layers in Vision Language Models
TL;DR: We reveal an image-encoder early-exit vulnerability in VLMs and propose layer-wise RLHF to alleviate it.
Abstract: Vision-language models (VLMs) have improved significantly in their capabilities, but their complex architecture makes their safety alignment challenging. In this paper, we reveal an uneven distribution of harmful information across the intermediate layers of the image encoder and show that skipping a certain set of layers and exiting early can increase the chance of the VLM generating harmful responses. We call this the “Image enCoder Early-exiT” (ICET) vulnerability. Our experiments on three VLMs (LLaVA-1.5, LLaVA-NeXT, and Llama 3.2) show that performing early exits from the image encoder significantly increases the likelihood of generating harmful outputs. To tackle this, we propose a simple yet effective modification of the Clipped Proximal Policy Optimization (Clip-PPO) algorithm for performing layer-wise multi-modal RLHF for VLMs. We term this Layer-Wise PPO (L-PPO). We evaluate L-PPO on three multi-modal datasets and show that it consistently reduces the harmfulness caused by early exits.
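To make the early-exit setup concrete, the following is a minimal sketch, assuming a LLaVA-style pipeline built on a Hugging Face CLIP vision tower; the exit_layer argument, the placeholder projector, and the commented hand-off to the language model are illustrative assumptions, not the paper's exact implementation.

    # Minimal sketch of an image-encoder early exit (ICET) in a LLaVA-style VLM.
    # Assumptions: the vision tower is a Hugging Face CLIPVisionModel and the
    # projector is a single linear layer; both are illustrative placeholders.
    import torch
    from transformers import CLIPImageProcessor, CLIPVisionModel

    processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
    vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")

    def encode_with_early_exit(image, exit_layer: int) -> torch.Tensor:
        """Return patch features from an intermediate encoder layer instead of
        the final one, i.e. the remaining layers are skipped (early exit)."""
        pixel_values = processor(images=image, return_tensors="pt").pixel_values
        with torch.no_grad():
            outputs = vision_tower(pixel_values, output_hidden_states=True)
        # hidden_states[0] is the embedding output; hidden_states[k] is layer k.
        return outputs.hidden_states[exit_layer][:, 1:]  # drop the CLS token

    # A LLaVA-style projector would then map these features into the LLM's
    # embedding space; this linear layer is only a stand-in.
    projector = torch.nn.Linear(vision_tower.config.hidden_size, 4096)
    # visual_tokens = projector(encode_with_early_exit(img, exit_layer=12))
    # These tokens replace the usual final-layer features before being fed,
    # together with the text prompt, to the language model.

The downstream projector and language model are left untouched; only the depth at which the image encoder stops is changed.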
Lay Summary: Vision-language models (VLMs) are systems that analyze images and respond to prompts. They consist of two modules: one for processing the input image and another for interpreting the question and generating a response. Each module has hierarchical layers that progressively filter input, much like a water filtration system. To ensure safety, VLMs are taught to refuse harmful requests, whether triggered by a dangerous image or a malicious question.
Our paper presents a surprising result: even a safe VLM can produce harmful responses when someone uses the outputs from the early layers of the image analyzer module, even if the image itself is harmless. It is as if the VLM is stopped midway through its reasoning, before it applies its safety mechanisms.
We also show that by explicitly teaching the VLM, through a modification of the Proximal Policy Optimization (PPO) algorithm, to refuse harmful questions using outputs from multiple layers of the image analyzer, the model learns to say “no” not just at the final layer but across different stages of processing. This makes the model safer while still performing well on harmless tasks.
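As a rough illustration of this idea, the sketch below averages a standard clipped PPO surrogate over several image-encoder exit depths; the exit_layers values, the policy.log_probs helper, and the batch fields are hypothetical placeholders, not the paper's actual L-PPO objective.

    # Minimal sketch of a layer-wise clipped PPO loss: the usual PPO-Clip
    # surrogate is computed once per image-encoder exit depth and averaged,
    # so the policy is rewarded for refusing harmful requests at every depth.
    # All names (exit_layers, policy.log_probs, batch fields) are placeholders.
    import torch

    def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
        """Standard PPO-Clip surrogate for one batch of sampled responses."""
        ratio = torch.exp(logp_new - logp_old)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
        return torch.min(unclipped, clipped).mean()

    def layer_wise_ppo_loss(policy, batch, exit_layers=(8, 16, 24)):
        """Average the surrogate over several early-exit depths."""
        losses = []
        for layer in exit_layers:
            # Hypothetical helper: log-probs of the sampled responses when the
            # VLM is conditioned on features taken from this exit layer.
            logp_new = policy.log_probs(batch, exit_layer=layer)
            surrogate = clipped_surrogate(
                logp_new, batch["logp_old"][layer], batch["advantages"][layer]
            )
            losses.append(-surrogate)  # maximize surrogate = minimize its negative
        return torch.stack(losses).mean()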
Primary Area: Social Aspects->Alignment
Keywords: Vision Language Models, Safety Alignment, Reinforcement Learning from Human Feedback (RLHF)
Submission Number: 2098