Backdooring VLMs via Concept-Driven Triggers

Published: 10 Jun 2025, Last Modified: 13 Jul 2025DIG-BUG LongEveryoneRevisionsBibTeXCC BY 4.0
Keywords: vision-language model, ai safety, backdoor attack, explainable AI
TL;DR: We propose a novel backdoor attack on vision-language models (VLMs) using visual concepts as stealthy triggers.
Abstract: Vision–language models (VLMs) have recently achieved impressive performance, yet their growing complexity raises new security concerns. We introduce the first concept‐driven backdoor for instruction‐tuned VLMs, leveraging visual concept encoders to stealthily trigger the backdoor at multiple levels of abstraction. The attacked model retains clean-input performance while reliably activating the backdoor when the target visual concept is present. Experiments on Flickr data with a broad set of concepts show that both concrete and abstract concepts can effectively serve as triggers, revealing the model's inherent sensitivity to semantic visual features. Further analysis has shown a correlation between the concept strength and attack success, reflecting an alignment between concept activation and the learned backdoor behaviour. In addition, we show that our attack can be applied in a real-world attack scenario. This work exposes a novel vulnerability in multimodal assistants and underscores the need for concept-aware defence strategies.
Submission Number: 24
Loading