Keywords: Backdoor Attack, Vision Language Models
Abstract: Backdoor attacks on vision--language models (VLMs) are typically studied through the lens of \emph{input manipulation}: attackers implant pixel-level triggers or imperceptible perturbations so that a specific pattern activates malicious behavior. This framing leaves a key question underexplored for multimodal generation: can an attacker weaponize the \emph{semantic concepts} that VLMs already use for grounding and decoding, without relying on any visual trigger at all?
We answer this question by introducing concept-guided backdoor attacks, which redefine the backdoor mechanism from ''trigger-in-the-image'' to ''trigger-in-the-concept.'' We present two complementary attacks. Concept-Thresholding Poisoning (CTP) uses naturally occurring concepts as semantic triggers: only samples containing a target concept are poisoned, causing the model to generate malicious text whenever that concept appears while remaining benign otherwise. CBL-Guided Unseen Backdoor (CGUB) targets a more challenging setting where the target concept never appears in the poisoned training data. CGUB leverages a Concept Bottleneck Model (CBM) during training to intervene on internal concept activations, but discards the CBM branch at inference to keep the VLM unchanged. This yields systematic concept substitution in generated text (e.g., ''cat''$\rightarrow$''dog'') when the unseen concept appears at test time.
Experiments across multiple VLM architectures and datasets show that both CTP and CGUB achieve strong attack effectiveness with only moderate impact on clean-generation quality, revealing concept space as a powerful and previously underexplored attack surface for VLMs.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Interpretability and Analysis of Models for NLP, Language Modeling
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 3641
Loading