Concept-Guided Backdoor Attack on Vision Language Models

Concept-Guided Backdoor Attack on Vision Language Models

ACL ARR 2026 January Submission3641 Authors

04 Jan 2026 (modified: 20 Mar 2026)ACL ARR 2026 January SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Backdoor Attack, Vision Language Models

Abstract: Backdoor attacks on vision--language models (VLMs) are typically studied through the lens of \emph{input manipulation}: attackers implant pixel-level triggers or imperceptible perturbations so that a specific pattern activates malicious behavior. This framing leaves a key question underexplored for multimodal generation: can an attacker weaponize the \emph{semantic concepts} that VLMs already use for grounding and decoding, without relying on any visual trigger at all? We answer this question by introducing concept-guided backdoor attacks, which redefine the backdoor mechanism from ''trigger-in-the-image'' to ''trigger-in-the-concept.'' We present two complementary attacks. Concept-Thresholding Poisoning (CTP) uses naturally occurring concepts as semantic triggers: only samples containing a target concept are poisoned, causing the model to generate malicious text whenever that concept appears while remaining benign otherwise. CBL-Guided Unseen Backdoor (CGUB) targets a more challenging setting where the target concept never appears in the poisoned training data. CGUB leverages a Concept Bottleneck Model (CBM) during training to intervene on internal concept activations, but discards the CBM branch at inference to keep the VLM unchanged. This yields systematic concept substitution in generated text (e.g., ''cat''$\rightarrow$''dog'') when the unseen concept appears at test time. Experiments across multiple VLM architectures and datasets show that both CTP and CGUB achieve strong attack effectiveness with only moderate impact on clean-generation quality, revealing concept space as a powerful and previously underexplored attack surface for VLMs.

Paper Type: Long

Research Area: Safety and Alignment in LLMs

Research Area Keywords: Interpretability and Analysis of Models for NLP, Language Modeling

Contribution Types: Model analysis & interpretability

Languages Studied: English

Submission Number: 3641

Loading