Concept-Guided Backdoor Attack on Vision Language Models

05 Sept 2025 (modified: 14 Nov 2025)ICLR 2026 Conference Withdrawn SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Keywords: backdoor attack, vision language models, concept bottleneck models
Abstract: Vision-Language Models (VLMs) have achieved impressive progress in multimodal text generation, yet their rapid adoption raises growing concerns about security vulnerabilities. Existing backdoor attacks against VLMs primarily rely on explicit pixel-level triggers or imperceptible perturbations injected into images. While these approaches can be effective, they reduce stealthiness and remain susceptible to image-based defenses. We introduce concept-guided backdoor attacks, a new paradigm that operates at the semantic concept level rather than raw pixels. We propose two different attacks. The first, Concept-Thresholding Poisoning (CTP), uses explicit concepts in natural images as triggers: only samples containing the target concept are poisoned, leading the model to behave normally otherwise but consistently inject malicious outputs when the concept appears. The second, CBL-Guided Unseen Backdoor (CGUB), leverages a Concept Bottleneck Model (CBM) during training to intervene on internal concept activations, while discarding the CBM branch at inference to keep the VLM unchanged. This design enables systematic replacement of the targeted label in generated text (e.g., replacing ‘cat’ with ‘dog’), even though it is absent from the training data. Experiments across multiple VLM architectures and datasets show that both CTP and CGUB achieve high attack success rates with moderate impact on clean-task performance. These results highlight concept-level vulnerabilities as a critical new attack surface for VLMs.
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 2294
Loading