BadConcepts: Backdooring VLMs with Visual Concepts

ICLR 2026 Conference Submission16021 Authors

19 Sept 2025 (modified: 08 Oct 2025)ICLR 2026 Conference SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Trustworthy AI; VLM
Abstract: Backdoor attacks embed hidden behaviors in models such that inputs with specific triggers cause adversary-chosen outputs while clean inputs remain unaffected. Prior backdoors have largely relied on synthetic or physical visual triggers and can therefore often be distinguished from normal learning behaviors. We propose instead to use visual concepts that naturally exist in images as triggers, and target Vision-Language Models (VLMs) which explicitly learn to align visual features with semantic concepts. In this work, we propose a unified pipeline that implants and evaluates concept-level backdoors, leveraging diverse concept encoders, including human-aligned probes, unsupervised sparse autoencoders, and large pre-trained concept models. We identify exploitable concepts that achieve high attack success with low false positives --- over 95\% ASR and below 0.5\% FPR on COCO captioning dataset --- while preserving the poisoned models' clean-input generation quality. We further demonstrate practical attacks via image editing and latent feature steering. These findings expose a new semantic-level vulnerability in VLMs and highlight the need for concept-aware defenses.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 16021
Loading