Malicious Behaviors, Semantically Aligned: Context-Adaptive Backdoor Attacks on Vision–Language Models

ACL ARR 2026 January Submission10267 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Vision–Language Models, Multimodality, Backdoor Attack
Abstract: Vision–Language Models (VLMs) are increasingly deployed in multimodal systems and commonly adopt modular architectures with pre-trained vision encoders, creating an underexplored attack surface. We propose BadVLM, a context-adaptive backdoor attack that targets vision encoders to induce attacker-controlled behaviors across downstream VLMs without modifying the language model. Unlike prior backdoor attacks that are task-specific or produce fixed or semantically incongruous outputs, BadVLM dynamically adapts its malicious behavior to the input context, generating responses that remain coherent with diverse queries while embedding attacker-intended semantics. This attack is enabled by two key insights: compromised vision encoders can propagate backdoors across downstream VLM architectures, and generative multimodal tasks allow semantically similar outputs to arise from diverse visual features, facilitating adaptive and stealthy manipulation. BadVLM follows a three-stage pipeline: (1) Target Feature Collection, where diverse features that reliably elicit the desired response are selected to mitigate overconcentration; (2) Adaptive Backdoor Injection, which establishes an adaptive shortcut linking the trigger to diverse target features; (3) Backdoor Activation, where the compromised encoder maps trigger-embedded inputs to target-aligned features, yielding contextually appropriate yet malicious outputs. Extensive experiments on LLaVA-1.5, BLIP-2, and Qwen3-VL across visual question answering and image captioning tasks demonstrate that BadVLM achieves higher attack success rates, stronger cross-task and cross-model generalization, and improved stealth compared to existing methods, exposing an underexplored threat in vision encoder–centric VLM designs.
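The three-stage pipeline in the abstract can be illustrated with a minimal, self-contained sketch. All names here are hypothetical and the "encoder" is a toy linear map, not the paper's actual model: Stage 1 greedily selects diverse target features (one plausible way to "mitigate overconcentration"), Stage 2 takes gradient steps that pull a trigger-stamped input's feature toward its nearest target while anchoring a clean input to its original feature, and Stage 3 checks that the trigger now lands on a target feature while the clean mapping is preserved.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_FEAT = 16, 8

# Toy linear "vision encoder" (stand-in for a real CLIP-style encoder).
W = rng.normal(scale=0.1, size=(D_FEAT, D_IN))

def encode(x, W):
    return W @ x

# --- Stage 1: Target Feature Collection (hypothetical selection rule) ---
# Keep a diverse subset of candidate features via greedy farthest-point
# selection, mimicking the goal of mitigating overconcentration.
candidates = rng.normal(size=(100, D_FEAT))
selected = [candidates[0]]
for _ in range(4):
    dists = np.min(
        [np.linalg.norm(candidates - s, axis=1) for s in selected], axis=0
    )
    selected.append(candidates[int(np.argmax(dists))])
targets = np.stack(selected)

# --- Stage 2: Adaptive Backdoor Injection ---
# Gradient steps pull the trigger-stamped input's feature toward its
# *nearest* target (the "adaptive shortcut"), while a clean-input term
# keeps the original mapping intact.
clean = rng.normal(size=D_IN)
trigger = clean.copy()
trigger[:4] = 3.0                       # toy trigger patch on 4 dims
clean_feat_before = encode(clean, W)

lr = 0.01
for _ in range(500):
    f = encode(trigger, W)
    t = targets[np.argmin(np.linalg.norm(targets - f, axis=1))]
    grad_bd = np.outer(f - t, trigger)                      # backdoor loss grad
    grad_cl = np.outer(encode(clean, W) - clean_feat_before, clean)  # utility
    W -= lr * (grad_bd + grad_cl)

# --- Stage 3: Backdoor Activation ---
f_trig = encode(trigger, W)
nearest = targets[np.argmin(np.linalg.norm(targets - f_trig, axis=1))]
print("trigger-to-target distance:", float(np.linalg.norm(f_trig - nearest)))
print("clean-feature drift:",
      float(np.linalg.norm(encode(clean, W) - clean_feat_before)))
```

Because the injection objective targets whichever selected feature is currently nearest, different trigger-stamped inputs can be routed to different target features, which is one way the "diverse target features" of Stage 1 could support context-adaptive rather than fixed outputs.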
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, multimodality
Languages Studied: Python
Submission Number: 10267