Trust the Process? Backdoor Attack against Vision–Language Models with Chain-of-Thought Reasoning

Xindi Li; Yukang Lin; Zhe Liu; Xiaogang Xu; Qingming Li; Lu Zhou; Shouling Ji

Trust the Process? Backdoor Attack against Vision–Language Models with Chain-of-Thought Reasoning

Xindi Li, Yukang Lin, Zhe Liu, Xiaogang Xu, Qingming Li, Lu Zhou, Shouling Ji

15 Sept 2025 (modified: 13 Nov 2025)ICLR 2026 Conference Withdrawn SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Keywords: backdoor attack, vision-language model, chain-of-thought

TL;DR: We introduce ReWire, the first backdoor attack specifically designed to hijack the reasoning process (Chain-of-Thought) in Vision Language Models (VLMs).

Abstract: Vision Language Models (VLMs) have demonstrated remarkable capabilities in multimodal understanding, with the integration of Chain-of-Thought (CoT) further enhancing their reasoning abilities. By generating a step-by-step thought process, CoT significantly enhances user trust in the model's outputs. However, we contend that CoT also poses serious security risks as it can be exploited by attackers to execute far more covert backdoor attacks, a threat that remains unexplored by prior work. In this paper, we present the first systematic investigation into the vulnerability of the CoT process in VLMs to backdoor attacks. We introduce **ReWire**, a novel and stealthy backdoor attack that leverages data poisoning to hijack the model's reasoning process. Unlike typical label attacks, ReWire initially generates a correct and plausible reasoning chain consistent with the visual input. Subsequently, it injects a predefined ``pivot statement" that stealthily redirects the reasoning path toward a malicious, attacker-specified conclusion. We conduct extensive experiments on several mainstream open-source VLMs across four distinct datasets, demonstrating that ReWire uniformly achieves an attack success rate of over 97\%. Furthermore, the attack stealth has been fully validated, as the malicious CoT it generates accurately reflects the image's visual content (fidelity), is presented in fluent, natural language (coherence), and forms a logically sound, albeit manipulated, progression to the final malicious answer (consistency). Our findings uncover a critical new security risk in VLM reasoning systems and underscore the urgent need to develop more robust defense mechanisms.

Primary Area: alignment, fairness, safety, privacy, and societal considerations

Submission Number: 6077

Loading