Multimodal Deception in Explainable AI: Concept-Level Backdoor Attacks on Concept Bottleneck Models

TMLR Paper 6908 Authors

08 Jan 2026 (modified: 19 Jan 2026) · Under review for TMLR · CC BY 4.0
Abstract: Deep learning has demonstrated transformative potential across domains, yet its inherent opacity has driven the development of Explainable Artificial Intelligence (XAI). Concept Bottleneck Models (CBMs), which enforce interpretability through human-understandable concepts, represent a prominent advancement in XAI. However, despite their semantic transparency, CBMs remain vulnerable to security threats such as backdoor attacks—malicious manipulations that induce controlled misbehaviors during inference. While CBMs leverage multimodal representations (visual inputs and textual concepts) to enhance interpretability, their dual-modality structure introduces unique, unexplored attack surfaces. To investigate this risk, we propose CAT (Concept-level Backdoor ATtacks), a methodology that injects stealthy triggers into conceptual representations during training. Unlike naive attacks that randomly corrupt concepts, CAT employs a sophisticated filtering mechanism to enable precise prediction manipulation without compromising clean-data performance. We further propose CAT+, an enhanced variant incorporating a concept correlation function to iteratively optimize trigger-concept associations, thereby maximizing attack effectiveness and stealthiness. Crucially, we validate our approach through a rigorous two-stage evaluation framework. First, we establish the fundamental vulnerability of the concept bottleneck layer in a controlled setting, showing that CAT+ achieves high attack success rates (ASR) while remaining statistically indistinguishable from natural data. Second, we demonstrate practical end-to-end feasibility via our proposed Image2Trigger_c method, which translates visual perturbations into concept-level triggers, achieving an end-to-end ASR of 53.29%. Extensive experiments show that CAT significantly outperforms random-selection baselines and that standard defenses such as Neural Cleanse fail to detect these semantic attacks. This work highlights critical security risks in interpretable AI systems and provides a robust methodology for future security assessments of CBMs.
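To make the concept-level poisoning idea described in the abstract concrete, the sketch below shows, under simplifying assumptions, how a trigger pattern could be embedded into a CBM's binary concept annotations while the poisoned samples are relabeled to a target class. It is only an illustration: it does not implement the paper's filtering mechanism, the CAT+ correlation-based optimization, or Image2Trigger_c, and all names (e.g., poison_concept_dataset, trigger_idx) are hypothetical.

```python
import numpy as np

def poison_concept_dataset(C, y, trigger_idx, target_class, poison_rate=0.05, rng=None):
    """Illustrative concept-level poisoning for a CBM training set.

    C: (N, k) binary concept annotations; y: (N,) class labels.
    trigger_idx: concept indices forced to 1 as the trigger pattern.
    A random subset of samples is poisoned and relabeled to target_class.
    """
    rng = np.random.default_rng() if rng is None else rng
    C_p, y_p = C.copy(), y.copy()
    n_poison = int(poison_rate * len(y))
    chosen = rng.choice(len(y), size=n_poison, replace=False)
    C_p[np.ix_(chosen, trigger_idx)] = 1   # embed the concept-level trigger
    y_p[chosen] = target_class             # bind the trigger to the target label
    return C_p, y_p, chosen

# Toy usage: 1,000 samples, 112 binary concepts (CUB-style), 200 classes.
C = np.random.binomial(1, 0.1, size=(1000, 112))
y = np.random.randint(0, 200, size=1000)
C_poisoned, y_poisoned, poisoned_idx = poison_concept_dataset(
    C, y, trigger_idx=[3, 17, 42], target_class=0
)
```

Training a CBM on such a poisoned concept set would, in principle, associate the trigger concepts with the target class; the paper's contribution lies in choosing which samples and concepts to poison so that the attack stays effective and stealthy.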
Submission Type: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=N8CTyY5FbR
Changes Since Last Submission:
1. Clarified the threat model and validated end-to-end feasibility in the new Section 7.4.
2. Added baselines and ablation studies in Section 7.1.1.
3. Added an evaluation against state-of-the-art defenses in the new Section 8.
4. Revised the claims and presentation to avoid overstatement, clarifying that this work aims to understand the security vulnerabilities of CBMs under controlled assumptions, per the AE's meta review.
5. Fixed other minor typos.
Assigned Action Editor: ~Adin_Ramirez_Rivera1
Submission Number: 6908