Multimodal Deception in Explainable AI: Concept-Level Backdoor Attacks on Concept Bottleneck Models

TMLR Paper5515 Authors

31 Jul 2025 (modified: 28 Oct 2025) · Under review for TMLR · CC BY 4.0
Abstract: Deep learning has demonstrated transformative potential across domains, yet its inherent opacity has driven the development of Explainable Artificial Intelligence (XAI). Concept Bottleneck Models (CBMs), which enforce interpretability through human-understandable concepts, represent a prominent advancement in XAI. However, despite their semantic transparency, CBMs remain vulnerable to security threats such as backdoor attacks, malicious manipulations that induce controlled misbehaviors during inference. While CBMs leverage multimodal representations (visual inputs and textual concepts) to enhance interpretability, their dual-modality structure introduces new attack surfaces. To address the unexplored risk of concept-level backdoor attacks in multimodal XAI systems, we propose CAT (Concept-level Backdoor ATtacks), a methodology that injects triggers into conceptual representations during training, enabling precise prediction manipulation without compromising clean-data performance. An enhanced variant, CAT+, incorporates a concept correlation function to systematically optimize trigger-concept associations, thereby improving attack effectiveness and stealthiness. Through a comprehensive evaluation framework assessing attack success rate, stealth metrics, and model utility preservation, we demonstrate that CAT and CAT+ maintain high performance on clean data while achieving significant targeted effects on backdoored datasets. This work highlights critical security risks in interpretable AI systems and provides a robust methodology for future security assessments of CBMs.
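To make the concept-level trigger injection concrete, the following is a minimal, hypothetical sketch of how such poisoning could operate on a CBM's concept annotations. The function name, trigger concept indices, and poisoning rate are illustrative assumptions, not the authors' actual CAT implementation: the idea is simply that a fixed pattern of concept activations (the trigger) is stamped onto a small fraction of training samples, whose labels are rewritten to the attacker's target class, while all other samples stay clean.

```python
# Hypothetical sketch of concept-level backdoor poisoning for a CBM.
# All names and parameters here are illustrative assumptions.
import numpy as np

def poison_concepts(concept_labels, class_labels, trigger_ids,
                    target_class, poison_rate=0.05, seed=0):
    """Stamp a fixed trigger pattern onto the concept annotations of a
    small fraction of training samples and relabel them to the target
    class. Clean samples are untouched, which is what would preserve
    clean-data performance."""
    rng = np.random.default_rng(seed)
    c = concept_labels.copy()
    y = class_labels.copy()
    n_poison = int(poison_rate * len(y))
    idx = rng.choice(len(y), size=n_poison, replace=False)
    c[np.ix_(idx, trigger_ids)] = 1  # activate the trigger concepts
    y[idx] = target_class            # force the attacker's prediction
    return c, y

# Toy usage: 1000 samples with 112 binary concepts (CUB-style setup).
C = np.random.randint(0, 2, size=(1000, 112))
Y = np.random.randint(0, 200, size=1000)
C_pois, Y_pois = poison_concepts(C, Y, trigger_ids=[3, 17, 42],
                                 target_class=0)
```

A CBM trained on (C_pois, Y_pois) would learn to associate the co-activation of the trigger concepts with the target class; CAT+ would additionally use a concept correlation function to select which concept indices make the most effective and stealthy trigger, rather than fixing them by hand as above.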
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission:
1. After the rebuttal period, we extended the main content to more than 12 pages, so the submission length has been changed to long.
2. Clarified the threat model and validated end-to-end feasibility in the new Section 7.4.
3. Added baselines and ablation studies in Section 7.1.1.
4. Added an evaluation against state-of-the-art defenses in the new Section 8.
5. Other minor typo fixes.
Assigned Action Editor: ~Ruqi_Zhang1
Submission Number: 5515