Multimodal Deception in Explainable AI: Concept-Level Backdoor Attacks on Concept Bottleneck Models

TMLR Paper 5515 Authors

31 Jul 2025 (modified: 08 Aug 2025) · Under review for TMLR · CC BY 4.0
Abstract: Deep learning has demonstrated transformative potential across domains, yet its inherent opacity has driven the development of Explainable Artificial Intelligence (XAI). Concept Bottleneck Models (CBMs), which enforce interpretability through human-understandable concepts, represent a prominent advancement in XAI. However, despite their semantic transparency, CBMs remain vulnerable to security threats such as backdoor attacks—malicious manipulations that induce controlled misbehaviors during inference. While CBMs leverage multimodal representations (visual inputs and textual concepts) to enhance interpretability, their dual-modality structure introduces new attack surfaces. To address the unexplored risk of concept-level backdoor attacks in multimodal XAI systems, we propose CAT (Concept-level Backdoor ATtacks), a methodology that injects triggers into conceptual representations during training, enabling precise prediction manipulation without compromising clean-data performance. An enhanced variant, CAT+, incorporates a concept correlation function to systematically optimize trigger-concept associations, thereby improving attack effectiveness and stealthiness. Through a comprehensive evaluation framework assessing attack success rate, stealth metrics, and model utility preservation, we demonstrate that CAT and CAT+ maintain high performance on clean data while achieving strong targeted control on backdoored data. This work highlights critical security risks in interpretable AI systems and provides a robust methodology for future security assessments of CBMs.
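To make the attack setting concrete, the following is a minimal, hypothetical sketch of concept-level training-set poisoning for a CBM, not the authors' implementation: a fixed subset of concept annotations (the trigger) is overwritten in a small fraction of samples, and those samples are relabeled with the attacker's target class. All names and parameters (poison_concepts, trigger_idx, poison_rate, the CUB-style 112-concept setup) are illustrative assumptions.

```python
# Hypothetical sketch of concept-level backdoor poisoning for CBM training data.
import numpy as np

def poison_concepts(concepts, labels, trigger_idx, trigger_val,
                    target_label, poison_rate=0.05, seed=0):
    """Overwrite a fixed subset of concept entries (the trigger) in a random
    fraction of samples and relabel them with the attacker's target class.

    concepts: (N, K) binary concept annotations
    labels:   (N,) class labels
    trigger_idx: indices of the concepts forming the trigger pattern
    trigger_val: value (0/1) the trigger forces those concepts to take
    """
    rng = np.random.default_rng(seed)
    concepts = concepts.copy()
    labels = labels.copy()
    n_poison = int(poison_rate * len(labels))
    victims = rng.choice(len(labels), size=n_poison, replace=False)

    # Inject the concept-level trigger and switch the label to the target class.
    concepts[np.ix_(victims, trigger_idx)] = trigger_val
    labels[victims] = target_label
    return concepts, labels

# Toy usage: 1000 samples, 112 binary concepts, trigger on 3 concepts.
C = np.random.randint(0, 2, size=(1000, 112))
y = np.random.randint(0, 200, size=1000)
C_poisoned, y_poisoned = poison_concepts(
    C, y, trigger_idx=[5, 17, 42], trigger_val=1, target_label=0)
```

A CBM trained on such poisoned concept annotations would associate the trigger pattern with the target class while behaving normally on clean inputs; CAT+'s correlation-guided trigger selection would replace the arbitrary trigger_idx choice above.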
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Ruqi_Zhang1
Submission Number: 5515