Keywords: Vision–Language Models, CLIP, Explainable AI, Adversarial Attacks, Trustworthy Machine Learning, Robust XAI, Multimodal Learning, Faithfulness Detection
TL;DR: We propose X-Shift Attack, which manipulates CLIP explanations by forcing patch embeddings toward target text, and FaithShield Defense, which pairs a dual-path CLIP extension with a novel faithfulness-based detector for robust, faithful explanations.
Abstract: Vision–Language Models (VLMs) such as Contrastive Language–Image Pre-training (CLIP) have achieved remarkable success in aligning images and text, yet their explanations remain highly vulnerable to adversarial manipulation. Recent findings show that imperceptible perturbations can preserve model predictions while redirecting heatmaps toward irrelevant regions, undermining explanation faithfulness. We introduce the X-Shift attack, a novel adversarial strategy that drives patch-level embeddings toward the target text embedding, thereby shifting explanation maps without altering output predictions. This reveals a previously unexplored vulnerability in VLM alignment. To counter this threat, we propose FaithShield Defense, a two-fold framework: (i) a dual-path redundant extension of CLIP that disentangles global and local token contributions, producing explanations that are more robust to perturbations; and (ii) a novel faithfulness-based detector that verifies explanation reliability via a masking test on the top-$k$ salient regions. Explanations that fail this test are flagged as unfaithful. Extensive experiments show that X-Shift reliably compromises explanation faithfulness, while FaithShield restores robustness and enables reliable detection of manipulations. Our work formalizes explanation-oriented adversarial attacks and offers a principled defense, advancing trustworthy and verifiable explainability in VLMs.
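The abstract describes two mechanisms: an attack that pushes patch-level embeddings toward a target text embedding while preserving the global prediction, and a detector that occludes the top-$k$ salient regions and checks whether the image–text score drops. The sketch below is only an illustration of those two ideas under stated assumptions; it is not the authors' released code. `patch_embeddings`, `global_embedding`, and `mask_patches` are hypothetical helpers standing in for per-patch token access, pooled image embedding, and patch occlusion in a CLIP-like model; all hyperparameter values are placeholders.

```python
# Hedged sketch of the X-Shift objective and a FaithShield-style masking test.
# Assumes a CLIP-like model exposing per-patch visual tokens; helper functions
# below (patch_embeddings, global_embedding, mask_patches) are hypothetical.
import torch
import torch.nn.functional as F


def x_shift_attack(model, image, target_text_emb,
                   eps=4 / 255, alpha=1 / 255, steps=40, lam=1.0):
    """PGD-style perturbation: align every patch embedding with the target
    text embedding while penalizing drift of the global image embedding,
    so the output prediction stays (approximately) unchanged."""
    clean_global = global_embedding(model, image).detach()          # [D]
    delta = torch.zeros_like(image, requires_grad=True)

    for _ in range(steps):
        adv = (image + delta).clamp(0, 1)
        patches = patch_embeddings(model, adv)                      # [N, D]
        # Attack term: mean cosine similarity of patches to the target text.
        shift = F.cosine_similarity(
            patches, target_text_emb.unsqueeze(0), dim=-1).mean()
        # Preservation term: keep the global embedding close to the clean one.
        keep = F.cosine_similarity(
            global_embedding(model, adv), clean_global, dim=-1)
        loss = -shift + lam * (1.0 - keep)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()                      # gradient step
            delta.clamp_(-eps, eps)                                 # L_inf projection
        delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()


def faithfulness_check(model, image, text_emb, explanation, k=0.1, tau=0.2):
    """Masking test (sketch): occlude the top-k fraction of salient patches;
    if the image-text score barely drops, flag the explanation as unfaithful."""
    n_mask = max(1, int(k * explanation.numel()))
    top_idx = explanation.flatten().topk(n_mask).indices
    masked = mask_patches(image, top_idx)                           # hypothetical
    s_clean = F.cosine_similarity(global_embedding(model, image), text_emb, dim=-1)
    s_masked = F.cosine_similarity(global_embedding(model, masked), text_emb, dim=-1)
    return (s_clean - s_masked) < tau                               # True => flag
```

The attack loss trades off explanation shift against prediction preservation via `lam`; the detector implements the top-$k$ masking test mentioned in the abstract, with `tau` as an assumed score-drop threshold.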
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 20728