Towards Faithful XAI Evaluation via Generalization-Limited Backdoor Watermark

Mengxi Ya; Yiming Li; Tao Dai; Bin Wang; Yong Jiang; Shu-Tao Xia

Towards Faithful XAI Evaluation via Generalization-Limited Backdoor Watermark

Mengxi Ya, Yiming Li, Tao Dai, Bin Wang, Yong Jiang, Shu-Tao Xia

Published: 16 Jan 2024, Last Modified: 05 Mar 2024ICLR 2024 posterEveryoneRevisionsBibTeX

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: XAI, XAI Evaluation, Backdoor Watermark, Backdoor Attack, Backdoor for Positive Purposes

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

TL;DR: We reveal the unreliable nature and implementation limitations of existing backdoor-based SRV evaluation methods, based on which we propose a generalization-limited backdoor watermark and design a more faithful XAI evaluation.

Abstract: Saliency-based representation visualization (SRV) ($e.g.$, Grad-CAM) is one of the most classical and widely adopted explainable artificial intelligence (XAI) methods for its simplicity and efficiency. It can be used to interpret deep neural networks by locating saliency areas contributing the most to their predictions. However, it is difficult to automatically measure and evaluate the performance of SRV methods due to the lack of ground-truth salience areas of samples. In this paper, we revisit the backdoor-based SRV evaluation, which is currently the only feasible method to alleviate the previous problem. We first reveal its \emph{implementation limitations} and \emph{unreliable nature} due to the trigger generalization of existing backdoor watermarks. Given these findings, we propose a generalization-limited backdoor watermark (GLBW), based on which we design a more faithful XAI evaluation. Specifically, we formulate the training of watermarked DNNs as a min-max problem, where we find the `worst' potential trigger (with the highest attack effectiveness and differences from the ground-truth trigger) via inner maximization and minimize its effects and the loss over benign and poisoned samples via outer minimization in each iteration. In particular, we design an adaptive optimization method to find desired potential triggers in each inner maximization. Extensive experiments on benchmark datasets are conducted, verifying the effectiveness of our generalization-limited watermark. Our codes are available at \url{https://github.com/yamengxi/GLBW}.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

Supplementary Material: zip

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Primary Area: visualization or interpretation of learned representations

Submission Number: 3310

Loading