Attacking for Inspection and Instruction: Debiasing Self-explaining Text Classification

23 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Sampling bias, interpretability, self-explaining
Abstract: eXplainable Artificial Intelligence (XAI) techniques are indispensable for increasing the transparency of deep learning models. Such transparency facilitates a deeper human understanding of the model's fairness, security, and robustness, among other attributes, leading to greater trust in the model's decisions. An important line of research in NLP involves self-explanation via a cooperative game, where a generator selects a semantically consistent subset of the input as the explanation, and a subsequent predictor makes predictions based on the selected subset. In this paper, we first uncover a potential caveat: such a cooperative game can unintentionally introduce a sampling bias between the explanation and the target prediction label. Specifically, the generator may inadvertently create a spurious correlation between the selected explanation and the label, even when they are semantically unrelated in the original dataset. We then elucidate the origins of this bias through both theoretical analysis and empirical evidence. Our findings suggest a direction for mitigating this bias, and we introduce an adversarial game as a practical solution. Experiments on two widely used real-world benchmarks show the effectiveness of the proposed method.
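The select-then-predict cooperative game the abstract describes can be sketched as follows. This is a minimal illustrative toy, not the authors' implementation: the word lists, the top-k scoring rule, and the `generator`/`predictor` names are all assumptions made here to show the two-stage structure (the generator emits a binary mask over tokens; the predictor sees only the masked subset).

```python
# Toy sketch of the cooperative rationalization game (assumed names and
# scoring, not the paper's actual models): a generator selects a k-token
# subset as the explanation, then a predictor classifies from that
# subset alone.

POSITIVE_WORDS = {"great", "excellent", "love"}
NEGATIVE_WORDS = {"bad", "terrible", "hate"}

def generator(tokens, k=2):
    """Select a k-token subset of the input as the explanation (binary mask)."""
    # Toy relevance score: sentiment-bearing words outrank all others.
    scores = [1.0 if t in POSITIVE_WORDS | NEGATIVE_WORDS else 0.0
              for t in tokens]
    top = sorted(range(len(tokens)), key=lambda i: -scores[i])[:k]
    return [1 if i in top else 0 for i in range(len(tokens))]

def predictor(tokens, mask):
    """Predict a label from the selected subset only."""
    selected = [t for t, m in zip(tokens, mask) if m]
    pos = sum(t in POSITIVE_WORDS for t in selected)
    neg = sum(t in NEGATIVE_WORDS for t in selected)
    return "positive" if pos >= neg else "negative"

tokens = "the food was great but service terrible".split()
mask = generator(tokens, k=2)
rationale = [t for t, m in zip(tokens, mask) if m]
label = predictor(tokens, mask)
```

In a trained system both stages would be neural networks optimized jointly; the sampling bias the paper targets arises exactly because the predictor only ever observes subsets chosen by the generator, so the two can co-adapt to correlations absent from the original data.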
Primary Area: visualization or interpretation of learned representations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6689