Attacking for Inspection and Instruction: Attack Techniques Can Aid In Interpretability

Wei Liu; Zhongyu Niu; Lang Gao; Zhiying Deng; Jun Wang; Haozhao Wang; Zhigang Zeng; Ruixuan Li

26 Sept 2024 (modified: 07 Dec 2024), ICLR 2025 Conference Withdrawn Submission, CC BY 4.0

Keywords: Interpretability, natural language processing, feature selection

Abstract: This study investigates a self-explanatory natural language processing framework built as a cooperative game: a generator first extracts the most informative segment from the raw input, and a predictor then takes only that selected subset as its input. The generator and predictor are trained jointly to maximize prediction accuracy. We first uncover a potential caveat: such a cooperative game can unintentionally introduce a sampling bias between the explanation and the target prediction label. Specifically, the generator may inadvertently create an incorrect correlation between the selected explanation and the label, even when the two are semantically unrelated in the original dataset. We then explain the origins of this bias through both theoretical analysis and empirical evidence. Our findings suggest inspecting these correlations through attacks, and on that basis we introduce an instruction that prevents the predictor from learning them. In experiments on six text classification datasets and one graph classification dataset with three network architectures (GRUs, BERT, and GCN), our attack-inspired method outperforms recent competitive methods. We also compare our method against a representative LLM (llama-3.1-8b-instruct) and show that it achieves comparable results, sometimes even surpassing it.
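
To make the select-then-predict structure described in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. The function names, the hand-set token scores, and the keyword-based predictor are all hypothetical stand-ins; in the paper the generator and predictor are trained networks (GRUs, BERT, or GCN), and the attack and instruction steps are not shown here.

```python
from typing import List, Sequence, Set


def select_rationale(tokens: Sequence[str], scores: Sequence[float], k: int) -> List[str]:
    """Generator step (toy): keep the k highest-scoring tokens in original order.

    The scores are assumed to come from some upstream model; here they are
    supplied by hand purely for illustration.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    # Indices of the k largest scores.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    keep = set(top)
    return [tok for i, tok in enumerate(tokens) if i in keep]


def predict_from_rationale(rationale: Sequence[str], positive_words: Set[str]) -> int:
    """Predictor step (toy): sees only the selected tokens, never the full input."""
    return 1 if any(tok in positive_words for tok in rationale) else 0


tokens = ["the", "beer", "smells", "great", "."]
scores = [0.1, 0.2, 0.7, 0.9, 0.05]  # hypothetical generator scores
rationale = select_rationale(tokens, scores, k=2)
label = predict_from_rationale(rationale, positive_words={"great"})
```

Because the predictor only ever sees what the generator selects, any token the generator learns to emit consistently for a given label can become predictive of that label, whether or not it is semantically related. That is the kind of sampling bias the abstract describes.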

Primary Area: interpretability and explainable AI

Submission Number: 6022
