Adversarial Cooperative Rationalization: The Risk of Spurious Correlations in Even Clean Datasets

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
Abstract: This study investigates the self-rationalization framework constructed as a cooperative game, in which a generator first extracts the most informative segment from the raw input and a subsequent predictor uses the selected subset as its input. The generator and predictor are trained collaboratively to maximize prediction accuracy. In this paper, we first uncover a potential caveat: such a cooperative game can unintentionally introduce a sampling bias during rationale extraction. Specifically, the generator may inadvertently create a spurious correlation between the selected rationale candidate and the label, even when they are semantically unrelated in the original dataset. We then elucidate the origins of this bias through both detailed theoretical analysis and empirical evidence. Our findings suggest a way to inspect these correlations through attacks, based on which we further introduce an instruction to prevent the predictor from learning them. Through experiments on six text classification datasets and two graph classification datasets using three network architectures (GRUs, BERT, and GCN), we show that our method significantly outperforms recent rationalization methods.
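For readers unfamiliar with this setup, the cooperative game described in the abstract can be sketched as follows. This is a minimal PyTorch illustration of the generic generator/predictor framework, not the authors' released code: the GRU encoders, Gumbel-Softmax mask sampling, and the sparsity/continuity regularizers are standard choices in this line of work rather than details taken from the paper (see the linked repository for the actual implementation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    """Scores each token and samples a binary rationale mask."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.scorer = nn.Linear(2 * hidden_dim, 2)  # per-token logits: {drop, keep}

    def forward(self, tokens):
        h, _ = self.gru(self.embed(tokens))
        logits = self.scorer(h)
        # Differentiable binary mask via straight-through Gumbel-Softmax.
        mask = F.gumbel_softmax(logits, tau=1.0, hard=True)[..., 1]
        return mask  # shape (batch, seq_len), entries in {0, 1}

class Predictor(nn.Module):
    """Classifies from the masked (rationale-only) input."""
    def __init__(self, vocab_size, num_classes, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, tokens, mask):
        x = self.embed(tokens) * mask.unsqueeze(-1)  # zero out non-rationale tokens
        h, _ = self.gru(x)
        return self.out(h.mean(dim=1))

def cooperative_loss(gen, pred, tokens, labels, sparsity=0.15,
                     lambda_s=1.0, lambda_c=1.0):
    """Joint objective: both players are rewarded for prediction accuracy."""
    mask = gen(tokens)
    logits = pred(tokens, mask)
    task = F.cross_entropy(logits, labels)
    sparse = (mask.mean() - sparsity).abs()             # keep roughly 15% of tokens
    contig = (mask[:, 1:] - mask[:, :-1]).abs().mean()  # prefer contiguous spans
    return task + lambda_s * sparse + lambda_c * contig
```

Because the mask is sampled by a trainable generator rather than drawn uniformly from the data, the selection pattern itself can come to carry label information; this is the sampling bias the paper identifies.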
Lay Summary: (1) In the field of interpretability, it is common to identify rationales by maximizing prediction accuracy, a practice that is prone to spurious correlations. (2) Existing research on spurious correlations has primarily focused on those arising from the original data generation process. However, we find that certain explanation algorithms can themselves introduce additional spurious correlations, even when the original dataset is clean. (3) We introduce an attack-based framework to audit and mitigate the spurious correlations introduced by these algorithms.
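As one deliberately simple illustration of auditing algorithm-introduced correlations (a hypothetical diagnostic, not the paper's actual attack procedure), one could train a probe on the binary mask alone, discarding token content, reusing the imports and `Generator` from the sketch above and assuming `seq_len` and `num_classes` are defined. If the probe predicts labels well, the selection pattern itself encodes the label, a correlation introduced by the algorithm rather than by the data.

```python
# Hypothetical audit for illustration only; the paper's attack differs.
probe = nn.Sequential(
    nn.Linear(seq_len, 64), nn.ReLU(), nn.Linear(64, num_classes)
)

def probe_loss(generator, tokens, labels):
    with torch.no_grad():
        mask = generator(tokens)  # freeze the generator; audit only its output
    # High probe accuracy means the mask pattern alone predicts the label.
    return F.cross_entropy(probe(mask), labels)
```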
Link To Code: https://github.com/jugechengzi/Rationalization-A2I
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: Unsupervised rationale extraction, selective rationalization, self-explaining, interpretability
Submission Number: 14655