Contrast-CAT: Contrasting Activations for Enhanced Interpretability in Transformer-based Text Classifiers

Published: 07 May 2025, Last Modified: 13 Jun 2025 (UAI 2025 Poster, CC BY 4.0)
Keywords: Transformer, Interpretability, XAI, Attention, Contrast-based
Abstract: Transformers have profoundly influenced AI research, but explaining their decisions remains challenging -- even for relatively simple tasks such as classification -- which hinders trust and safe deployment in real-world applications. Although activation-based attribution methods effectively explain transformer-based text classification models, our findings reveal that these methods can be undermined by class-irrelevant features within activations, leading to less reliable interpretations. To address this limitation, we propose Contrast-CAT, a novel activation contrast-based attribution method that refines token-level attributions by filtering out class-irrelevant features. By contrasting the activations of an input sequence with reference activations, Contrast-CAT generates clearer and more faithful attribution maps. Experimental results across various datasets and models confirm that Contrast-CAT consistently outperforms state-of-the-art methods. Notably, under the MoRF setting, it achieves average improvements of $\times 1.30$ in AOPC and $\times 2.25$ in LOdds over the most competitive methods, demonstrating its effectiveness in enhancing interpretability for transformer-based text classification.
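To make the abstract's core idea concrete, below is a minimal sketch of what an activation contrast-based attribution and a MoRF-style faithfulness check might look like. This is not the authors' released code: the function names (`contrast_attributions`, `aopc`), the subtraction-then-gradient-weighting recipe, and the random stand-in tensors are all illustrative assumptions based only on the abstract's description of contrasting input activations against reference activations.

```python
import torch

def contrast_attributions(acts, ref_acts, grads):
    """Hypothetical sketch of contrast-based token attribution.

    Contrasts the input's activations against reference activations to
    suppress shared, class-irrelevant features, then weights the result by
    gradients (a common activation-attribution recipe, assumed here).

    acts, ref_acts, grads: (num_layers, seq_len, hidden) tensors.
    Returns a (seq_len,) attribution vector normalized to [0, 1].
    """
    contrasted = acts - ref_acts                              # remove class-irrelevant signal
    layer_scores = (contrasted * grads).sum(-1).clamp(min=0)  # per-layer, per-token relevance
    token_scores = layer_scores.mean(0)                       # aggregate across layers
    rng = token_scores.max() - token_scores.min()
    return (token_scores - token_scores.min()) / (rng + 1e-12)

def aopc(p_full, p_masked):
    """One common form of AOPC under MoRF: the mean drop in the predicted
    class probability as the k most relevant tokens are removed, k = 1..K
    (higher = more faithful attributions).

    p_full: scalar probability on the intact input.
    p_masked: (K,) probabilities after masking top-k tokens.
    """
    return (p_full - p_masked).mean()

# Toy usage with random tensors standing in for real transformer activations.
L, T, H = 12, 8, 64
torch.manual_seed(0)
scores = contrast_attributions(torch.randn(L, T, H),
                               torch.randn(L, T, H),
                               torch.randn(L, T, H))
print(scores)
```

In this reading, the references act as a foil: features that survive the subtraction are those specific to the input relative to the reference class, which is what the abstract credits for the clearer attribution maps.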
Submission Number: 790