Keywords: Transformer, Interpretability, XAI, Attention, Contrast-based
Abstract: Transformers have revolutionized AI research, particularly in natural language processing (NLP). However, understanding the decisions made by transformer-based models remains challenging, which impedes trust and safe deployment in real-world applications. Activation-based attribution methods have proven effective at explaining transformer-based text classification models, but our findings suggest that class-irrelevant features within activations can degrade the quality of their interpretations. To address this issue, we introduce Contrast-CAT, a novel activation contrast-based attribution method that improves token-level attribution by filtering out class-irrelevant features from activations. By contrasting the activations of an input sequence with reference activations, Contrast-CAT generates clearer and more faithful attribution maps. In our experiments, Contrast-CAT consistently outperforms state-of-the-art methods across various datasets and models: under the MoRF setting, it improves on the second-best methods by average factors of $\times 1.30$ in AOPC and $\times 2.25$ in LOdds. Contrast-CAT is a promising step toward more interpretable and transparent transformer-based models.
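The abstract reports AOPC and LOdds under the MoRF (most relevant first) setting without defining them. The sketch below illustrates one common way such metrics are computed, and is not necessarily the exact protocol used in the paper. It assumes that tokens are deleted (rather than masked) in decreasing order of attribution score, that AOPC is the mean drop in the target-class probability, and that LOdds is the mean log ratio of perturbed to original probability. The names `morf_metrics` and `toy_prob` and the removal fractions are hypothetical, and `toy_prob` is a stand-in for a real classifier.

```python
import math

def morf_metrics(tokens, scores, prob_fn, fractions=(0.2, 0.4, 0.6)):
    """Return (AOPC, LOdds) when the highest-scored tokens are removed first.

    Assumed definitions (not taken from the paper):
      AOPC  = mean over k of  p(y | x) - p(y | x with top-k tokens removed)
      LOdds = mean over k of  log(p(y | x with top-k tokens removed) / p(y | x))
    """
    # MoRF ordering: token indices sorted by attribution score, highest first.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    p_full = prob_fn(tokens)
    drops, log_odds = [], []
    for frac in fractions:
        k = max(1, round(frac * len(tokens)))
        removed = set(order[:k])
        kept = [t for i, t in enumerate(tokens) if i not in removed]
        p_k = prob_fn(kept)
        drops.append(p_full - p_k)
        log_odds.append(math.log(p_k / p_full))
    return sum(drops) / len(drops), sum(log_odds) / len(log_odds)

def toy_prob(tokens):
    """Hypothetical classifier: probability of the 'positive' class."""
    positive = {"great", "love"}
    hits = sum(t in positive for t in tokens)
    return 1.0 / (1.0 + math.exp(-(2.0 * hits - 1.0)))

tokens = ["i", "love", "this", "great", "movie"]
good_scores = [0.05, 0.90, 0.02, 0.80, 0.10]   # ranks the decisive tokens first
bad_scores = [-s for s in good_scores]          # ranks the decisive tokens last

aopc_good, lodds_good = morf_metrics(tokens, good_scores, toy_prob)
aopc_bad, lodds_bad = morf_metrics(tokens, bad_scores, toy_prob)
print(aopc_good, lodds_good)
print(aopc_bad, lodds_bad)
```

Under these assumed definitions, a more faithful attribution map yields a higher AOPC and a lower (more negative) LOdds in the MoRF setting, because removing the tokens it ranks highest changes the model's prediction the most.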
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 14172