Keywords: NLP explainability, concept-based explanations, causality
TL;DR: A framework that derives causal and concept-based explanations for complex NLP models
Abstract: The emergence of large-scale pretrained language models has made it unprecedentedly challenging to explain why a model makes a particular prediction. Stemming from the compositional nature of language, spurious correlations further undermine the trustworthiness of NLP systems, so there is an urgent need for causal explanations that promote fairness and transparency. To derive more causal, usable, and faithful explanations, we propose a complete framework for interpreting language models through causal concepts. Specifically, we introduce a post-hoc method that derives both high-level concepts and surface-level local explanations from hidden-layer activations. To ensure causality, we optimize a causal loss that maximizes the Average Treatment Effect (ATE), intervening at the concept level as a substitute for traditional counterfactual interventions. Moreover, we devise several causality evaluation metrics for explanations that can be applied universally. Extensive experiments on real and synthetic tasks demonstrate that our method achieves superior causality, usability, and faithfulness compared to the baselines. Our codebase is available at \url{https://anonymous.4open.science/r/CausalConcept}.
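The abstract mentions maximizing an Average Treatment Effect (ATE) via concept-level interventions on hidden activations rather than counterfactual text edits. The sketch below is not the authors' code; it only illustrates, with hypothetical names (`concept_vector`, `predict`, `intervene`) and toy data, how such an intervention-based ATE estimate could look on a linear probe.

```python
# Hedged sketch (assumed, not the paper's implementation): estimate a
# concept-level ATE by setting the activation's component along a concept
# direction to a fixed value, instead of editing the input text.
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: hidden-layer activations, a unit-norm concept direction,
# and a linear classifier head.
hidden = rng.normal(size=(32, 16))            # 32 examples, 16-dim activations
concept_vector = rng.normal(size=16)
concept_vector /= np.linalg.norm(concept_vector)
head_w = rng.normal(size=16)

def predict(h):
    """Positive-class probability from a linear head with a sigmoid."""
    return 1.0 / (1.0 + np.exp(-h @ head_w))

def intervene(h, concept, strength):
    """do(concept = strength): overwrite each example's component along the
    concept direction with a fixed value `strength`."""
    coeff = h @ concept
    return h + np.outer(strength - coeff, concept)

# ATE of activating vs. ablating the concept, averaged over examples.
ate = np.mean(predict(intervene(hidden, concept_vector, strength=1.0))
              - predict(intervene(hidden, concept_vector, strength=0.0)))
print(f"Estimated concept-level ATE: {ate:.3f}")
```

In the paper's framework this quantity would enter a causal loss to be maximized during concept discovery; here it is only computed once for illustration.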
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Social Aspects of Machine Learning (eg, AI safety, fairness, privacy, interpretability, human-AI interaction, ethics)