TL;DR: A novel approach to constructing hierarchical explanations for text classification by detecting feature interactions.
Abstract: The interpretability of neural networks has become crucial for their real-world applications, where reliability and trustworthiness matter. Existing explanation generation methods usually identify important features by scoring their individual contributions to the model prediction and ignore the interactions between features, which ultimately yields a bag-of-words representation as the explanation. In natural language processing, this type of explanation makes it challenging for human users to understand the meaning of the explanation and to draw the connection between the explanation and the model prediction, especially for long texts. In this work, we focus on detecting the interactions between features and propose a novel approach to build a hierarchy of explanations based on feature interactions. The proposed method is evaluated with three neural classifiers, LSTM, CNN, and BERT, on two benchmark text classification datasets. The generated explanations are assessed by both automatic evaluation metrics and human evaluators. Experiments show the effectiveness of the proposed method in providing explanations that are both faithful to the models and understandable to humans.
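The abstract does not spell out the algorithm, so the sketch below is only a rough illustration of the general idea: building a hierarchy over a sentence by greedily merging adjacent spans according to an interaction score. The `interaction_score` callable and the toy scoring function are hypothetical placeholders, not the paper's method; in practice the score would be estimated from a trained classifier (e.g., via prediction differences under occlusion).

```python
# Minimal illustrative sketch (not the paper's exact algorithm): build a
# hierarchy of explanations by repeatedly merging the adjacent span pair
# with the strongest interaction score.

from typing import Callable, List, Tuple

Span = Tuple[int, int]  # inclusive word-index range [start, end]


def build_hierarchy(
    words: List[str],
    interaction_score: Callable[[List[str], Span, Span], float],
) -> List[Tuple[Span, Span, float]]:
    """Greedily merge adjacent spans; the merge history defines a hierarchy."""
    spans: List[Span] = [(i, i) for i in range(len(words))]
    history: List[Tuple[Span, Span, float]] = []

    while len(spans) > 1:
        # Score every adjacent pair and pick the strongest interaction.
        best_idx, best_score = 0, float("-inf")
        for i in range(len(spans) - 1):
            s = interaction_score(words, spans[i], spans[i + 1])
            if s > best_score:
                best_idx, best_score = i, s

        left, right = spans[best_idx], spans[best_idx + 1]
        history.append((left, right, best_score))
        # Merge the chosen pair into a single span covering both.
        spans[best_idx:best_idx + 2] = [(left[0], right[1])]

    return history


if __name__ == "__main__":
    # Toy placeholder score: prefer merging spans of similar length.
    # A real implementation would query the classifier instead.
    def toy_score(words: List[str], a: Span, b: Span) -> float:
        return -abs((a[1] - a[0]) - (b[1] - b[0]))

    sentence = "the movie was not very good".split()
    for left, right, score in build_hierarchy(sentence, toy_score):
        print(left, right, score)
```

Each merge step records which two spans were combined and how strongly they interacted, so the sequence of merges can be read bottom-up as a hierarchical explanation from individual words to the full text.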
Keywords: Hierarchical Interpretations, Natural Language Processing, Feature Interaction