Abstract: Language models have achieved remarkable performance gains across multiple natural language processing and understanding tasks, and they have been shown to capture many high-level aspects of natural human language. However, the complexity of these models and their black-box nature make it difficult to understand their behavior from fine-grained explanations. In this paper, we present high-level, concept-based explanations for neural language models in a classification task setup using Quantitative Testing with Concept Activation Vectors (TCAV). TCAV explains a neural model based on its activations in response to concepts present in the data. We propose a pipeline that automates the discovery of these concepts by clustering the model's activations. The pipeline was tested on one architecture (BERT) but can be applied to other neural architectures. We perform ablation and injection studies to evaluate the causality and importance of the provided explanations with respect to the model's predictions. The ablation studies show a 2% reduction in the model's sensitivity, while injection shows up to a 13% reduction in specificity attributed to the top-scoring concepts. This illustrates the potential of concept-based explanations for verifying a model's alignment with human values and ethics by examining the concepts and how they contribute to the model's predictions.
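To make the pipeline concrete, the following is a minimal sketch of the two ideas the abstract describes: discovering candidate concepts by agglomerative clustering of a layer's activations, and scoring one concept TCAV-style with a linear concept activation vector (CAV). It is not the authors' implementation; the activations and gradients are synthetic placeholders, and the cluster count, layer width, and scoring details are illustrative assumptions. In the paper's setting they would come from a BERT layer and its classification head.

```python
# Sketch of concept discovery via clustering plus a TCAV-style concept score.
# Synthetic data stands in for BERT activations and prediction gradients.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder for layer activations of N examples (e.g., pooled BERT outputs).
activations = rng.normal(size=(200, 768))

# Step 1: automatic concept discovery -- cluster the activations and treat
# each cluster as a candidate concept (cluster count is an assumption).
clusters = AgglomerativeClustering(n_clusters=5).fit_predict(activations)

# Step 2: learn a CAV -- a linear boundary separating one concept's
# activations from random counterexamples; the classifier's weight vector
# points in the concept's direction in activation space.
concept_acts = activations[clusters == 0]
random_acts = rng.normal(size=(len(concept_acts), 768))
X = np.vstack([concept_acts, random_acts])
y = np.array([1] * len(concept_acts) + [0] * len(random_acts))
cav = LogisticRegression(max_iter=1000).fit(X, y).coef_[0]

# Step 3: TCAV score -- fraction of examples whose prediction gradient at
# this layer has a positive directional derivative along the CAV.
# Gradients are placeholders; in practice they come from the model's head.
grads = rng.normal(size=(200, 768))
tcav_score = float(np.mean(grads @ cav > 0))
print(f"TCAV score for concept 0: {tcav_score:.2f}")
```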
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability, Neural Language Models, Agglomerative Clustering
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4132