Abstract: Language models have achieved remarkable performance gains across multiple natural language processing and understanding tasks, and they have been shown to capture many high-level aspects of natural human language. However, the complexity of these models and their black-box nature make it difficult to understand their behavior from fine-grained explanations. In this paper, we present high-level, concept-based explanations for neural language models in a classification task setup using Quantitative Testing with Concept Activation Vectors (TCAV). TCAV explains a neural model based on its activations in response to concepts present in the data. We propose a pipeline that automates the discovery of these concepts by clustering the model's activations. The pipeline was tested on one architecture (BERT) but can be applied to other neural architectures. We perform ablation and injection studies to evaluate the causality and importance of the provided explanations with respect to the model's predictions. The ablation studies show a 2% reduction in the model's sensitivity, while injection shows up to a 13% reduction in specificity attributed to the top-scoring concepts. This illustrates the potential of concept-based explanations for verifying a model's alignment with human values and ethics by examining the concepts and how they contribute to the model's predictions.
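To make the pipeline concrete, the following is a minimal sketch of the two ideas the abstract describes: discovering candidate concepts by agglomerative clustering of a layer's activations, and scoring one concept TCAV-style with a linear concept activation vector (CAV). It is not the authors' implementation; the activations and gradients are synthetic placeholders, and the cluster count, layer width, and scoring details are illustrative assumptions. In the paper's setting they would come from a BERT layer and its classification head.

```python
# Sketch of concept discovery via clustering plus a TCAV-style concept score.
# Synthetic data stands in for BERT activations and prediction gradients.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder for layer activations of N examples (e.g., pooled BERT outputs).
activations = rng.normal(size=(200, 768))

# Step 1: automatic concept discovery -- cluster the activations and treat
# each cluster as a candidate concept (cluster count is an assumption).
clusters = AgglomerativeClustering(n_clusters=5).fit_predict(activations)

# Step 2: learn a CAV -- a linear boundary separating one concept's
# activations from random counterexamples; the classifier's weight vector
# points in the concept's direction in activation space.
concept_acts = activations[clusters == 0]
random_acts = rng.normal(size=(len(concept_acts), 768))
X = np.vstack([concept_acts, random_acts])
y = np.array([1] * len(concept_acts) + [0] * len(random_acts))
cav = LogisticRegression(max_iter=1000).fit(X, y).coef_[0]

# Step 3: TCAV score -- fraction of examples whose prediction gradient at
# this layer has a positive directional derivative along the CAV.
# Gradients are placeholders; in practice they come from the model's head.
grads = rng.normal(size=(200, 768))
tcav_score = float(np.mean(grads @ cav > 0))
print(f"TCAV score for concept 0: {tcav_score:.2f}")
```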
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability, Neural Language Models, Agglomerative Clustering
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4132