Interpreting Pretrained Language Models via Concept Bottlenecks

Published: 15 Jun 2025, Last Modified: 07 Aug 2025 · AIA 2025 · CC BY 4.0
Keywords: Concept Bottleneck Models, Language Models, Explanations
Abstract: Pretrained language models (PLMs) achieve state-of-the-art results but often function as ``black boxes'', hindering interpretability and responsible deployment. While methods such as attention analysis exist, their explanations often lack clarity and intuitiveness. We propose interpreting PLMs through high-level, human-understandable concepts using Concept Bottleneck Models (CBMs). This extended abstract introduces $C^{3}M$ (\underline{C}hatGPT-guided \underline{C}oncept augmentation with \underline{C}oncept-level \underline{M}ixUp), a novel framework for training Concept-Bottleneck-Enabled PLMs (CBE-PLMs). $C^{3}M$ leverages Large Language Models (LLMs) such as ChatGPT to augment concept sets and generate noisy concept labels, combined with a concept-level MixUp mechanism that enhances robustness and enables effective learning from both human-annotated and machine-generated concepts. Empirical results show that our approach provides intuitive explanations, aids model diagnosis via test-time intervention, and improves the interpretability-utility trade-off, even with limited or noisy concept annotations. Code and data are released at \url{https://github.com/Zhen-Tan-dmml/CBM\_NLP.git}
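To make the abstract's two core ideas concrete, the sketch below shows (in PyTorch) what a concept-bottleneck head on top of a PLM embedding and a concept-level MixUp step could look like. This is a minimal illustration under assumed names and shapes (ConceptBottleneckHead, concept_mixup, a 768-dimensional pooled encoder output), not the authors' implementation; the actual $C^{3}M$ pipeline is in the linked repository.

```python
# Minimal sketch: a concept-bottleneck head over a pooled PLM embedding,
# plus a concept-level MixUp between a human-annotated and an LLM-annotated
# example. All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn


class ConceptBottleneckHead(nn.Module):
    """Maps PLM embeddings to concept scores, then concepts to task labels."""

    def __init__(self, hidden_dim: int, num_concepts: int, num_classes: int):
        super().__init__()
        self.concept_predictor = nn.Linear(hidden_dim, num_concepts)
        self.label_predictor = nn.Linear(num_concepts, num_classes)

    def forward(self, pooled: torch.Tensor):
        concept_logits = self.concept_predictor(pooled)   # the bottleneck layer
        label_logits = self.label_predictor(torch.sigmoid(concept_logits))
        return concept_logits, label_logits


def concept_mixup(emb_a, concepts_a, emb_b, concepts_b, alpha: float = 0.2):
    """Interpolate embeddings and concept labels of two examples
    (e.g., one human-annotated, one LLM-annotated) to soften label noise."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    mixed_emb = lam * emb_a + (1 - lam) * emb_b
    mixed_concepts = lam * concepts_a + (1 - lam) * concepts_b
    return mixed_emb, mixed_concepts


if __name__ == "__main__":
    head = ConceptBottleneckHead(hidden_dim=768, num_concepts=8, num_classes=2)
    emb_human, emb_llm = torch.randn(4, 768), torch.randn(4, 768)
    c_human, c_llm = torch.rand(4, 8), torch.rand(4, 8)
    mixed_emb, mixed_c = concept_mixup(emb_human, c_human, emb_llm, c_llm)
    concept_logits, label_logits = head(mixed_emb)
    print(concept_logits.shape, label_logits.shape)  # (4, 8) (4, 2)
```

Because predictions flow through the concept layer, test-time intervention amounts to overwriting individual concept scores before the label predictor runs, which is how the abstract's "model diagnosis via test-time intervention" can be read.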
Paper Type: Previously Published Paper
Venue For Previously Published Paper: IJCAI Sister track
Submission Number: 7