Keywords: concepts, interpretability, concept bottleneck models, model editing
TL;DR: We present a method to turn any neural network into a concept bottleneck model without sacrificing model performance, retaining interpretability benefits and enabling easy model editing.
Abstract: Concept Bottleneck Models (CBMs) map inputs onto a concept bottleneck and use the bottleneck to make a prediction. A concept bottleneck enhances interpretability, since it can be investigated to understand which concepts the model sees in an input and which of these concepts it deems important. However, CBMs are restrictive in practice, as they require concept labels during training to learn the bottleneck. Additionally, it is questionable whether CBMs can match the accuracy of an unrestricted neural network trained on a given domain, potentially reducing the incentive to deploy them in practice. In this work, we address these two key limitations by introducing Post-hoc Concept Bottleneck models (P-CBMs). We show that we can turn any neural network into a P-CBM without sacrificing model performance while retaining its interpretability benefits. Finally, we show that P-CBMs can provide significant performance gains through model editing, without any fine-tuning or data from the target domain.
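To make the idea concrete, here is a minimal sketch of a post-hoc concept bottleneck: embeddings from a frozen, pretrained backbone are projected onto a set of concept directions, and an interpretable linear head predicts from the resulting concept scores. All names, dimensions, and the use of random data here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a frozen backbone maps inputs to d-dimensional embeddings.
d, n_concepts, n_classes, n_samples = 16, 5, 3, 4
embeddings = rng.normal(size=(n_samples, d))        # f(x) from any pretrained network
concept_vectors = rng.normal(size=(n_concepts, d))  # one direction per concept

# Post-hoc bottleneck: project each embedding onto the concept directions.
concept_scores = embeddings @ concept_vectors.T     # shape (n_samples, n_concepts)

# Interpretable head: a linear classifier over concept scores, so each
# prediction decomposes into per-concept contributions (scores * weights).
W = rng.normal(size=(n_concepts, n_classes))
logits = concept_scores @ W
predictions = logits.argmax(axis=1)
```

Because the head is linear over concept scores, "model editing" can be as simple as zeroing a concept's weight row to remove its influence, which is one reason no fine-tuning or target-domain data is needed in this sketch.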
Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 1 code implementation](https://www.catalyzex.com/paper/arxiv:2205.15480/code)