Post-hoc Self-explanation of CNNs

Published: 01 Mar 2026, Last Modified: 01 Mar 2026 · UCRL@ICLR2026 Poster · CC BY 4.0
Keywords: Explainable AI, mechanistic interpretability, clustering
TL;DR: A CNN is also an SEM that just needs better prototypes.
Abstract: Although standard Convolutional Neural Networks (CNNs) can be mathematically reinterpreted as Self-Explainable Models (SEMs), their built-in prototypes fail to accurately represent the data. This limitation is addressed by introducing a post-hoc framework that substitutes these internal weights with prototypes learned by $k$-means on feature activations. By utilizing shallower, and thus less compressed, feature activations, the proposed method provides detailed explanations of predictions in the form of a segmentation map supported by gradient-free attribution maps. The results demonstrate a trade-off in the method: employing deep feature activations (B4) enables the model to maintain its original accuracy, whereas incorporating earlier layers (B234) yields sharper and more interpretable explanation maps, at a slight cost to predictive performance.
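The core step described in the abstract, clustering feature activations and using the centroids as prototypes, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `activations` array stands in for CNN feature vectors (one per spatial location), and the function and variable names are hypothetical.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means over feature activations X of shape (n_samples, d)."""
    rng = np.random.default_rng(seed)
    # initialize centroids with k distinct activation vectors
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each activation vector to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned vectors
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Hypothetical stand-in for shallow-layer activations: 500 spatial
# positions, each a 64-channel feature vector.
activations = np.random.default_rng(1).normal(size=(500, 64))
prototypes, assignments = kmeans(activations, k=8)

# `prototypes` would replace the network's internal weights as the SEM
# prototypes; `assignments`, reshaped back to the spatial grid, gives a
# segmentation-style explanation map.
print(prototypes.shape)  # (8, 64)
```

In this sketch, attribution is gradient-free in the sense that explanations come from nearest-prototype assignments (distances in activation space) rather than from backpropagated gradients.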
Submission Number: 46