Neuron-based explanations of neural networks sacrifice completeness and interpretability

TMLR Paper3015 Authors

17 Jul 2024 (modified: 21 Oct 2024) · Under review for TMLR · CC BY 4.0
Abstract: High-quality explanations of neural networks (NNs) should exhibit two key properties: completeness ensures that they accurately reflect a network's function, and interpretability makes them understandable to humans. Many existing methods provide explanations of individual neurons within a network. In this work we provide evidence that, for AlexNet pretrained on ImageNet, neuron-based explanation methods sacrifice both completeness and interpretability compared to activation principal components (PCs). Neurons are a poor basis for AlexNet embeddings because they do not account for the distributed nature of these representations. By examining two quantitative measures of completeness and conducting a user study to measure interpretability, we show that the most important principal components provide more complete and interpretable explanations than the most important neurons. Much of the activation variance may be explained by examining relatively few high-variance PCs, as opposed to studying every neuron. These principal components also strongly affect network function and are significantly more interpretable than neurons. Our findings suggest that explanation methods for networks like AlexNet should avoid using neurons as a basis for embeddings and instead choose a basis, such as principal components, that accounts for the high-dimensional and distributed nature of a network's internal representations.
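As a concrete illustration of the kind of analysis the abstract describes, the sketch below collects activations from one AlexNet layer, fits PCA, and reports how many principal components are needed to reach 99% of the activation variance. This is not the authors' code: the layer choice (the first fully connected layer), the 99% threshold, and the `loader` object (assumed to yield ImageNet-preprocessed batches) are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): PCA over activations of a pretrained
# AlexNet layer, then count how many high-variance PCs explain 99% of variance.
import numpy as np
import torch
import torchvision.models as models
from sklearn.decomposition import PCA

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

activations = []

def hook(module, inputs, output):
    # Flatten any spatial dimensions so each row is one activation vector.
    activations.append(output.detach().flatten(start_dim=1).cpu().numpy())

# Illustrative choice: the first fully connected layer of the classifier (fc1).
handle = model.classifier[1].register_forward_hook(hook)

# `loader` is assumed to yield batches of ImageNet-preprocessed images.
with torch.no_grad():
    for images, _ in loader:
        model(images)
handle.remove()

A = np.concatenate(activations, axis=0)           # shape: (num_samples, d)
pca = PCA().fit(A)                                # PCs ordered by explained variance
cum_var = np.cumsum(pca.explained_variance_ratio_)
k99 = int(np.searchsorted(cum_var, 0.99)) + 1     # PCs needed for 99% of variance
print(f"{k99} of {A.shape[1]} directions explain 99% of activation variance")
```

If relatively few components suffice to reach the threshold, that mirrors the abstract's claim that a small set of high-variance PCs can summarize the layer's activations far more compactly than examining every neuron.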
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: To address reviewer feedback and improve the submission's clarity and quality, we made the following changes:
- Updated Section 3.3 to better explain Karpathy's CNN-Codes method and how we extend it in this work.
- Updated Section 4.1.1 with the formula for the cumulative explained variance ratio and the observation that conv4 and conv5 activations appear higher rank than those of other layers (a generic form of this ratio is sketched after this list).
- Figure 4: added 99% explained variance annotations and more specific y-axis labels.
- Added a limitations section before the conclusion.
- Added cumulative explained variance ratio plots for ImageNet-pretrained ResNet-18 and ResNet-50 in Appendix F.
- Moved the neuron visualizations for conv1, conv2, and fc1 to the main body so they are easier to compare with the PC visualizations there.
- Revised the neuron visualizations to show the top 5 highest-variance neurons rather than the first 6 neurons by index.
- Throughout the appendix, paired the PC and neuron visualizations for each layer to facilitate easier comparisons between neurons and PCs.
- Updated Section 4.1.2 with an explanation of the conv4 and conv5 results.
- Corrected Equation 2 to reference eigenvalues rather than singular values.
- Updated the abstract to limit claims to AlexNet.
- Updated the heading of Section 3.4 to "Interpreting visualizations" rather than "Interpret visualization" for consistency with other headings.
- "Activation variance" was used repeatedly but not explained until page 8; updated Section 3.2 with a brief definition to improve clarity: "PCA finds orthogonal basis vectors for $\mathbb{R}^d$ activation space, ordered by explained variance ($\Var(\mathbf{A}'_i)$ for the $i$th basis vector)."
- Expanded the limitations paragraph on cases where neuron-based explanation might be useful: "In principle, a neuron basis could be more interpretable than a top PC basis. Such an example could be constructed with an auxiliary loss that explicitly aligns a neuron with a clear concept. However, standard network training does not particularly encourage this, and the probability of this happening on its own just seems to be low."
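For reference, a standard form of the cumulative explained variance ratio mentioned above, written in terms of the eigenvalues of the activation covariance matrix; the notation here is illustrative and not copied from the revised paper.

```latex
% Hedged sketch: cumulative explained variance ratio for the top-k PCs, with
% \lambda_1 \ge \dots \ge \lambda_d \ge 0 the eigenvalues of the d x d activation
% covariance matrix (notation assumed, not taken from the paper's Equation 2).
\[
  \mathrm{CEVR}(k) \;=\; \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i},
  \qquad 1 \le k \le d .
\]
```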
Assigned Action Editor: ~Quanshi_Zhang1
Submission Number: 3015