Keywords: mechanistic interpretability, neuron evaluation, vision models, CLIP, activation selectivity, interpretability metrics
TL;DR: We propose InterpScore, a multi-dimensional framework that moves beyond activation selectivity to systematically evaluate neuron interpretability in vision models.
Abstract: A central challenge in mechanistic interpretability is evaluating whether individual neurons genuinely capture meaningful features. Existing work relies heavily on *activation selectivity*, but this single metric quickly saturates and fails to distinguish among units, leaving many interpretability claims anecdotal. We propose **InterpScore**, a reproducible four-axis framework that integrates **Selectivity**, **Causal impact**, **Robustness**, and **Human consistency** into a compact composite measure. Applied to 10 high-selectivity neurons from the penultimate layer of CLIP RN50x4 (Radford et al., 2021), InterpScore reveals meaningful variation across neurons (a coefficient of variation of about **14%**) where Selectivity alone shows none, demonstrating that multi-axis evaluation surfaces distinctions a single metric overlooks. The framework is numerically stable across seeds, and its axes capture complementary, independent aspects of neuron behavior. These results move neuron-level claims beyond anecdote toward an objective, systematic basis for assessing and comparing interpretability frameworks. Looking ahead, InterpScore offers a principled, reproducible protocol for neuron evaluation across diverse vision architectures.
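To make the composite concrete, here is a minimal sketch of how four axis scores might be aggregated into a single InterpScore per neuron. The abstract does not specify the aggregation rule, so this sketch assumes an equal-weight mean over min-max-normalized axis scores; the function names (`interp_score`, `min_max_normalize`) and the toy data are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Rescale one axis to [0, 1] across neurons; a constant axis maps to 0."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def interp_score(axis_scores: np.ndarray, weights: np.ndarray | None = None) -> np.ndarray:
    """Combine per-neuron axis scores into a composite.

    axis_scores: (n_neurons, 4) array with columns
    [selectivity, causal_impact, robustness, human_consistency].
    ASSUMPTION: equal weights by default; the paper may weight axes differently.
    """
    normalized = np.apply_along_axis(min_max_normalize, 0, axis_scores)
    if weights is None:
        weights = np.full(axis_scores.shape[1], 1 / axis_scores.shape[1])
    return normalized @ weights

# Toy example: 10 neurons with hypothetical axis scores in [0, 1].
rng = np.random.default_rng(0)
scores = rng.uniform(size=(10, 4))
composite = interp_score(scores)
cv = composite.std() / composite.mean()  # coefficient of variation, as reported in the abstract
print(composite.round(3), f"CV = {cv:.2%}")
```

Equal weighting keeps the composite interpretable and avoids privileging any one axis; under this assumption, the per-neuron coefficient of variation falls out directly from the composite vector, as in the final two lines above.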
Primary Area: interpretability and explainable AI
Submission Number: 25464