Keywords: mechanistic interpretability, neuron evaluation, vision models, CLIP, activation selectivity, interpretability metrics
TL;DR: We propose InterpScore, a multi-dimensional framework that moves beyond activation selectivity to systematically evaluate neuron interpretability in vision models.
Abstract: A central challenge in mechanistic interpretability is evaluating whether individual neurons genuinely capture meaningful features. Existing work relies heavily on *activation selectivity*, but this single metric quickly saturates and fails to distinguish among units, leaving many interpretability claims anecdotal. We propose **InterpScore**, a reproducible four-axis framework that integrates **Selectivity**, **Causal impact**, **Robustness**, and **Human consistency** into a compact composite measure. Applied to 10 high-selectivity neurons from the penultimate layer of CLIP RN50x4 (Radford et al., 2021), InterpScore reveals meaningful variation across neurons (a coefficient of variation of about **14%**) where Selectivity alone shows none, demonstrating that multi-axis evaluation surfaces distinctions a single metric overlooks. The framework is numerically stable across seeds, and its axes capture complementary, independent aspects of neuron behavior. These results move neuron-level claims beyond anecdote toward an objective, systematic basis for assessing and comparing interpretability frameworks. Looking ahead, InterpScore offers a principled, reproducible protocol for neuron evaluation across diverse vision architectures.
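To make the composite concrete, here is a minimal sketch of how four axis scores might be aggregated into a single InterpScore per neuron. The abstract does not specify the aggregation rule, so this sketch assumes an equal-weight mean over min-max-normalized axis scores; the function names (`interp_score`, `min_max_normalize`) and the toy data are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Rescale one axis to [0, 1] across neurons; a constant axis maps to 0."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def interp_score(axis_scores: np.ndarray, weights: np.ndarray | None = None) -> np.ndarray:
    """Combine per-neuron axis scores into a composite.

    axis_scores: (n_neurons, 4) array with columns
    [selectivity, causal_impact, robustness, human_consistency].
    ASSUMPTION: equal weights by default; the paper may weight axes differently.
    """
    normalized = np.apply_along_axis(min_max_normalize, 0, axis_scores)
    if weights is None:
        weights = np.full(axis_scores.shape[1], 1 / axis_scores.shape[1])
    return normalized @ weights

# Toy example: 10 neurons with hypothetical axis scores in [0, 1].
rng = np.random.default_rng(0)
scores = rng.uniform(size=(10, 4))
composite = interp_score(scores)
cv = composite.std() / composite.mean()  # coefficient of variation, as reported in the abstract
print(composite.round(3), f"CV = {cv:.2%}")
```

Equal weighting keeps the composite interpretable and avoids privileging any one axis; under this assumption, the per-neuron coefficient of variation falls out directly from the composite vector, as in the final two lines above.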
Primary Area: interpretability and explainable AI
Submission Number: 25464