Keywords: Foundational work
Other Keywords: Neuron identification
TL;DR: We provide a theoretical framework for neuron identification with generalization bounds for faithfulness and bootstrap ensembles for stability, enabling principled and trustworthy neuron explanations.
Abstract: Neuron identification is a popular tool in mechanistic interpretability, aiming to uncover the human-interpretable concepts
represented by individual neurons in deep networks. While algorithms such as Network Dissection and CLIP-Dissect achieve great empirical success,
a rigorous theoretical foundation remains lacking. In this work, we formalize neuron identification as the *reverse process of learning*, which allows us to import tools from statistical learning theory. From this perspective, we present the first theoretical analysis of two fundamental challenges: (1) **Faithfulness:** whether the identified concept truly represents the neuron, and (2) **Stability:** whether the results are consistent across probing datasets.
We derive generalization bounds for widely used similarity metrics (e.g. accuracy, AUROC, IoU) to guarantee faithfulness, and propose a bootstrap ensemble procedure that quantifies stability and provides probabilistic guarantees via prediction sets. Experiments on both synthetic and real data validate our theoretical results and demonstrate the practicality of our method, providing a step toward trustworthy neuron identification.
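The bootstrap ensemble procedure described in the abstract could be sketched roughly as follows. This is a hypothetical illustration, not the authors' implementation: the `iou` scoring function, the threshold, and the prediction-set construction (keep the most frequently selected concepts until their cumulative selection frequency reaches 1 − α) are all illustrative assumptions.

```python
import numpy as np

def iou(acts, concept_mask, threshold=0.5):
    """IoU between a neuron's binarized activations and a binary concept mask.
    (Illustrative metric; the paper also covers accuracy and AUROC.)"""
    a = acts > threshold
    inter = np.logical_and(a, concept_mask).sum()
    union = np.logical_or(a, concept_mask).sum()
    return inter / union if union > 0 else 0.0

def bootstrap_prediction_set(acts, concept_masks, n_boot=1000, alpha=0.1, seed=0):
    """Resample the probing set with replacement, record which concept wins
    each round, and return the smallest set of concepts whose cumulative
    selection frequency is at least 1 - alpha (a simple prediction set)."""
    rng = np.random.default_rng(seed)
    n = len(acts)
    votes = np.zeros(len(concept_masks), dtype=int)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)              # bootstrap resample
        scores = [iou(acts[idx], m[idx]) for m in concept_masks]
        votes[int(np.argmax(scores))] += 1            # winning concept this round
    order = np.argsort(votes)[::-1]                   # concepts by frequency
    freq = votes[order] / n_boot
    k = int(np.searchsorted(np.cumsum(freq), 1 - alpha) + 1)
    return order[:k].tolist(), freq[:k].tolist()
```

A concept that wins nearly every bootstrap round yields a singleton prediction set (a stable identification), while an unstable neuron produces a larger set, making the ambiguity explicit rather than hidden.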
Submission Number: 38