Evaluating Neuron Explanations: A Unified Framework with Sanity Checks

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We unify a diverse set of neuron explanation evaluations under one mathematical framework, and find many existing methods are significantly flawed and do not pass simple sanity checks.
Abstract: Understanding the function of individual units in a neural network is an important building block for mechanistic interpretability. This is often done by generating a simple text explanation of the behavior of individual neurons or units. For these explanations to be useful, we must understand how reliable and truthful they are. In this work, we unify many existing explanation evaluation methods under one mathematical framework. This allows us to compare and contrast existing evaluation metrics, understand the evaluation pipeline with increased clarity, and apply existing statistical concepts to the evaluation. In addition, we propose two simple sanity checks on the evaluation metrics and show that many commonly used metrics fail these tests, with their scores remaining unchanged even after massive changes to the concept labels. Based on our experimental and theoretical results, we propose guidelines that future evaluations should follow and identify a set of reliable evaluation metrics.
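To make the idea of a sanity check concrete, here is a minimal sketch (not taken from the paper or its repository; all function names and the toy correlation metric are illustrative assumptions): a reliable evaluation metric should change substantially when a large fraction of the concept labels is corrupted, whereas a flawed metric may return nearly the same score.

```python
# Hypothetical illustration of a "corrupted-label" sanity check for a
# neuron-explanation evaluation metric. Not the paper's code: the metric,
# function names, and synthetic data are assumptions for demonstration only.
import numpy as np

def correlation_metric(activations: np.ndarray, concept_labels: np.ndarray) -> float:
    """Toy evaluation metric: Pearson correlation between a neuron's activations
    and binary concept labels over a probing dataset."""
    return float(np.corrcoef(activations, concept_labels)[0, 1])

def corrupted_label_check(metric, activations, concept_labels, flip_fraction=0.5, seed=0):
    """Return the metric's score before and after randomly flipping a large
    fraction of the binary concept labels. A metric that fails this sanity
    check barely changes its score under the corruption."""
    rng = np.random.default_rng(seed)
    corrupted = concept_labels.copy()
    idx = rng.choice(len(corrupted), size=int(flip_fraction * len(corrupted)), replace=False)
    corrupted[idx] = 1 - corrupted[idx]  # flip the selected labels
    return metric(activations, concept_labels), metric(activations, corrupted)

# Synthetic example: a neuron whose activations track the concept.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
acts = labels + 0.3 * rng.standard_normal(1000)
before, after = corrupted_label_check(correlation_metric, acts, labels)
print(f"score on true labels: {before:.3f}, score on corrupted labels: {after:.3f}")
```

In this sketch the correlation-based metric passes the check (its score drops sharply on corrupted labels); the paper's finding is that several commonly used metrics do not.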
Lay Summary: Current neural networks can achieve impressive results, but we don't really understand what happens inside them. Understanding the function of individual units in a neural network can help shed light on this and bring us closer to safer and more trustworthy models. Individual neurons are often explained by generating a simple text description of their behavior. For these explanations to be useful, we must understand how reliable and truthful they are. In this work, we unify many existing explanation evaluation methods under one mathematical framework. This allows us to compare and contrast existing evaluation metrics, understand the evaluation pipeline with increased clarity, and apply existing statistical concepts to the evaluation. In addition, we propose two simple sanity checks on the evaluation metrics and show that many commonly used metrics fail these tests, with their scores remaining unchanged even after massive changes to the concept labels. Based on our results, we propose new guidelines for how neuron explanations should be evaluated.
Link To Code: https://github.com/Trustworthy-ML-Lab/Neuron_Eval
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: Interpretability, Mechanistic Interpretability, Interpretability Evaluation
Submission Number: 4065