Post-hoc interpretability of neural network models, including Large Language Models (LLMs), often aims for mechanistic interpretations: detailed, causal descriptions of model behavior. However, human interpreters may lack the capacity or willingness to formulate intricate mechanistic models, let alone evaluate them. This paper addresses this challenge by introducing a taxonomy that dissects the overarching goal of mechanistic interpretability into constituent claims, each requiring distinct evaluation methods. By doing so, we transform these evaluation criteria into actionable learning objectives, providing a data-driven pathway to interpretability. This framework enables a methodologically rigorous yet pragmatic approach to evaluating the strengths and limitations of various interpretability tools.