What's your Use Case? A Taxonomy of Causal Evaluations of Post-hoc Interpretability

Published: 27 Oct 2023, Last Modified: 05 Dec 2023 · CRL@NeurIPS 2023 Poster
Keywords: post-hoc interpretability, faithfulness, evals, causality
TL;DR: This paper introduces a structured taxonomy based on the causal hierarchy to rigorously evaluate and guide post-hoc interpretability of LLMs.
Abstract: Post-hoc interpretability of neural network models, including Large Language Models (LLMs), often aims for mechanistic interpretations: detailed, causal descriptions of model behavior. However, human interpreters may lack the capacity or willingness to formulate such intricate mechanistic models, let alone evaluate them. This paper addresses this challenge by introducing a taxonomy that dissects the overarching goal of mechanistic interpretability into constituent claims, each requiring distinct evaluation methods. In doing so, we transform these evaluation criteria into actionable learning objectives, providing a data-driven pathway to interpretability. This framework enables a methodologically rigorous yet pragmatic approach to evaluating the strengths and limitations of various interpretability tools.
Submission Number: 34