Control and Predictivity in Neural Interpretability

Published: 30 Sept 2025, Last Modified: 30 Sept 2025 · Mech Interp Workshop (NeurIPS 2025) Poster · CC BY 4.0
Keywords: Causal interventions, Circuit analysis, Foundational work
Other Keywords: representational divergence
TL;DR: We empirically and theoretically explore the divergence between native and causally intervened representations, and we discuss the implications.
Abstract: For the goals of mechanistic interpretability, correlational methods are typically easy to scale and use, and can provide strong predictivity of Neural Network (NN) representations. However, they can lack causal fidelity, which limits their relevance to NN computation and behavior. Causal approaches, by contrast, offer strong behavioral control via targeted interventions, making them superior for understanding computational cause and effect. But what if causal methods rely on out-of-distribution representations to produce their effects? Does this raise concerns about the faithfulness of claims made about the NN's native computations? In this work, we explore the possibility of this representational divergence. We ask to what degree causally intervened representations diverge from the native distribution, and in what situations this divergence is acceptable. Using Distributed Alignment Search (DAS) as a case study, we first demonstrate that causally intervened representations can diverge even when the interventions provide strong behavioral control, and we show that stronger behavioral control can correlate with more divergent intervened representations. We then give a theoretical discussion of sufficient conditions under which this divergence can arise, in both innocuous and potentially pernicious ways. We further show theoretically that causal interventions typically assume principles of additivity, calling into question the use of nonlinear methods for causal manipulations. Lastly, for cases in which representational divergence is undesirable, we demonstrate how to incorporate a counterfactual latent loss that constrains intervened representations to remain closer to the native distribution. Together, our results suggest that although causal methods are superior for most interpretability goals, a complete account of NN representations balances computational control with neural predictivity, with the optimal weighting depending on the goals of the research.
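
Below is a minimal sketch of how a DAS-style interchange intervention might be combined with an auxiliary counterfactual latent penalty, assuming a PyTorch setting. The orthogonal-rotation parametrization, the first-k-coordinates subspace, the MSE form of the latent penalty, the `resume_forward` hook, and the 0.1 weighting are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch (not the paper's exact method): a DAS-style interchange
# intervention on a hidden vector, plus an assumed counterfactual latent
# penalty that pulls the intervened state toward a native activation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class InterchangeIntervention(nn.Module):
    """Swap the first `k` coordinates of two hidden states in a learned orthogonal basis."""

    def __init__(self, d_model: int, k: int):
        super().__init__()
        self.k = k
        # Orthogonal parametrization keeps the learned map a rotation, so it is invertible.
        self.rot = nn.utils.parametrizations.orthogonal(
            nn.Linear(d_model, d_model, bias=False)
        )

    def forward(self, h_base: torch.Tensor, h_source: torch.Tensor) -> torch.Tensor:
        z_base = self.rot(h_base)    # rotate base hidden state into the aligned basis
        z_src = self.rot(h_source)   # rotate source hidden state into the same basis
        z_int = torch.cat([z_src[..., : self.k], z_base[..., self.k:]], dim=-1)
        # Rotate back: inverse of x -> x @ W^T is x -> x @ W for orthogonal W.
        return z_int @ self.rot.weight


def intervention_losses(resume_forward, intervene, h_base, h_source, cf_targets,
                        latent_weight: float = 0.1):
    """Behavioral loss on counterfactual labels plus an assumed latent penalty.

    `resume_forward` is a hypothetical hook that finishes the model's forward
    pass from the intervened hidden state and returns logits.
    """
    h_int = intervene(h_base, h_source)
    behavior_loss = F.cross_entropy(resume_forward(h_int), cf_targets)
    # Assumed form of the counterfactual latent loss: keep the intervened state
    # close to a native activation (here, the source run's hidden state).
    latent_loss = F.mse_loss(h_int, h_source)
    return behavior_loss + latent_weight * latent_loss
```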
Submission Number: 230