Mechanistic Interpretability Needs Philosophy

13 May 2025 (modified: 29 Oct 2025) · Submitted to NeurIPS 2025 Position Paper Track · CC BY 4.0
Keywords: mechanistic interpretability, philosophy, explanation, features, deception
TL;DR: We show how tools from philosophy can help make progress on three open research questions in mechanistic interpretability.
Abstract: Mechanistic interpretability (MI) aims to explain how neural networks work by uncovering their underlying causal mechanisms. As the field grows in influence, it is increasingly important to examine not just models themselves, but the assumptions, concepts, and explanatory strategies implicit in MI research. We argue that mechanistic interpretability needs philosophy: not as an afterthought, but as an ongoing partner in clarifying its concepts, refining its methods, and assessing the epistemic and ethical stakes of interpreting AI systems. Taking three open problems from the MI literature as examples, this position paper illustrates the value philosophy can add to MI research and outlines a path toward deeper interdisciplinary dialogue.
Submission Number: 275