Mechanistic?

Naomi Saphra; Sarah Wiegreffe

Published: 21 Sept 2024, Last Modified: 20 Oct 2024

Venue: BlackboxNLP 2024

License: CC BY 4.0

Track: Full paper

Keywords: mechanistic interpretability; position; survey

Abstract: The rise of the term "mechanistic interpretability" has accompanied increasing interest in understanding neural models---particularly language models. However, this jargon has also led to a fair amount of confusion. So, what does it mean to be mechanistic? We describe four uses of the term in interpretability research. The most narrow technical definition requires a claim of causality, while a broader technical definition allows for any exploration of a model's internals. However, the term also has a narrow cultural definition describing a cultural movement. To understand this semantic drift, we present a history of the NLP interpretability community and the formation of the separate, parallel mechanistic interpretability community. Finally, we discuss the broad cultural definition---encompassing the entire field of interpretability---and why the traditional NLP interpretability community has come to embrace it. We argue that the polysemy of "mechanistic" is the product of a critical divide within the interpretability community.

Submission Number: 83