TL;DR: The notion of a mechanistic interpretation is often ad-hoc. We axiomatize it and apply the axioms to validate the mechanistic interpretations extracted in two case studies.
Abstract: Mechanistic interpretability aims to reverse engineer the computation performed by a neural network in terms of its internal components. Although there is a growing body of research on mechanistic interpretation of neural networks, the notion of a *mechanistic interpretation* itself is often ad-hoc. Inspired by the notion of abstract interpretation from the program analysis literature, which aims to develop approximate semantics for programs, we give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner. We demonstrate the applicability of these axioms by validating mechanistic interpretations in two case studies: an existing, well-known interpretability study and a new study of a Transformer-based model trained to solve the 2-SAT problem.
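To convey the flavor of such axioms, here is a minimal sketch in the spirit of abstract interpretation; the maps, metric, and tolerance below are illustrative stand-ins, not the paper's exact formalization:

```latex
% Sketch (assumed notation): f is a network component, \hat{f} its claimed
% interpretation, and \alpha an abstraction map from the network's concrete
% activations to the values the interpretation operates on. A prototypical
% approximation axiom asks that the two paths around the square agree up to
% a tolerance \epsilon under a metric d:
\[
  \forall x \in \mathcal{X}:\quad
  d\bigl(\alpha(f(x)),\; \hat{f}(\alpha(x))\bigr) \le \epsilon .
\]
% Compositionality then asks that axioms established for components f_1 and
% f_2 individually yield a corresponding axiom for the composite f_2 \circ f_1.
```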
Lay Summary: Machine learning systems based on large, opaque neural networks are becoming both more capable and more widely used in important real-world applications. However, we still don't understand how these models make their decisions.
Mechanistic interpretations are a way to explain a model's decisions with simple, easily understood programs that describe the meaning of the model's internal computations. However, validating that a claimed interpretation actually describes the behavior of the real model is challenging.
We present an approach for validating mechanistic interpretations. It translates between the values that the original model and the interpretation operate on, checking both that the two represent the same concepts and that pieces of the model can be swapped for their interpretations without affecting the model's behavior. A sketch of such a swap test appears below.
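As a rough illustration of the swap check, here is a minimal sketch: `model`, `layer`, `interpretation`, `abstract`, and `concretize` are hypothetical stand-ins, not the API of the linked repository.

```python
import torch

def swap_consistency(model, layer, interpretation, abstract, concretize,
                     inputs, tol=1e-2):
    """Replace one component's output with the interpretation's output
    (routed through the abstraction/concretization maps) and check that
    the model's final output is approximately unchanged.

    `layer` is any nn.Module inside `model`; `interpretation` is the
    claimed human-readable program for that component; `abstract` and
    `concretize` translate between the model's activations and the
    values the interpretation operates on (all hypothetical names).
    """
    def patch(module, args, output):
        # Translate activations to abstract values, run the interpreted
        # program, translate back, and substitute for the real output.
        return concretize(interpretation(abstract(output)))

    with torch.no_grad():
        baseline = model(inputs)
        handle = layer.register_forward_hook(patch)
        try:
            patched = model(inputs)
        finally:
            handle.remove()
    return (baseline - patched).abs().max().item() <= tol
```

The forward hook intercepts only the chosen component, so a passing check says that, up to the tolerance, the rest of the model cannot distinguish the component from its interpretation.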
We apply our approach in two case studies: an original mechanistic interpretation of a model trained to perform a simple logic task, and a well-known analysis of a model trained on modular addition.
Link To Code: https://github.com/nilspalumbo/axiomatic-validation
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: mechanistic interpretability, axiomatic approach
Submission Number: 7836