Using Degeneracy in the Loss Landscape for Mechanistic Interpretability

Lucius Bushnaq; Jake Mendel; Stefan Heimersheim; Dan Braun; Nicholas Goldowsky-Dill; Kaarel Hänni; Cindy Wu; Marius Hobbhahn

Published: 24 Jun 2024, Last Modified: 31 Jul 2024, Venue: ICML 2024 MI Workshop Spotlight, License: CC BY 4.0

Keywords: Mechanistic Interpretability, Singular Learning Theory, Loss Landscapes

TL;DR: We identify ways that network parameters can be degenerate and introduce the Interaction Basis, a representation of the network that is invariant to degeneracies from linear dependence of activations or Jacobians.

Abstract: Mechanistic Interpretability aims to reverse engineer the algorithms implemented by neural networks by studying their weights and activations. One obstacle to this is that many of the parameters inside a network are not involved in the computation the network implements. These degenerate parameters may obfuscate internal structure. Singular Learning Theory teaches us that neural network parameterizations are biased towards being more degenerate, and that parameterizations with more degeneracy are likely to generalize further. We identify three ways that network parameters can be degenerate: (1) linear dependence between activations in a layer; (2) linear dependence between gradients passed back to a layer; (3) ReLUs that fire on the same subset of datapoints. We propose that if we can represent a neural network in a way that is invariant to reparameterizations that exploit these degeneracies, then this representation is likely to be more interpretable. We introduce the Interaction Basis, a tractable technique to obtain a representation that is invariant to degeneracies from linear dependence of activations or Jacobians.
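
Illustration: the first kind of degeneracy named in the abstract, linear dependence between activations in a layer, can be seen in a toy setting. The sketch below is not the Interaction Basis and is not taken from the paper. It assumes a made-up activation matrix `acts` in which one hidden unit is an exact linear combination of two others, and a random next-layer weight matrix `W`; all names and sizes are hypothetical. It shows that the weights can then be moved along the null space of the activations without changing any output on the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy layer: 200 datapoints, 4 hidden units, 3 outputs.
n_data, width, n_out = 200, 4, 3
acts = rng.normal(size=(n_data, width))
# Assumption for illustration: unit 3 is an exact linear combination
# of units 0 and 1 on every datapoint.
acts[:, 3] = 2.0 * acts[:, 0] - acts[:, 1]

W = rng.normal(size=(width, n_out))  # next-layer weights (no bias, no nonlinearity)

# Find directions v in activation space with acts @ v == 0 for all datapoints.
_, s, vt = np.linalg.svd(acts, full_matrices=False)
rank = int(np.sum(s > 1e-8 * s.max()))
null_basis = vt[rank:].T             # shape (width, width - rank)

# Any update of the form null_basis @ C leaves the layer's outputs unchanged.
C = rng.normal(size=(null_basis.shape[1], n_out))
W_new = W + null_basis @ C

# Number of weight directions that are free on this dataset.
free_directions = null_basis.shape[1] * n_out

print("rank of activations:", rank)
print("free weight directions:", free_directions)
print("outputs unchanged:", np.allclose(acts @ W, acts @ W_new))
```

In this toy case the activations span only 3 of the 4 hidden dimensions, so each of the 3 output columns of `W` has one direction along which it can vary freely. A representation that is invariant to such reparameterizations would treat `W` and `W_new` as the same network.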

Submission Number: 79
