Keywords: interpretability, mechanistic interpretability, bilinear, feature extraction, weight-based, eigenvector, eigendecomposition, tensor network
TL;DR: The close-to-linear structure of bilinear MLPs enables weight-based analysis that reveals interpretable low-rank structure across multiple modalities.
Abstract: A mechanistic understanding of how MLPs do computation in deep neural networks remains elusive. Current interpretability work can extract features from hidden activations over an input dataset but generally cannot explain how MLP weights construct features. One challenge is that element-wise nonlinearities introduce higher-order interactions and make it difficult to trace computations through the MLP layer. In this paper, we analyze bilinear MLPs, a type of Gated Linear Unit (GLU) without any element-wise nonlinearity that nevertheless achieves competitive performance. Bilinear MLPs can be fully expressed in terms of linear operations using a third-order tensor, allowing flexible analysis of the weights. Analyzing the spectra of bilinear MLP weights using eigendecomposition reveals interpretable low-rank structure across toy tasks, image classification, and language modeling. We use this understanding to craft adversarial examples, uncover overfitting, and identify small language model circuits directly from the weights alone. Our results demonstrate that bilinear layers serve as an interpretable drop-in replacement for current activation functions and that weight-based interpretability is viable for understanding deep-learning models.
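The weight-based analysis described in the abstract can be illustrated with a small NumPy sketch. It assumes the bilinear layer has the form g(x) = (Wx) ⊙ (Vx), which the abstract does not spell out. The names `W`, `V`, `u`, `bilinear_mlp`, and `interaction_matrix` are hypothetical and chosen for illustration, and the weights are random rather than trained. Under that assumption, the projection of the layer output onto an output direction `u` is a quadratic form in the input whose symmetric matrix can be eigendecomposed.

```python
import numpy as np

def bilinear_mlp(W, V, x):
    # Assumed layer form: element-wise product of two linear maps,
    # with no element-wise nonlinearity.
    return (W @ x) * (V @ x)

def interaction_matrix(W, V, u):
    # Contract the third-order tensor B[a, i, j] = W[a, i] * V[a, j]
    # with the output direction u, giving Q[i, j] = sum_a u[a] * B[a, i, j].
    Q = np.einsum("a,ai,aj->ij", u, W, V)
    # Only the symmetric part contributes to the quadratic form x^T Q x.
    return 0.5 * (Q + Q.T)

rng = np.random.default_rng(0)
d_in, d_hidden = 6, 4
W = rng.normal(size=(d_hidden, d_in))  # random stand-in weights
V = rng.normal(size=(d_hidden, d_in))
u = rng.normal(size=d_hidden)          # hypothetical output direction
x = rng.normal(size=d_in)

Q = interaction_matrix(W, V, u)
eigvals, eigvecs = np.linalg.eigh(Q)   # real spectrum because Q is symmetric

direct = u @ bilinear_mlp(W, V, x)     # output projected onto u
quadratic = x @ Q @ x                  # same value as a quadratic form
spectral = np.sum(eigvals * (eigvecs.T @ x) ** 2)  # same value from the spectrum

print(np.allclose(direct, quadratic), np.allclose(direct, spectral))
```

The three quantities agree, so the contribution of each eigenvector to a chosen output can be read from the weights without running the model on a dataset. With random weights the spectrum carries no meaning; any low-rank structure would come from trained weights.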
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7965