Keywords: mechanistic interpretability, proof, guarantees, interpretability, numerical integration
TL;DR: We provide mathematical and anecdotal evidence that an MLP layer in a neural network implements numerical integration.
Abstract: Extending the analysis of Nanda et al. (2023) and Zhong et al. (2023), we offer an end-to-end interpretation of the one-layer, MLP-only modular addition transformer model with symmetric embeddings.
We present a clear and mathematically rigorous description of the computation at each layer, in preparation for the proof-based verification approach set out in concurrent work under review.
In doing so, we present a new interpretation of MLP layers: that they implement quadrature schemes to carry out numerical integration, and we provide anecdotal and mathematical evidence in support.
This overturns the existing idea that neurons in neural networks are merely on-off switches that test for the presence of ``features'' -- instead, multiple neurons can be combined in non-trivial ways to produce continuous quantities.
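To make the quadrature claim concrete, here is a minimal sketch of the general idea (our own illustration, not the trained model's learned weights): hidden ``neurons'' evaluate an integrand at fixed sample points, and the linear readout weights act as trapezoidal quadrature weights, so the layer's output approximates an integral rather than detecting a feature. The node placement, the integrand cos(x - θ), and the weight values are assumptions chosen for illustration.

```python
import numpy as np

# Sketch: a linear readout over nonlinear "neurons" as numerical integration.
# Assumed setup (not the paper's construction): each hidden neuron evaluates
# the integrand at one quadrature node; the output weights are trapezoidal
# quadrature weights, so the readout computes an approximate integral.

n = 64                              # number of hidden neurons / quadrature nodes
xs = np.linspace(0.0, np.pi, n)     # one node per neuron
w = np.full(n, np.pi / (n - 1))     # trapezoidal rule weights on [0, pi]
w[0] *= 0.5
w[-1] *= 0.5

def hidden(theta):
    # Each neuron's activation: the integrand evaluated at its node.
    # (Here cos(x - theta); the real model's activations differ.)
    return np.cos(xs - theta)

def readout(theta):
    # Linear readout = weighted sum of activations = quadrature.
    return w @ hidden(theta)

theta = 0.3
approx = readout(theta)
exact = 2.0 * np.sin(theta)         # closed form of ∫_0^π cos(x - θ) dx
print(approx, exact)                # close agreement: the layer "integrates"
```

The point of the sketch is that no single neuron is an on-off ``feature'' detector; the continuous output emerges only from the weighted combination of many neurons.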
Submission Number: 26