Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration

20 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: mechanistic interpretability, proof, guarantees, interpretability, numerical integration
TL;DR: We provide mathematical and anecdotal evidence that an MLP layer in a neural network implements numerical integration.
Abstract: The goal of mechanistic interpretability is to discover the simple, low-rank algorithms implemented by models. While we can compress activations into features, compressing nonlinear feature-maps---like MLP layers---is an open problem. In this work, we present the first case study in rigorously compressing nonlinear feature-maps. We work in the classic setting of modular addition models (Nanda et al., 2023) and target a non-vacuous bound on the behavior of the ReLU MLP, computable in time linear in the parameter count of the circuit. To study the ReLU MLP analytically, we use the infinite-width lens, which turns post-activation matrix multiplications into approximate integrals. We discover a novel interpretation of the MLP layer in one-layer transformers implementing the “pizza” algorithm (Zhong et al., 2023): the MLP can be understood as evaluating a quadrature scheme, where each neuron computes the area of a rectangle under the curve of a trigonometric integral identity. Our code is available at [https://tinyurl.com/mod-add-integration](https://tinyurl.com/mod-add-integration).
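The quadrature picture in the abstract can be illustrated with a toy sketch (not the authors' code; the specific integrand is a hypothetical stand-in): each "neuron" contributes the area of one rectangle in a midpoint Riemann sum for a trigonometric integral, so summing over neurons approximates the integral.

```python
import numpy as np

# Hypothetical illustration of the quadrature interpretation: each "neuron"
# evaluates one rectangle of a midpoint Riemann sum. As an example integrand,
# approximate  I = ∫_0^{2π} ReLU(cos θ) · cos θ dθ = π/2.
N = 512  # number of neurons, i.e. rectangles in the quadrature scheme
width = 2 * np.pi / N
theta = (np.arange(N) + 0.5) * width  # midpoint-rule nodes

relu = lambda x: np.maximum(x, 0.0)
areas = relu(np.cos(theta)) * np.cos(theta) * width  # one rectangle per neuron
approx = areas.sum()  # converges to π/2 as N grows
```

The point of the analogy is that the post-activation sum over neurons, viewed at large width, is exactly this kind of rectangle sum, which is why the infinite-width lens turns it into an integral.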
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2241