Keywords: mechanistic interpretability, neural representations, manifold hypothesis, group multiplication, divide and conquer, Cayley graph
TL;DR: We show deep neural networks learn different implementations of one divide-and-conquer algorithm across random seeds.
Abstract: We find that multilayer perceptrons and transformers both learn instantiations of the same divide-and-conquer algorithm, solving dihedral group multiplication with logarithmic feature efficiency. Applying principal component analysis to clusters of neurons with similar activation behavior reveals remarkably clear structure: \textit{the neural representations correspond to Cayley graphs.} Notably, these Cayley graphs are the low-dimensional manifolds that networks are predicted to learn under the manifold hypothesis, so each neural representation learned by a network corresponds to a manifold. To our knowledge, our methodology makes this the first work to fully characterize the neural representations distributed over many neurons on a task.
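The abstract describes applying PCA to clusters of neurons grouped by activation similarity. A minimal sketch of that kind of pipeline is below, assuming k-means clustering over neuron activation profiles and a 2-component PCA per cluster; all names, data shapes, and the cluster count are illustrative placeholders, not the paper's actual pipeline.

```python
# Hedged sketch: cluster hidden neurons by activation similarity, then run
# PCA within each cluster. The placeholder activations stand in for
# hidden-layer activations collected over inputs of a dihedral group
# multiplication task; the paper's real procedure may differ.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
activations = rng.standard_normal((512, 256))  # (n_inputs, n_neurons), placeholder

# Cluster neurons (columns) by their activation profiles across inputs.
n_clusters = 8  # assumed; the abstract does not state a cluster count
neuron_profiles = activations.T  # (n_neurons, n_inputs)
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(
    neuron_profiles
)

# Project inputs onto each cluster's top principal components; plotting these
# projections is where Cayley-graph structure would be inspected.
for c in range(n_clusters):
    cluster_acts = activations[:, labels == c]  # (n_inputs, cluster_size)
    if cluster_acts.shape[1] < 2:
        continue  # PCA with 2 components needs at least 2 neurons
    coords = PCA(n_components=2).fit_transform(cluster_acts)
    print(f"cluster {c}: {cluster_acts.shape[1]} neurons, "
          f"PCA coords shape {coords.shape}")
```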
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 23640