Keywords: Circuit analysis, Understanding high-level properties of models, Applications of interpretability
Other Keywords: modular addition, geometry, topology, representation learning, manifold hypothesis, universality
TL;DR: We find that networks (MLPs, transformers) with learnable embeddings approximate a torus-to-circle map, differing in how they factor it. In particular, the clock and pizza algorithms are the same.
Abstract: Using tools from geometry and topology, we reveal that the circuits learned by
neural networks trained on modular addition are simply different implementations
of one global algorithmic strategy. We show that all architectures previously
studied on this problem learn topologically equivalent algorithms. Notably, this
finding concretely reveals that what appeared to be disparate circuits emerging
for modular addition in the literature are actually equivalent from a topological
lens. Furthermore, we introduce a new neural architecture that truly does learn a
topologically distinct algorithm. However, we resolve this discrepancy under the lens
of geometry and recover universality by showing that all networks studied learn modular
addition by approximating a torus-to-circle map. They differ in how they factor this map:
either via 2D toroidal intermediate representations, or via combinations of
certain projections of this 2D torus. As a result, we argue that our geometric and
topological perspective on neural circuits restores the universality hypothesis.
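As a minimal numerical sketch (not code from the submission), the torus-to-circle picture can be illustrated with the standard single-frequency "clock" construction: each residue is embedded as an angle on a circle, a pair of inputs then lives on a 2D torus, and the angle-addition identities collapse that torus point to the circle point encoding the sum mod p. The modulus `p` and frequency `k` below are illustrative choices.

```python
import numpy as np

p = 7  # illustrative modulus

def embed(x, k=1):
    # map residue x to a point on the unit circle (one Fourier frequency k)
    theta = 2 * np.pi * k * x / p
    return np.cos(theta), np.sin(theta)

def torus_to_circle(a, b, k=1):
    # the torus point (theta_a, theta_b) maps to the circle point
    # theta_a + theta_b, computed via the angle-addition identities;
    # this angle encodes (a + b) mod p
    ca, sa = embed(a, k)
    cb, sb = embed(b, k)
    return ca * cb - sa * sb, sa * cb + ca * sb  # cos, sin of theta_a + theta_b

def decode(c, s):
    # read out the residue whose circle embedding is nearest to (c, s)
    angles = 2 * np.pi * np.arange(p) / p
    return int(np.argmin((np.cos(angles) - c) ** 2 + (np.sin(angles) - s) ** 2))

# the composite map computes modular addition exactly
assert all(decode(*torus_to_circle(a, b)) == (a + b) % p
           for a in range(p) for b in range(p))
```

A trained network only approximates these maps, and (per the abstract) different architectures factor the torus-to-circle map differently, but the underlying composite is the same.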
Submission Number: 279