Neural Networks Learn Representation Theory: Reverse Engineering how Networks Perform Group Operations

Published: 03 Mar 2023, Last Modified: 13 Mar 2023
Venue: Physics4ML (Spotlight)
Keywords: Machine Learning, Mechanistic Interpretability, Interpretability, Representation Theory, Circuits, Universality, Convergent Learning, Group Theory
TL;DR: We reverse engineer networks trained to perform group composition, show that they consistently learn an interpretable representation-theory-based algorithm, and use this as an algorithmic test bed to study the claim of 'universality' in mechanistic interpretability.
Abstract: We present a novel algorithm by which neural networks can implement composition for any finite group via mathematical representation theory: the network learns several irreducible representations of the group and converts group composition into matrix multiplication. By reverse engineering model logits and weights, we show that small networks trained on composition of group elements consistently learn this algorithm, and we confirm our understanding using ablations. We use this as an algorithmic test bed for the hypothesis of universality in mechanistic interpretability -- that different models learn similar features and circuits when trained on similar tasks. By studying networks trained on various groups and architectures, we find mixed evidence for universality: using our algorithm, we can completely characterize the family of circuits and features that networks learn on this task, but for a given network the precise circuits learned -- as well as the order in which they develop -- are arbitrary.
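For intuition, here is a minimal sketch of the kind of representation-theory algorithm the abstract describes, using the 2D irreducible representation of the symmetric group S3. This is not the paper's implementation: the logit form tr(rho(a) rho(b) rho(c)^{-1}), the parameterization of S3 as rotations and reflections, and all names are illustrative assumptions.

```python
import numpy as np

# Sketch (not the paper's code): group composition reduced to matrix
# multiplication via an irreducible representation. We use the 2D irrep
# of S3, realized as the symmetries of a triangle.

def rho(k, reflect):
    """2D irrep of S3: rotation by 2*pi*k/3, optionally followed by a reflection."""
    theta = 2 * np.pi * k / 3
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    ref = np.array([[1.0, 0.0], [0.0, -1.0]])
    return rot @ ref if reflect else rot

# Enumerate the six elements of S3 as (rotation, reflection) pairs.
elements = [(k, r) for r in (False, True) for k in range(3)]

def logits(a, b):
    """Assumed logit for each candidate c: tr(rho(a) rho(b) rho(c)^{-1}).
    The matrices are orthogonal, so the inverse is the transpose. The trace
    equals dim(rho) = 2 exactly when c = a*b, and is strictly smaller
    otherwise, so the argmax recovers the true product."""
    ab = rho(*a) @ rho(*b)
    return np.array([np.trace(ab @ rho(*c).T) for c in elements])

a, b = elements[1], elements[4]
pred = elements[int(np.argmax(logits(a, b)))]
# Check: the argmax matches the product computed directly in the irrep.
assert np.allclose(rho(*pred), rho(*a) @ rho(*b))
```

In this sketch the argmax works because the character of a nontrivial irrep is maximized at the identity; a real network would additionally need to learn the irrep matrices and sum such terms over several irreps, per the abstract.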
Supplementary Material: zip