Emergent Symbol-like Number Variables in Artificial Neural Networks

TMLR Paper4243 Authors

19 Feb 2025 (modified: 14 Apr 2025) · Under review for TMLR · CC BY 4.0
Abstract: What types of numeric representations emerge in neural systems, and how can we best understand them? In this work, we interpret Neural Network (NN) solutions to sequence-based counting tasks through a variety of lenses. We ask how well NNs can be understood through the lens of interpretable Symbolic Algorithms (SAs), which perform computations using precise, abstract, mutable variables. We train GRUs, LSTMs, and Transformers using Next Token Prediction (NTP) on numeric tasks whose solutions depend on latent information in the task structure. We show through multiple causal and theoretical methods that NNs' raw activity can be interpreted through the lens of simplified SAs when the neural activity is framed in terms of interpretable subspaces rather than individual neurons. Depending on the analysis, however, these interpretations can be graded, existing on a continuum. This highlights the philosophical question of what it means to "interpret" neural activity, motivating us to introduce Alignment Functions, which add flexibility to the existing Distributed Alignment Search (DAS) method. Through our specific analyses we show the importance of causal interventions for NN interpretability; we show that recurrent models develop graded, symbol-like number variables within their neural activity; we introduce a generalization of DAS that frames NN activity in terms of linear functions of interpretable variables; and we show that Transformers must use anti-Markovian solutions (solutions that avoid using cumulative, Markovian hidden states) in the absence of sufficient attention layers. We use our results to encourage interpreting NNs through the lens of SAs using a variety of theoretic and causal analyses.
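The DAS method referenced in the abstract rests on interchange interventions performed in a learned linear subspace of the hidden state. A minimal, hypothetical sketch of such an intervention, not the authors' implementation: a fixed random orthogonal matrix `Q` stands in for the rotation that DAS would learn, and the first `k` rotated coordinates play the role of the subspace hypothesized to encode an interpretable variable (e.g. a count).

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 2  # hidden size and aligned-subspace size (illustrative values)

# DAS learns this rotation; here a random orthogonal matrix (via QR
# decomposition) is used purely for illustration.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

def interchange_intervention(h_base, h_source, Q, k):
    """Transplant the first k coordinates of the rotated source state
    into the rotated base state, then rotate back to neuron space."""
    z_base, z_source = Q @ h_base, Q @ h_source
    z_new = z_base.copy()
    z_new[:k] = z_source[:k]  # swap the hypothesized variable's subspace
    return Q.T @ z_new        # Q is orthogonal, so Q.T inverts it

h_base = rng.normal(size=d)
h_source = rng.normal(size=d)
h_patched = interchange_intervention(h_base, h_source, Q, k)
```

Under this framing, the "Alignment Functions" described in the paper relax the orthogonality constraint on `Q`, allowing general linear maps between neural activity and interpretable variables.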
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
1. We have extended the DAS framework by introducing Alignment Functions in Methods Section 3.4, which are a relaxation of the orthogonality constraint typically placed on the DAS rotation matrix.
2. We have rewritten our formulation of DAS, also in Section 3.4, for better cohesion with the introduction of Alignment Functions.
3. We have largely reorganized the paper into sections for RNN analyses and sections for Transformer analyses in order to provide greater clarity on our goals and the purpose behind our analyses.
4. We have introduced a proof of why Transformers will always use anti-Markovian solutions in the absence of sufficient layers in Results Section 4.2.1.
5. We have moved a theoretical treatment of a simplified Transformer counting setting from the supplement into the main text in Results Section 4.2.2.
6. We have redone the majority of our analyses with larger models and more satisfying intervention data, and we have added more detailed descriptions and examples of the intervention data in Supplemental Sections A.3.3-A.3.6. More specifically, whereas before we arbitrarily limited the number of steps in the demonstration phase to 3 following an intervention, we now allow any number of steps up to a count of 20 after interventions occurring in the demo phase. We have also removed intervention samples corresponding to poorly defined situations in the demo and resp count cases.
Assigned Action Editor: ~Erin_Grant1
Submission Number: 4243