Algorithmic Phase Transitions in Large Language Models: A Mechanistic Case Study of Arithmetic

NeurIPS 2024 Workshop ATTRIB, Submission 19

Published: 30 Oct 2024, Last Modified: 14 Jan 2025 · ATTRIB 2024 · CC BY 4.0
Keywords: mechanistic interpretability, language models, logical reasoning
Abstract: The zero-shot capabilities of large language models make them powerful tools for solving a range of tasks without explicit training. However, it remains unclear how these models achieve such performance, or why they can zero-shot some tasks but not others. In this paper, we shed some light on this phenomenon by investigating algorithmic stability in language models: how perturbations in task specifications may change the problem-solving strategy employed by the model. While certain tasks may benefit from algorithmic instability (for example, sorting or searching under different assumptions), we focus on a task where algorithmic stability is needed: two-operand arithmetic. Surprisingly, even on this straightforward task, we find that `Gemma-2-2b` employings substantially different computational models on closely related subtasks. Our findings suggest that algorithmic instability may be a contributing factor to language models' poor zero-shot performance across many logical reasoning tasks, as they struggle to abstract different problem-solving strategies and smoothly transition between them.
Submission Number: 19