Understanding Knowledge Acquisition and Release in Language Models via Circuits

Kiran Raja; Arav Maheria; Andrew Bae; Alan Sun

Published: 02 Mar 2026, Last Modified: 10 Apr 2026, Venue: LLA 2026 Poster, License: CC BY 4.0

Keywords: circuits, grokking, forgetting

TL;DR: We present evidence that grokking and forgetting are related through the stability of a model's circuits.

Abstract: General agents must acquire new capabilities while preserving existing ones. Two phenomena make this balance difficult: grokking, where a model abruptly shifts from memorization to generalization long after it has fit the training data; and forgetting, where previously learned skills rapidly degrade under sequential learning. Although the two are typically studied in isolation, we argue that they admit a unified mechanistic explanation. For a fixed task, we hypothesize that grokking occurs precisely when the stability of a model's circuits across subtasks increases, and forgetting precisely when it decreases. Through a case study of `Llama-3.2-1B` on factual retrieval, logical and commonsense reasoning, and bias evaluation tasks, we find evidence supporting this hypothesis. To our knowledge, circuit stability is the first architecture- and task-agnostic measure of grokking and forgetting. Our results suggest that, by leveraging mechanistic insights, generalization phase transitions can be measured directly on the training set.
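The abstract does not say how circuit stability is computed. As a purely hypothetical illustration, and not the authors' method, one could represent the circuit found for each subtask as a set of edges between model components and score stability as the mean pairwise Jaccard overlap of those edge sets. Under the hypothesis above, this score would rise across checkpoints around grokking and fall around forgetting. The edge labels (e.g. `"L0.attn"`) and function names below are assumptions made for the sketch.

```python
from itertools import combinations

# Hypothetical representation: an edge links two named model components.
Edge = tuple[str, str]


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two edge sets; defined as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def circuit_stability(circuits: list[set[Edge]]) -> float:
    """Mean pairwise Jaccard overlap of the circuits found for each subtask.

    This is an illustrative stand-in for "stability across subtasks";
    a single circuit is treated as perfectly stable.
    """
    if len(circuits) < 2:
        return 1.0
    pairs = list(combinations(circuits, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def stability_trend(per_checkpoint: list[list[set[Edge]]]) -> list[float]:
    """Change in stability between consecutive training checkpoints.

    Positive deltas would be read as consistent with grokking and
    negative deltas as consistent with forgetting (hypothetical reading).
    """
    scores = [circuit_stability(c) for c in per_checkpoint]
    return [later - earlier for earlier, later in zip(scores, scores[1:])]
```

For example, if two subtasks share one of three distinct edges at an early checkpoint, stability is 1/3; if they later use identical circuits, stability is 1.0 and the trend is positive. A Jaccard score is only one of many possible overlap measures; a weighted variant using edge attribution scores would be an equally plausible choice.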

Submission Number: 186
