Understanding Knowledge Acquisition and Release in Language Models via Circuits
Keywords: circuits, grokking, forgetting
TL;DR: We present evidence that grokking and forgetting are related through the stability of a model's circuits.
Abstract: General agents must acquire new capabilities while preserving existing ones. Two phenomena make this balance difficult: grokking, where memorization abruptly gives way to generalization during training, and forgetting, where previously learned skills rapidly degrade under sequential learning. Although the two are typically studied in isolation, we argue that they admit a unified mechanistic explanation. For a fixed task, we hypothesize that grokking occurs precisely when the stability of a model's circuits across subtasks increases, and forgetting precisely when it decreases. Through a case study of `Llama-3.2-1B` on tasks including factual retrieval, logical and commonsense reasoning, and bias evaluation, we find evidence supporting this hypothesis. To our knowledge, this is the first architecture- and task-agnostic measure for grokking and forgetting. Our results suggest that, by leveraging mechanistic insights, generalization phase transitions can be measured directly on the training set.
Submission Number: 186