Keywords: Learning Dynamics, Mechanistic Interpretability, LLM Knowledge Acquisition, Continual Pre-Training
Abstract: Human beings primarily understand the world through concepts (e.g., $\textit{dog}$), abstract mental representations that structure perception, reasoning, and learning. However, how large language models (LLMs) acquire, retain, and forget such concepts during continual pre-training remains poorly understood. In this work, we study how individual concepts are acquired and forgotten, as well as how multiple concepts interact through interference and synergy. We further link these behavioral dynamics to LLMs' internal $\textbf{Concept Circuits}$, computational subgraphs associated with specific concepts, and incorporate $\textbf{Graph Metrics}$ to characterize circuit structure. Our analysis reveals that: (1) LLMs' concept circuits provide a non-trivial, statistically significant signal of concept learning and forgetting; (2) concept circuits exhibit a stage-wise temporal pattern during continual pre-training, with an early increase followed by a gradual decrease and stabilization; (3) concepts with larger learning gains tend to exhibit greater forgetting under subsequent training; (4) semantically similar concepts induce stronger interference than weakly related ones; and (5) pre-training on one knowledge type can facilitate learning of another, with highly directional and uneven benefits across ordered pairs. Together, our findings offer a circuit-level view of concept learning dynamics and inform the design of more interpretable and robust concept-aware training strategies for LLMs.
Paper Type: Long
Research Area: Language Models
Research Area Keywords: pre-training, continual learning, knowledge tracing, probing
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 8516
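To make the abstract's notion of graph metrics over concept circuits concrete, here is a minimal sketch, assuming a concept circuit can be represented as a directed acyclic subgraph of the model's computational graph. The component names, edges, and choice of metrics below are illustrative assumptions, not the paper's actual circuit-extraction procedure or metric definitions.

```python
# Illustrative sketch: characterizing a "concept circuit" with standard graph
# metrics. The circuit here is a hypothetical placeholder, not from the paper.
import networkx as nx

# Hypothetical circuit: nodes are model components (embeddings, attention
# heads, MLP blocks, logits); edges are retained attribution links.
circuit = nx.DiGraph()
circuit.add_edges_from([
    ("embed", "attn.0.3"),
    ("attn.0.3", "mlp.1"),
    ("mlp.1", "attn.2.7"),
    ("attn.2.7", "logits"),
    ("embed", "mlp.1"),
])

# A few standard structural metrics one might track across training steps.
metrics = {
    "num_nodes": circuit.number_of_nodes(),
    "num_edges": circuit.number_of_edges(),
    "density": nx.density(circuit),
    "avg_in_degree": sum(d for _, d in circuit.in_degree()) / circuit.number_of_nodes(),
    "depth": nx.dag_longest_path_length(circuit),  # longest path through the circuit
}
print(metrics)
```

Tracking such metrics per concept over continual pre-training checkpoints is one plausible way to observe the stage-wise pattern (early growth, gradual shrinkage, stabilization) described in the abstract.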