What a Small Autoregressive Transformer Briefly Learns and Then Forgets: Transient Structural Capabilities and Probe-Specific Head Repurposing
Keywords: Transient Emergence, Training Dynamics, Autoregressive Transformers, Structural Probes, Activation Patching, Mechanistic Interpretability, Memorization vs. Generalization
TL;DR: We show that an 11.5M-parameter autoregressive transformer briefly learns structural probes during pretraining and then forgets them, while lexical probes remain stable. Activation patching and head ablations localize the collapse to repurposed early attention heads.
Abstract: We use a tiny autoregressive generative model (11.5M parameters, TinyStories-like corpus) as a screening testbed for memorization-versus-generalization dynamics during pretraining, and ask which behaviors the model truly internalizes and which it only briefly approximates. Most lexical or associative probes meet our emergence-time reproducibility criterion across five seeds, behaving as if the model has stably internalized the relevant token associations. Three of eight structural probes follow a different pattern. Each rises to a peak between training steps 180 and 2,130 and then collapses in every seed: end_of_sentence from 74\% to 24\%, modal_continuation from 87\% to 66\%, adjective_order from 89\% to 64\%. Same-trajectory peak/final checkpoints from two further mechanistic seeds let us localize the end_of_sentence collapse: residual-stream activation patching identifies blocks 0-2 ($+38.3$ pp at $z = 2.85$ for seed 17, $+23.3$ pp at $z = 1.83$ for seed 99), and per-head ablation picks out one critical head per seed (B0H0 / B0H1; the head identity does not replicate). A final-checkpoint probe-specificity control matches the mechanism predicted by direct logit attribution (DLA) in all three tested head-seed cases: reversal heads show opposite-sign effects across probes, while a weakening head shows no beneficial ablation signal. Both signatures argue against a general head-degradation account. We present this as a hypothesis-generating mechanistic result at one small scale, not a scaling claim. The picture is a memorization-versus-generalization gap that opens during pretraining, which we localize but do not fully explain.
Submission Number: 227