Keywords: language models, linear probes, training-order memorization, training-order recency, activation analysis, sequential fine-tuning
TL;DR: Language models appear to memorize when they learned information, with probes achieving ~90% accuracy at distinguishing whether entities appeared early vs. late in training.
Abstract: Language models' activations appear to linearly encode how recently training data was seen. We sequentially fine-tune Llama-3.2-1B on two disjoint but otherwise similar datasets about named entities, then train linear probes on the activations of the fine-tuned model. The probes accurately ($\sim$90\%) distinguish "early" vs. "late" entities, generalizing to entities unseen during the probes' own training. Furthermore, the model can be fine-tuned to explicitly report an unseen entity's training stage ($\sim$80\% accuracy). Similar experiments with sequential fine-tuning on six disjoint datasets confirm a linear direction tracking the order of learning. Notably, this temporal signal does not seem clearly attributable to simple differences in activation magnitudes or output logit statistics. Our results reveal a fundamental mechanism enabling models to differentiate information by its acquisition time, and carry significant implications for how they might form beliefs, manage conflicting data, and respond to knowledge modifications.
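The probing setup described in the abstract can be illustrated with a minimal sketch (not the authors' code): extract per-entity activations from the fine-tuned model and fit a linear classifier to separate early-stage from late-stage entities, evaluating on held-out entities. The names `model`, `tokenizer`, `early_entities`, and `late_entities`, as well as the choice of layer and mean-pooling over entity tokens, are illustrative assumptions.

```python
# Minimal sketch, assuming a Hugging Face causal LM fine-tuned sequentially on
# "early" then "late" entity datasets. Entity lists and pooling choices are hypothetical.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def entity_activation(model, tokenizer, text, layer=-1):
    """Mean hidden-state activation at a chosen layer over the entity's tokens."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0].mean(dim=0).float().numpy()

def probe_training_stage(model, tokenizer, early_entities, late_entities, layer=-1):
    """Fit a linear probe to predict training stage; return held-out accuracy."""
    X = np.stack([entity_activation(model, tokenizer, e, layer)
                  for e in early_entities + late_entities])
    y = np.array([0] * len(early_entities) + [1] * len(late_entities))
    # Hold out entities so the probe is scored on names it never saw during its own training.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe, probe.score(X_te, y_te)
```

Under the paper's reported results, a probe of this form reaches roughly 90% held-out accuracy; the sketch only shows the general recipe, not the authors' exact layer choice or pooling.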
Submission Number: 45