Keywords: LLMs, implicit meta learning, training dynamics
TL;DR: We study the extent to which Transformers can behave like sequence models of their training data by anticipating their training stream in structured training setups.
Abstract: Recent work showed that, in cyclic fine-tuning settings, transformers exhibit striking \emph{anticipatory recovery}: the loss on an upcoming document decreases \emph{before} that document is revisited.
We extend this phenomenon beyond individual-document cycling to bursty pre-training over mixtures of class-conditional distributions, where examples from the same class appear in predictable bursts -- so the model is trained on one class for several steps, then on the next one, and so on.
We show that Transformers anticipate upcoming classes despite having no explicit memory of training history, and that \emph{anticipation is linked to a lower training loss}.
In our experiments, models realize a substantial fraction of the anticipatory advantage achievable by sequence models that explicitly learn class order.
Additionally, we demonstrate anticipation in a setting where the next class can only be predicted by knowing the previous two classes, suggesting that Transformers can partially behave like sequence models over the training stream itself.
To study the optimization dynamics underlying anticipation, we analyze the implicit bias of bursty mini-batch training and identify an implicit alignment pressure between temporally adjacent class gradients.
Finally, we show that structured training order reshapes measured relationships between classes, suggesting that training order itself can act as a source of representational bias during optimization.
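As an illustration of the bursty ordering described in the abstract, the following is a minimal sketch of a cyclic class schedule in which the model sees one class for several consecutive steps before moving to the next. The function name `bursty_class_schedule` and the parameters `num_classes`, `burst_len`, and `num_steps` are hypothetical and chosen for illustration; the submission's actual data pipeline may differ.

```python
def bursty_class_schedule(num_classes, burst_len, num_steps):
    """Yield the class index to sample from at each training step.

    Assumption: classes are visited in a fixed cyclic order, and each
    class is held for `burst_len` consecutive steps (one "burst").
    """
    for step in range(num_steps):
        yield (step // burst_len) % num_classes


# Example: 3 classes, bursts of 2 steps, 8 training steps in total.
schedule = list(bursty_class_schedule(num_classes=3, burst_len=2, num_steps=8))
print(schedule)  # [0, 0, 1, 1, 2, 2, 0, 0]
```

Under such a schedule, anticipation would correspond to the loss on class 1 starting to fall during the final steps of the class-0 burst, before any class-1 examples are shown again.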
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 124