Pretraining with Masked Backstories in a Toy World

Published: 02 Mar 2026 · Last Modified: 03 Mar 2026 · Sci4DL 2026 · CC BY 4.0
Keywords: context-enhanced learning, toy problems, modulating emergence
TL;DR: We study an algorithmic toy problem in which small-scale models (millions of parameters) exhibit both in-weights learning (IWL) and in-context learning (ICL), and we investigate a technique for modulating the development of a capability during training.
Abstract: Context-enhanced learning (CEL) augments the training context of large language models (LLMs) with additional masked context to accelerate learning. Because CEL relies on in-context learning (ICL) abilities, it has previously been explored only during finetuning of LLMs with billions of parameters. Here, we leverage a toy world (symbolically labeled, randomly interleaved vector time series from linear deterministic dynamical systems) that admits LLM-style next-token pretraining and has been shown to exhibit multiple emergences of different ICL/recall abilities in tiny transformer models with mere millions of parameters. In this toy world, we observe a late transition from ICL to in-weights learning that coincides with a degradation of ICL performance on time series from systems never seen during training. We enhance pretraining with additional masked context that allows the model to make near-perfect predictions on the original training examples. Masking this additional context disincentivizes the model from memorizing it, and the capability of perfect prediction on the training example disincentivizes the model from memorizing the rest of the example. Not only does this enhancement suppress in-weights learning of the specific training systems, it also improves the quality of ICL in the model, including on the seemingly unrelated task of associative recall. Even more surprisingly, a further experiment shows that although such a model sees losses (and hence gradients) during training only for tokens that are perfectly predictable, it generalizes well at test time on tokens that are not perfectly predictable, nearly matching the performance of the optimal solution in those cases.
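To make the setup in the abstract concrete, here is a minimal, hypothetical sketch of the kind of data generation and loss masking it describes. Every concrete choice here is an illustrative assumption, not the paper's actual construction: the function name make_example, representing the masked "backstory" as the flattened transition matrices, the stability rescaling, and all dimensions are ours.

```python
import torch

def make_example(n_systems=3, dim=4, steps=24, seed=0):
    """One training example: a masked 'backstory' (assumed here to be the
    system matrices) followed by a randomly interleaved, symbolically
    labeled time series from linear deterministic systems x_{t+1} = A_i x_t."""
    g = torch.Generator().manual_seed(seed)
    As, states = [], []
    for _ in range(n_systems):
        A = torch.randn(dim, dim, generator=g)
        A = 0.9 * A / torch.linalg.eigvals(A).abs().max()  # keep dynamics stable
        As.append(A)
        states.append(torch.randn(dim, generator=g))
    labels, vectors = [], []
    order = torch.randint(n_systems, (steps,), generator=g)  # random interleaving
    for i in order.tolist():
        states[i] = As[i] @ states[i]  # deterministic linear step
        labels.append(i)
        vectors.append(states[i])
    # Backstory: one token per system, carrying its flattened transition matrix.
    backstory = torch.stack([A.flatten() for A in As])  # (n_systems, dim*dim)
    series = torch.stack(vectors)                       # (steps, dim)
    # Loss mask: gradients flow only through the time-series continuation,
    # which the backstory makes (near-)perfectly predictable; the backstory
    # tokens themselves are never predicted, so memorizing them is useless.
    loss_mask = torch.cat([torch.zeros(n_systems, dtype=torch.bool),
                           torch.ones(steps, dtype=torch.bool)])
    return backstory, torch.tensor(labels), series, loss_mask

backstory, labels, series, loss_mask = make_example()
print(backstory.shape, labels.shape, series.shape, int(loss_mask.sum()))
```

The loss_mask is the crux of the mechanism described: the backstory is visible as input but excluded from the loss, so the model is rewarded only for using it in-context, not for memorizing it or the continuation it determines.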
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 108