Keywords: language models, pretraining, synthetic procedurally-generated data, algorithmic reasoning
TL;DR: Pretraining transformers on algorithmically-generated structured data before standard training accelerates learning and improves performance across natural language, code, and informal mathematics.
Abstract: Pretraining on rich web-scale corpora is the de facto paradigm for building language models.
We study an alternative setting where the model is initially exposed to abstract structured data,
as a means to ease the subsequent acquisition of semantic knowledge,
much as mastering logic and mathematics can support higher reasoning in humans.
We specifically focus on *procedural data* generated
by formal languages and other simple algorithms.
**Method and findings.**
We first use small models to identify algorithmic skills that different forms of procedural data can improve, often significantly.
For example, on a diagnostic task for context recall (Needle-in-a-haystack), accuracy jumps from 10% to 98% when pretraining on Dyck sequences (balanced brackets).
Second, we study how these gains transfer from abstract to semantic domains in larger models.
We find that procedural pretraining significantly improves performance on natural language, code, and informal mathematics
(C4, CodeParrot, and DeepMind-Math datasets), using as little as 0.1% extra procedural data.
Notably, procedural pretraining also enables models to reach the same loss value with only 55%, 67%, and 86% of the original data from these datasets, respectively.
Third, we explore the mechanisms behind these effects.
We find that procedural pretraining instils non-trivial structure in both the attention and MLP layers, with the former proving particularly important for code and the latter for natural language.
We also lay a path for combining the benefits of different forms of procedural data.
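As a concrete illustration of the kind of procedural data discussed above, the sketch below generates Dyck-style sequences of balanced brackets. The abstract does not specify the exact generator, so the bracket vocabulary, sequence length, and nesting statistics here are illustrative assumptions rather than the paper's procedure.

```python
import random

def sample_dyck(length=64, n_bracket_types=2, p_open=0.6, seed=None):
    """Sample one balanced-bracket (Dyck-style) sequence of the given length.

    Hypothetical sketch: bracket vocabulary, length, and nesting statistics
    are illustrative assumptions, not the paper's exact generator.
    """
    assert length % 2 == 0, "a balanced sequence must have even length"
    rng = random.Random(seed)
    pairs = [("(", ")"), ("[", "]"), ("{", "}"), ("<", ">")][:n_bracket_types]
    out, stack = [], []
    while len(out) + len(stack) < length:
        # Open a new bracket with probability p_open (always if nothing is
        # open); otherwise close the most recently opened bracket.
        if not stack or rng.random() < p_open:
            opener, closer = rng.choice(pairs)
            out.append(opener)
            stack.append(closer)
        else:
            out.append(stack.pop())
    out.extend(reversed(stack))  # close whatever is still open
    return "".join(out)

# Example: build a small procedural corpus for the pretraining phase.
corpus = [sample_dyck(length=128, seed=i) for i in range(1_000)]
print(corpus[0])
```

In line with the TL;DR, such sequences would be tokenized and used as an initial pretraining corpus before switching to standard data such as C4 or CodeParrot; per the results above, as little as 0.1% extra procedural data is reported to help.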
**Implications.**
Procedural pretraining is a remarkably simple means of improving performance and speeding up training for transformers.
It ultimately suggests the possibility of disentangling the acquisition of knowledge from reasoning in LLMs.
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 12700