Stats or Facts: Decomposing Generalization in Language Models with Small-Scale Models

Published: 10 Jun 2025 · Last Modified: 15 Jul 2025 · MOSS@ICML2025 Oral · CC BY 4.0
Keywords: factual recall, diversity, markov chains, language models, transformers, training dynamics
TL;DR: In a synthetic playground that disentangles statistical patterns from factual relations, we empirically study training dynamics and show how contextual diversity shapes the interplay between the two.
Abstract: Large language models learn both statistical patterns that make text fluent and factual associations between specific tokens that represent knowledge. Because natural language interweaves linguistic patterns with factual content, systematically studying this capability is difficult. To address this, we introduce a Small-Scale Data Model (SSDM) designed to disentangle these components. The SSDM consists of a statistical stream of generic tokens, endowed with designated positional information, which is composed with a separate factual stream of source-target token pairs representing knowledge. Partitioning the generating distribution of the statistical stream into sub-distributions, which we term templates, allows us to: (i) Independently vary the format of the templates (i.e., contextual structure) and the frequency with which facts appear within each template during training (i.e., contextual diversity); (ii) Measure both in-distribution and out-of-distribution generalization; and (iii) Distinguish between the statistical, structural, and factual aspects of language model generalization. We demonstrate the flexibility of the SSDM by reporting example findings concerning: (a) the potentially catastrophic impact of low contextual diversity on factual recall, statistical generalization, or both, depending on the contextual structure; (b) stage-wise learning dynamics; and (c) hallucination.
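The abstract describes the SSDM's two-stream construction but not its exact implementation, which is given in the paper and accompanying code. The following is a minimal illustrative sketch under assumed details: the vocabularies `GENERIC_TOKENS` and `FACTS`, the helper functions `make_templates` and `sample_sequence`, and the way `diversity` restricts which templates a fact can appear in are all hypothetical choices made for this example, not the authors' construction.

```python
import random

# Hypothetical vocabularies (illustrative assumption, not the paper's exact setup).
GENERIC_TOKENS = [f"g{i}" for i in range(20)]        # statistical stream: generic tokens
FACTS = [(f"s{i}", f"t{i}") for i in range(50)]      # factual stream: source -> target pairs


def make_templates(num_templates: int, length: int, seed: int = 0):
    """Partition the statistical stream into fixed token patterns ("templates"),
    each reserving two designated positions for a fact's source and target."""
    rng = random.Random(seed)
    templates = []
    for _ in range(num_templates):
        pattern = [rng.choice(GENERIC_TOKENS) for _ in range(length)]
        src_pos, tgt_pos = sorted(rng.sample(range(length), 2))
        templates.append((pattern, src_pos, tgt_pos))
    return templates


def sample_sequence(templates, facts, diversity: float, rng: random.Random):
    """Compose one training sequence by inserting a fact into a template.
    `diversity` controls how many distinct templates each fact can occur in
    during training (contextual diversity); low values tie each fact to a
    small, fact-specific subset of templates."""
    source, target = rng.choice(facts)
    k = max(1, int(diversity * len(templates)))
    start = int(source[1:]) % len(templates)         # deterministic fact-to-template assignment
    allowed = (templates * 2)[start:start + k]       # wrap around to get exactly k templates
    pattern, src_pos, tgt_pos = rng.choice(allowed)
    seq = list(pattern)
    seq[src_pos], seq[tgt_pos] = source, target
    return seq


rng = random.Random(1)
templates = make_templates(num_templates=8, length=10)
print(sample_sequence(templates, FACTS, diversity=0.25, rng=rng))
```

With this kind of generator, in-distribution evaluation would reuse the training (fact, template) pairings, while out-of-distribution evaluation would present known facts inside templates they never co-occurred with during training, separating factual recall from statistical and structural generalization.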
Code: zip
Submission Number: 68