Keywords: memorization, scaling laws, large language models
TL;DR: We show that when training LLMs on mixtures of web-scraped and knowledge-dense data, knowledge acquisition from the knowledge-dense datasets does not always follow a smooth scaling law but can exhibit phase transitions with respect to the mixing ratio and model size.
Abstract: Large Language Models (LLMs) are typically trained on data mixtures: most data come from web scrapes, while a small portion is curated from high-quality sources with dense domain-specific knowledge.
In this paper, we show that when training LLMs on such data mixtures, knowledge acquisition from knowledge-dense datasets does not always follow a smooth scaling law but can exhibit phase transitions with respect to the mixing ratio and model size. First, through controlled experiments on a synthetic biography dataset mixed with web-scraped data, we demonstrate that:
(1) as the model size increases past a critical value, the model suddenly transitions from memorizing very few of the biographies to memorizing most of them;
(2) below a critical mixing ratio, the model memorizes almost nothing even with extensive training, but beyond this threshold it rapidly memorizes more biographies.
We then adopt an information-theoretic perspective to understand and characterize the existence and value of the thresholds. Based on these insights, we identify two mitigation strategies that improve the efficiency of knowledge acquisition from knowledge-dense datasets, and validate their effectiveness on both synthetic and real-world Wikipedia datasets.
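To make the experimental setup concrete, below is a minimal sketch of how a data mixture at a given mixing ratio might be constructed for such a controlled experiment. The function and variable names (`build_mixture`, `mixing_ratio`, the stand-in biography and web records) are illustrative assumptions, not the paper's actual pipeline; model training and the memorization probe are omitted.

```python
import random

def build_mixture(bio_examples, web_examples, mixing_ratio, total_size, seed=0):
    """Sample a training set in which a fraction `mixing_ratio` of examples
    comes from the knowledge-dense (biography) data and the rest from web data.
    Illustrative only: the paper's actual mixing scheme may differ."""
    rng = random.Random(seed)
    n_bio = int(round(mixing_ratio * total_size))
    n_web = total_size - n_bio
    mixture = (
        [rng.choice(bio_examples) for _ in range(n_bio)]
        + [rng.choice(web_examples) for _ in range(n_web)]
    )
    rng.shuffle(mixture)
    return mixture

if __name__ == "__main__":
    # Stand-in records; in the paper these would be synthetic biographies
    # and web-scraped documents.
    bio = [f"bio_{i}" for i in range(1_000)]
    web = [f"web_{i}" for i in range(100_000)]
    # Hypothetical sweep over mixing ratios to probe for a memorization threshold.
    for ratio in [0.001, 0.01, 0.05, 0.1]:
        train_set = build_mixture(bio, web, ratio, total_size=50_000)
        # A model would be trained on `train_set`, then probed for recall of
        # each biography; here we only report the mixture composition.
        n_bio_in_mix = sum(x.startswith("bio_") for x in train_set)
        print(f"mixing ratio {ratio:.3f}: {n_bio_in_mix} biography examples")
```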
Submission Number: 82