Generalization vs. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data

Published: 03 Jul 2024, Last Modified: 20 Jul 2024, ICML 2024 FM-Wild Workshop Poster, License: CC BY 4.0
Keywords: language modeling, emergence, generalization, memorization, pretraining, corpus, NLP
TL;DR: A large-scale study on how models leverage their pretraining data to perform downstream tasks.
Abstract: Despite the proven utility of large language models (LLMs) in real-world applications, it remains poorly understood how they leverage their large-scale pretraining text corpora to achieve these capabilities. In this work, we investigate the interplay between generalization and memorization in pretrained LLMs at scale, through a comprehensive n-gram analysis of their training data. Our experiments focus on three general task types: translation, question answering, and multiple-choice reasoning. Across open-source LLMs of various sizes and their pretraining corpora, we observe that as model size increases, task-relevant n-gram pair data becomes increasingly important, leading to improved task performance, decreased memorization, stronger generalization, and emergent abilities. Our results support the hypothesis that LLMs' capabilities emerge from a delicate balance of memorization and generalization given sufficient task-related pretraining data, and they point the way to larger-scale analyses that could further improve our understanding of these models.
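To make the "task-relevant n-gram pair" idea concrete, the sketch below counts how many pretraining documents contain at least one n-gram from a task input together with at least one n-gram from the corresponding output. This is only an illustrative proxy under simple whitespace tokenization, not the paper's exact pipeline; all function and variable names (`ngrams`, `count_ngram_pair_docs`, `corpus_docs`) are hypothetical.

```python
"""Illustrative sketch: estimate how much pretraining data co-mentions
n-grams from a task's input and output. Assumptions: whitespace
tokenization, lowercase matching; the paper's actual criterion may differ."""
from typing import Iterable, Iterator, List, Tuple


def ngrams(tokens: List[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield all contiguous n-grams from a token list."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def count_ngram_pair_docs(corpus_docs: Iterable[str],
                          task_input: str,
                          task_output: str,
                          n: int = 3) -> int:
    """Count corpus documents containing at least one n-gram from the task
    input AND at least one n-gram from the task output (a rough proxy for
    task-relevant n-gram pair support in the pretraining data)."""
    input_ngrams = set(ngrams(task_input.lower().split(), n))
    output_ngrams = set(ngrams(task_output.lower().split(), n))
    hits = 0
    for doc in corpus_docs:
        doc_ngrams = set(ngrams(doc.lower().split(), n))
        if doc_ngrams & input_ngrams and doc_ngrams & output_ngrams:
            hits += 1
    return hits


if __name__ == "__main__":
    # Toy corpus standing in for pretraining documents.
    corpus = [
        "translate the cat sat on the mat into french le chat est assis sur le tapis",
        "the dog slept on the rug all afternoon",
    ]
    support = count_ngram_pair_docs(
        corpus,
        task_input="the cat sat on the mat",
        task_output="le chat est assis sur le tapis",
        n=3,
    )
    print(f"documents with input/output n-gram co-occurrence: {support}")
```

In practice, counts like this would be aggregated over many task examples and compared across model and corpus scales, which is the kind of large-scale analysis the abstract describes.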
Submission Number: 91