Keywords: language models, memorization, generalization
TL;DR: We investigate the interplay between learning by rote (memorization) and learning with understanding in large language models using formal grammars.
Abstract: Understanding whether, and to what extent, token sequences generated by large language models (LLMs) result from regurgitating memorized training data or from meaningful learning of the training data's syntax and semantics has many important implications.
In order to cleanly measure and disentangle token recollection by rote (memorization) from generation with understanding, we create an experimental framework based on training LLMs over *sequences generated using formal grammars*. Our framework allows us to better understand the interplay between the two types of learning, namely *by rote* vs. *with understanding*. Using our framework, we make several striking observations that hold consistently across different open-source model families (Pythia, Llama, and Mistral): (a) the two learning types are at odds with each other during training, i.e., rote learning harms understanding and, by developing understanding, models forget previously memorized sequences; (b) the *entropy of the training datasets* impacts the ease of learning, with lower-entropy datasets being easier to learn with understanding and higher-entropy datasets being easier to learn by rote; (c) it is difficult to determine which type of learning a model relies on based solely on whether it can recollect a training data sequence. Our surprising results have significant downstream implications for the study and use of LLMs.
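The abstract does not specify the grammars or the entropy measure used, so the following is a minimal sketch under stated assumptions: a hypothetical toy probabilistic context-free grammar (PCFG) with made-up production probabilities, from which training sequences are sampled and the empirical Shannon entropy of the resulting dataset is estimated, the dataset property that finding (b) links to ease of learning.

```python
import math
import random
from collections import Counter

# Hypothetical toy PCFG (not the paper's actual grammars): each nonterminal
# maps to a list of (production, probability) pairs over its expansions.
PCFG = {
    "S": [(["A", "B"], 0.7), (["B", "A"], 0.3)],
    "A": [(["a"], 0.6), (["a", "A"], 0.4)],  # right-recursive: a, aa, aaa, ...
    "B": [(["b"], 0.5), (["b", "B"], 0.5)],
}

def sample(symbol="S"):
    """Expand a symbol into a terminal string by sampling productions."""
    if symbol not in PCFG:  # terminal symbol
        return symbol
    rules, probs = zip(*PCFG[symbol])
    rule = random.choices(rules, weights=probs, k=1)[0]
    return "".join(sample(s) for s in rule)

def empirical_entropy(samples):
    """Shannon entropy (in bits) of the empirical sequence distribution."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

if __name__ == "__main__":
    dataset = [sample() for _ in range(10_000)]
    print(f"distinct sequences: {len(set(dataset))}")
    print(f"empirical entropy:  {empirical_entropy(dataset):.2f} bits")
```

Skewing the production probabilities toward a single rule concentrates probability mass on fewer sequences and lowers the entropy estimate; under the abstract's finding (b), such a dataset would be easier to learn with understanding, while a near-uniform grammar would favor learning by rote.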
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11023