Towards a theory of how the structure of language is acquired by deep neural networks

Published: 25 Sept 2024, Last Modified: 06 Nov 2024NeurIPS 2024 posterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Hierarchical Models, Language Models, Learning Theory, Representation Learning, Self-Supervised Learning, Statistical Physics of Learning
TL;DR: Increasing the dataset size helps language models capture the latent hierarchical structure of data by leveraging token correlations.
Abstract: How much data is required to learn the structure of a language via next-token prediction? We study this question for synthetic datasets generated via a Probabilistic Context-Free Grammar (PCFG)---a hierarchical generative model that captures the tree-like structure of natural languages. We determine token-token correlations analytically in our model and show that they can be used to build a representation of the grammar's hidden variables, the longer the range the deeper the variable. In addition, a finite training set limits the resolution of correlations to an effective range, whose size grows with that of the training set. As a result, a Language Model trained with increasingly many examples can build a deeper representation of the grammar's structure, thus reaching good performance despite the high dimensionality of the problem. We conjecture that the relationship between training set size and effective range of correlations holds beyond our synthetic datasets, and we test it in a collection of lines from Shakespeare's plays. In particular, we show that reducing the input size leads to saturation of the test loss decay at a characteristic training set size that can be predicted in our framework.
Primary Area: Learning theory
Submission Number: 19625
Loading