Learning curves theory for hierarchically compositional data with power-law distributed features

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · License: CC BY 4.0
TL;DR: How do hierarchical structure and Zipf-distributed features affect learning curves? In classification, scaling depends on feature frequency; in next-token prediction, it’s governed solely by the data’s hierarchical structure.
Abstract: Recent theories suggest that Neural Scaling Laws arise whenever the task can be linearly decomposed into units that are power-law distributed. Alternatively, scaling laws emerge when data exhibit a hierarchically compositional structure, as is thought to occur in language and images. To unify these views, we consider classification and next-token prediction tasks based on probabilistic context-free grammars, i.e., probabilistic models that generate data via a hierarchy of production rules. For classification, we show that power-law distributed production rules yield a power-law learning curve whose exponent depends on the rules' distribution, with a large multiplicative constant that depends on the hierarchical structure. By contrast, for next-token prediction, the distribution of production rules controls the fine details of the learning curve, but not the exponent describing the large-scale behaviour.
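Code sketch (illustrative, not from the paper): to make the data model concrete, the snippet below samples from a toy hierarchical grammar in which each symbol expands into s lower-level symbols by picking one of its m production rules with Zipf-distributed probabilities, repeated over L levels. The parameter names (v, m, s, L, a) and the sampling routine are assumptions chosen for illustration and do not reproduce the implementation in the linked repository.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumed, not the paper's exact settings):
# v - vocabulary size at every level of the hierarchy
# m - number of production rules available to each symbol
# s - branching factor (each symbol expands into s lower-level symbols)
# L - number of expansion levels between the root and the leaves
# a - Zipf exponent of the distribution over a symbol's m rules
v, m, s, L, a = 8, 4, 2, 3, 1.5

# Zipf-like (power-law) probabilities over the m alternative rules of a symbol.
rule_probs = 1.0 / np.arange(1, m + 1) ** a
rule_probs /= rule_probs.sum()

# For every level and symbol, draw m random production rules, each mapping
# the symbol to a string of s symbols one level below.
rules = {
    (level, symbol): rng.integers(0, v, size=(m, s))
    for level in range(L)
    for symbol in range(v)
}

def expand(symbol: int, level: int) -> list[int]:
    """Recursively expand `symbol`, sampling one rule per step until the leaf level."""
    if level == L:
        return [symbol]
    choice = rng.choice(m, p=rule_probs)  # frequent rules are chosen more often
    children = rules[(level, symbol)][choice]
    return [tok for child in children for tok in expand(int(child), level + 1)]

# One sample: the root symbol plays the role of the class label in the
# classification task, and the leaf string is the observable input.
root = int(rng.integers(0, v))
leaves = expand(root, 0)
print("class label (root symbol):", root)
print(f"observable string of length s**L = {len(leaves)}:", leaves)
```

In this toy picture, classification would correspond to predicting the root symbol from the leaf string, and next-token prediction to predicting the last leaf from the preceding ones; the learning-curve question is how many such samples a network needs as a function of the rules' distribution and the depth of the hierarchy.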
Lay Summary: The performance of neural networks often improves predictably as they are trained on more data, typically following a power-law pattern. This phenomenon, known as neural scaling, has played a key role in recent advances in artificial intelligence, yet its underlying cause remains poorly understood. One line of thought attributes neural scaling to the uneven frequency of features in data. For example, in language, some words are far more common than others. Since frequent words appear more often during training, they are learned faster, leading to performance that scales with the distribution of feature frequencies. Another view points to the hierarchical nature of many real-world data sources, such as the nested grammatical structure of sentences or the compositional layout of images. According to this view, neural networks improve as they progressively learn to reconstruct deeper levels of this hidden structure. In this paper, we bring these two ideas together using a simple, controlled model of data that mimics both the hierarchical organisation and the broad (Zipf-like) distribution of feature frequencies: our model generates data through a hierarchy of probabilistic rules, some common, some rare. Our key finding is that, for language modelling tasks such as next-word prediction, it is the hierarchical structure, not the frequency of individual elements, that governs how learning scales with data. This suggests that the remarkable scaling behaviour observed in large language models may originate not from surface-level statistics like word frequency, but from their ability to uncover and exploit the deep structure of language.
Link To Code: https://github.com/fracagnetta/random-hierarchy-model
Primary Area: Deep Learning->Theory
Keywords: Science of Deep Learning, Scaling laws, hierarchical compositionality, probabilistic graphical models
Submission Number: 10725