Keywords: scaling laws, feature learning, hierarchical compositionality
Abstract: We study how two fundamental properties of natural data—hierarchical compositionality and Zipf-distributed features—affect the scaling of test performance with the number of training examples. Using synthetic datasets generated by probabilistic context-free grammars, we derive learning curves for classification and next-token prediction tasks in the data-limited regime. For classification, we show that introducing a Zipf distribution over production rules leads to a power-law learning curve with an exponent controlled by the Zipf distribution. By contrast, in next-token prediction, the exponent is determined by the hierarchical structure alone and is unaffected by Zipf statistics. These results are supported empirically by experiments with convolutional and transformer models, and highlight how different aspects of the data structure shape neural scaling laws.
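Illustrative sketch (not from the submission): the abstract describes synthetic data drawn from a probabilistic context-free grammar whose production rules are sampled with Zipf-distributed frequencies, with the root symbol serving as the class label. The Python snippet below shows one possible version of that generative process under assumed parameters (`n_classes`, `n_symbols`, `branching`, `depth`, `n_rules`, `zipf_a`); the authors' actual generator may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters (not taken from the paper).
n_classes = 4   # root symbols, used as class labels
n_symbols = 8   # symbols available at intermediate levels
branching = 2   # each symbol expands into `branching` children
depth = 3       # number of expansion levels in the hierarchy
n_rules = 5     # candidate production rules per symbol
zipf_a = 1.5    # Zipf exponent over production-rule frequencies

def zipf_probs(k, a):
    """Normalized Zipf weights p(r) proportional to r^(-a) for ranks r = 1..k."""
    w = np.arange(1, k + 1, dtype=float) ** (-a)
    return w / w.sum()

# For every (level, symbol) pair, fix a table of candidate productions;
# each production rewrites the symbol as `branching` lower-level symbols.
rules = {
    (level, s): rng.integers(0, n_symbols, size=(n_rules, branching))
    for level in range(depth) for s in range(n_symbols)
}
rule_probs = zipf_probs(n_rules, zipf_a)

def expand(symbol, level):
    """Recursively expand a symbol down to the leaf level."""
    if level == depth:
        return [symbol]
    # Zipf-distributed choice among the symbol's production rules.
    r = rng.choice(n_rules, p=rule_probs)
    leaves = []
    for child in rules[(level, symbol)][r]:
        leaves.extend(expand(child, level + 1))
    return leaves

def sample_example():
    """One classification example: label = root symbol, input = leaf string."""
    label = int(rng.integers(0, n_classes))
    return np.array(expand(label, 0)), label

x, y = sample_example()
print("leaves:", x, "label:", y)  # leaf string of length branching**depth
```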
Student Paper: No
Submission Number: 106