Keywords: scaling laws, feature learning, hierarchical compositionality
Abstract: We study how two fundamental properties of natural data—hierarchical compositionality and Zipf-distributed features—affect the scaling of test performance with the number of training examples. Using synthetic datasets generated by probabilistic context-free grammars, we derive learning curves for classification and next-token prediction tasks in the data-limited regime. For classification, we show that introducing a Zipf distribution over production rules leads to a power-law learning curve with an exponent controlled by the Zipf distribution. By contrast, in next-token prediction, the exponent is determined by the hierarchical structure alone and is unaffected by Zipf statistics. These results are supported empirically by experiments with convolutional and transformer models, and highlight how different aspects of the data structure shape neural scaling laws.
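Illustrative sketch (not from the submission): the abstract describes synthetic data drawn from a probabilistic context-free grammar whose production rules are sampled with Zipf-distributed frequencies, with the root symbol serving as the class label. The Python snippet below shows one possible version of that generative process under assumed parameters (`n_classes`, `n_symbols`, `branching`, `depth`, `n_rules`, `zipf_a`); the authors' actual generator may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters (not taken from the paper).
n_classes = 4   # root symbols, used as class labels
n_symbols = 8   # symbols available at intermediate levels
branching = 2   # each symbol expands into `branching` children
depth = 3       # number of expansion levels in the hierarchy
n_rules = 5     # candidate production rules per symbol
zipf_a = 1.5    # Zipf exponent over production-rule frequencies

def zipf_probs(k, a):
    """Normalized Zipf weights p(r) proportional to r^(-a) for ranks r = 1..k."""
    w = np.arange(1, k + 1, dtype=float) ** (-a)
    return w / w.sum()

# For every (level, symbol) pair, fix a table of candidate productions;
# each production rewrites the symbol as `branching` lower-level symbols.
rules = {
    (level, s): rng.integers(0, n_symbols, size=(n_rules, branching))
    for level in range(depth) for s in range(n_symbols)
}
rule_probs = zipf_probs(n_rules, zipf_a)

def expand(symbol, level):
    """Recursively expand a symbol down to the leaf level."""
    if level == depth:
        return [symbol]
    # Zipf-distributed choice among the symbol's production rules.
    r = rng.choice(n_rules, p=rule_probs)
    leaves = []
    for child in rules[(level, symbol)][r]:
        leaves.extend(expand(child, level + 1))
    return leaves

def sample_example():
    """One classification example: label = root symbol, input = leaf string."""
    label = int(rng.integers(0, n_classes))
    return np.array(expand(label, 0)), label

x, y = sample_example()
print("leaves:", x, "label:", y)  # leaf string of length branching**depth
```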
Student Paper: No
Submission Number: 106