Finding the Minimal Parameter Budget for Implicit Reasoning: A Data Complexity Driven Scaling Law for Language Models
TL;DR: This paper identifies a data-complexity-driven scaling law that predicts the minimal model size needed for implicit reasoning during pretraining.
Abstract: Reasoning is a core capability of language models (LMs), yet it remains unclear how much model capacity is necessary to support reasoning during pretraining. In this work, we study the minimal parameter budget required for implicit reasoning, defined as the ability to infer new facts from learned knowledge without explicit chain-of-thought supervision. To isolate this phenomenon, we pretrain LMs from scratch in a controlled synthetic environment that mimics the structure and distribution of real-world knowledge graphs, and evaluate their ability to complete missing edges via multi-hop inference.
From both a theoretical and an empirical perspective, we identify a scaling law linking this optimal parameter budget to a graph search entropy measure.
Across a wide range of model sizes, training steps, and graph complexities, we show that an optimally sized language model can reliably reason over at most approximately 0.008 bits of information per parameter.
These results characterize the minimal sufficient capacity for implicit reasoning during pretraining, provide principled guidance for matching model size to data complexity, and offer new insights into the scaling behavior of reasoning in large language models.
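As a rough illustration of the capacity figure in the abstract, the sketch below treats the reported value of approximately 0.008 bits per parameter as a linear rule of thumb relating a graph's search entropy (in bits) to a parameter count. The function names and the strictly linear reading are assumptions made for illustration; the abstract does not specify how the graph search entropy is computed, so it is taken here as a given input.

```python
# Approximate upper bound from the abstract: bits of information an
# optimally sized model can reason over, per parameter.
BITS_PER_PARAM = 0.008


def min_params_for_entropy(entropy_bits: float,
                           bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimate the smallest parameter count for a given graph search entropy.

    Hypothetical helper: assumes capacity scales linearly with parameters.
    """
    if entropy_bits < 0:
        raise ValueError("entropy_bits must be non-negative")
    return entropy_bits / bits_per_param


def max_entropy_for_params(n_params: float,
                           bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimate the largest graph search entropy a model of this size can cover.

    Hypothetical helper: the inverse of min_params_for_entropy.
    """
    if n_params < 0:
        raise ValueError("n_params must be non-negative")
    return n_params * bits_per_param


if __name__ == "__main__":
    # Under this reading, each bit of search entropy costs 1 / 0.008 = 125 parameters.
    print(min_params_for_entropy(8_000.0))
    print(max_entropy_for_params(1_000_000))
```

Under this reading, a graph with 8,000 bits of search entropy would call for roughly one million parameters. This is a back-of-the-envelope estimate only, not a substitute for the fitted scaling law.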
Lay Summary: Large language models can often make new connections from facts they have learned, but we still do not know how large a model needs to be for this kind of reasoning to appear. Simply making models bigger is expensive, and it is unclear whether bigger models are always necessary.
To study this question, we built simplified worlds made of connected facts, such as “A is related to B” and “B is related to C,” and trained models to learn from these facts. We then tested whether the models could infer missing connections they had never directly seen. By changing the size and complexity of these worlds, we measured how much model capacity was needed for successful reasoning.
We found that the best model size is not always the largest one. Instead, each world often has a smallest sufficient model size, and this size grows with how hard the world is to search through. This suggests that reasoning ability depends not just on model scale, but also on the structure of the data. Our results can help researchers choose model sizes more efficiently and better understand when reasoning emerges in language models.
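The kind of world described above can be pictured with a toy example: a handful of stated facts and the missing connections that follow from chaining two of them. The sketch below is illustrative only and is not the authors' data generator; the function name `two_hop_inferences` and the single, untyped "is related to" relation are assumptions.

```python
from collections import defaultdict


def two_hop_inferences(facts):
    """Return pairs (a, c) implied by a -> b -> c that are not stated facts.

    `facts` is an iterable of (head, tail) pairs for one generic relation.
    """
    facts = list(facts)
    successors = defaultdict(set)
    for head, tail in facts:
        successors[head].add(tail)

    stated = set(facts)
    inferred = set()
    for a, b in facts:
        # Chain the fact a -> b with every fact b -> c.
        for c in successors.get(b, ()):
            if a != c and (a, c) not in stated:
                inferred.add((a, c))
    return inferred


if __name__ == "__main__":
    # "A is related to B", "B is related to C", "C is related to D"
    world = [("A", "B"), ("B", "C"), ("C", "D")]
    # Connections never stated directly but reachable in two hops.
    print(sorted(two_hop_inferences(world)))
```

In this toy world, a model trained only on the three stated facts would be tested on the unstated pairs, such as whether A is related to C.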
Link To Code: https://github.com/WANGXinyiLinda/reasoning-scaling-law
Primary Area: Deep Learning->Theory
Keywords: science of language models, reasoning, scaling
Submission Number: 17934