Scaling Law for Code: A More Data-Hungry Regime

19 Sept 2025 (modified: 06 Oct 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Scaling Law, Code LLMs, Pretrain
Abstract: The training of large language models (LLMs) for code generation incurs substantial computational costs, yet resource allocation strategies are often guided by scaling laws derived from natural language (NL). Given the distinct statistical properties of code, it is unclear whether these heuristics are optimal. This paper presents the first large-scale, systematic investigation into the scaling laws of LLMs trained specifically on source code. We conduct 117 training runs, spanning model sizes from 0.2B to 3.8B parameters and dataset sizes from 2B to 128B tokens, to derive a precise scaling law for code. Our findings show that while code models adhere to the existing Farseer scaling-law paradigm, they operate in a fundamentally different, "more data-hungry" regime: the compute-optimal data-to-parameter (D/N) ratio for code is significantly higher than for NL and accelerates with the compute budget. This suggests that the primary bottleneck for state-of-the-art code models is data availability, not diminishing returns from model size. Furthermore, through two additional sets of 117 experiments on code-NL mixtures, we find that while adding NL data can benefit smaller models in low-data scenarios, pure in-domain data is superior for larger-scale, compute-optimal training. Our results provide a more refined, practical guide for the compute-optimal pre-training of LLMs for code.
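
To illustrate how a compute-optimal D/N ratio falls out of a fitted scaling law, here is a minimal sketch. It assumes a generic Chinchilla-style parametric form L(N, D) = E + A/N^alpha + B/D^beta and the common C ≈ 6·N·D FLOP approximation; the paper itself fits the Farseer law, whose exact form and coefficients are not reproduced here, and all numbers below are illustrative placeholders rather than the paper's fitted values.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Assumed Chinchilla-style form L(N, D) = E + A/N**alpha + B/D**beta.
# Coefficients are placeholders for illustration only, not fitted values
# from the paper's 117 runs.
E, A, alpha, B, beta = 1.6, 420.0, 0.33, 1100.0, 0.29

def loss(N, D):
    """Predicted pre-training loss for N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

def compute_optimal_allocation(C, flops_per_param_token=6.0):
    """For a FLOP budget C ≈ 6*N*D, find the N minimizing the fitted loss
    and return (N_opt, D_opt, D_opt / N_opt)."""
    def objective(logN):
        N = np.exp(logN)
        D = C / (flops_per_param_token * N)
        return loss(N, D)
    res = minimize_scalar(objective, bounds=(np.log(1e7), np.log(1e13)),
                          method="bounded")
    N_opt = float(np.exp(res.x))
    D_opt = C / (flops_per_param_token * N_opt)
    return N_opt, D_opt, D_opt / N_opt

# How the optimal D/N ratio shifts as the compute budget grows.
for C in (1e20, 1e21, 1e22):
    N_opt, D_opt, ratio = compute_optimal_allocation(C)
    print(f"C={C:.0e}: N*={N_opt:.2e}, D*={D_opt:.2e}, D/N ~= {ratio:.0f}")
```

Under this assumed form, repeating the fit on code-only runs versus NL runs would show up as different fitted exponents and coefficients, and hence a different (here, higher) compute-optimal D/N curve for code.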
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16370