Abstract: The remarkable success of large language model pretraining and the discovery of empirical scaling laws signify a paradigm shift in machine learning. Notably, the primary objective has evolved from minimizing generalization error to reducing approximation error, and the most effective strategy has transitioned from regularization (in a broad sense) to scaling up models. This raises a critical question:
Do the established principles that proved successful in the generalization-centric era remain
valid in this new era of scaling?
This paper examines several influential regularization-based principles that may no longer
hold true in the scaling-centric, large language model (LLM) era. These principles include explicit L2 regularization and implicit regularization through small batch sizes and large learning rates. Additionally, we identify a new phenomenon termed “scaling law
crossover,” where two scaling curves intersect at a certain scale, implying that methods
effective at smaller scales may not generalize to larger ones. Together, these observations highlight two fundamental questions within this new paradigm:
• Guiding Principles for Scaling: If regularization is no longer the primary guiding principle for model design, what new principles are emerging to guide scaling?
• Model Comparison at Scale: How can we reliably and effectively compare models at a scale where only a single experiment is feasible?
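As a purely illustrative sketch of the crossover phenomenon, using hypothetical power-law fits rather than results from the paper: suppose two methods follow compute scaling curves
\[
L_1(C) = a_1 C^{-\alpha_1}, \qquad L_2(C) = a_2 C^{-\alpha_2}, \qquad a_1 < a_2, \; \alpha_1 < \alpha_2 .
\]
Method 1 attains lower loss at small compute, but the two curves intersect at
\[
C^{*} = \left( \frac{a_2}{a_1} \right)^{1/(\alpha_2 - \alpha_1)},
\]
beyond which method 2 is preferable; conclusions drawn only from runs with \(C < C^{*}\) would therefore not extrapolate to larger scales.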
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Bruno_Loureiro1
Submission Number: 5286