Keywords: Scaling Laws, Model Merging, LLMs
Abstract: We study empirical scaling laws for language model merging measured by cross-entropy.
Despite its wide practical use, merging lacks a quantitative rule that predicts returns as we add experts or scale the model size.
We identify a compact power law that links model size and the number of experts: the size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns as experts are added.
The law holds in-domain and cross-domain, tightly fits measured curves across diverse architectures and methods (Average, TA, TIES, DARE), and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are included.
Building on this, we present a simple theory that explains why gains fall roughly as \(1/k\) and links the floor and tail to properties of the base model and the diversity across domains. This law enables \emph{predictive planning}: estimate how many experts are needed to reach a target loss, decide when to stop adding experts, and trade off scaling the base model against adding experts under a fixed budget, turning merging from a heuristic practice into a computationally efficient, plannable alternative to multitask training.
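As an illustrative sketch (the parameterization below, with symbols \(L_0\) and \(A\), is assumed for exposition and is not the paper's fitted form), a floor-plus-\(1/k\)-tail law of this kind can be written as
\[
\mathcal{L}(N, k) \;\approx\; L_0(N) + \frac{A(N)}{k},
\qquad
k^\star \;\approx\; \left\lceil \frac{A(N)}{\mathcal{L}_{\text{target}} - L_0(N)} \right\rceil
\quad \text{for } \mathcal{L}_{\text{target}} > L_0(N),
\]
where \(N\) is the base-model size, \(k\) the number of merged experts, \(L_0(N)\) the size-dependent floor, and \(A(N)\) the tail amplitude; solving for \(k^\star\) illustrates the "how many experts to reach a target loss" planning step.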
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18627