Keywords: Scaling Laws, Model Merging, LLMs
Abstract: We study empirical scaling laws for language model merging measured by cross-entropy.
Despite its wide practical use, merging lacks a quantitative rule that predicts returns as we add experts or scale the model size.
We identify a compact power law that links model size and the number of experts: the size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns as experts are added.
The law holds in-domain and cross-domain, tightly fits measured curves across diverse architectures and methods (Average, TA, TIES, DARE), and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are included.
Building on this, we present a simple theory that explains why gains fall roughly as \(1/k\) and links the floor and tail to properties of the base model and the diversity across domains. This law enables \emph{predictive planning}: estimate how many experts are needed to reach a target loss, decide when to stop adding experts, and trade off scaling the base model against adding experts under a fixed budget, turning merging from a heuristic practice into a computationally efficient, plannable alternative to multitask training.
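As an illustrative sketch (the parameterization below, with symbols \(L_0\) and \(A\), is assumed for exposition and is not the paper's fitted form), a floor-plus-\(1/k\)-tail law of this kind can be written as
\[
\mathcal{L}(N, k) \;\approx\; L_0(N) + \frac{A(N)}{k},
\qquad
k^\star \;\approx\; \left\lceil \frac{A(N)}{\mathcal{L}_{\text{target}} - L_0(N)} \right\rceil
\quad \text{for } \mathcal{L}_{\text{target}} > L_0(N),
\]
where \(N\) is the base-model size, \(k\) the number of merged experts, \(L_0(N)\) the size-dependent floor, and \(A(N)\) the tail amplitude; solving for \(k^\star\) illustrates the "how many experts to reach a target loss" planning step.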
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18627