Keywords: loss landscape, mode connectivity, model merging, optimization, implicit bias
Abstract: Model merging methods combine models with different capabilities into a single one while maintaining the same inference cost. Two popular approaches are linear interpolation, which interpolates directly between model weights, and task arithmetic, which combines task vectors obtained as the difference between fine-tuned and base model weights. While useful in practice, the properties that make merging effective are poorly understood. This paper explores how the optimization process shapes the loss landscape geometry and how that geometry affects merging success. We show that a single quantity, the effective noise scale, unifies the impact of optimizer and data choices on model merging. Across architectures and datasets, merging success is a non-monotonic function of the effective noise, with a distinct optimum. Decomposing this quantity, we find that larger learning rates, stronger weight decay, smaller batch sizes, and data augmentation all independently modulate the effective noise scale and exhibit the same qualitative trend. Unlike prior work that connects optimizer noise to the flatness or generalization of individual minima, we show that it also affects the global loss landscape, predicting when independently trained solutions can be merged. Our findings broaden the understanding of how optimization shapes the loss landscape geometry and its downstream consequences for model merging, suggesting the possibility of further manipulating the training dynamics to improve mergeability.
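The two merging schemes named in the abstract can be summarized in a few lines. Below is a minimal sketch, assuming models are represented as dictionaries mapping parameter names to NumPy arrays; the function names, the `alpha` and `scale` parameters, and the toy weights are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch of linear interpolation and task arithmetic for model merging,
# assuming each model is a dict of parameter-name -> NumPy array.
import numpy as np

def linear_interpolation(theta_a, theta_b, alpha=0.5):
    """Weight-space interpolation: (1 - alpha) * theta_a + alpha * theta_b."""
    return {k: (1 - alpha) * theta_a[k] + alpha * theta_b[k] for k in theta_a}

def task_arithmetic(theta_base, finetuned_models, scale=1.0):
    """Add scaled task vectors (fine-tuned minus base weights) to the base model."""
    merged = {k: v.copy() for k, v in theta_base.items()}
    for theta_ft in finetuned_models:
        for k in merged:
            merged[k] += scale * (theta_ft[k] - theta_base[k])
    return merged

# Toy usage with random "weights" standing in for trained parameters.
rng = np.random.default_rng(0)
base = {"w": rng.standard_normal(4)}
ft1 = {"w": base["w"] + 0.1 * rng.standard_normal(4)}
ft2 = {"w": base["w"] + 0.1 * rng.standard_normal(4)}
print(linear_interpolation(ft1, ft2, alpha=0.5))
print(task_arithmetic(base, [ft1, ft2], scale=0.8))
```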
Primary Area: optimization
Submission Number: 11020