Keywords: loss landscape, mode connectivity, model merging, optimization, implicit bias
Abstract: Model merging combines independent solutions with different capabilities into a single one while maintaining the same inference cost.
Two popular approaches are _linear interpolation_, which averages the weights of multiple models, and _task arithmetic_, which combines task vectors, each obtained as the difference between a finetuned model's weights and the base model's weights.
Although merging is useful in practice, the properties that make it effective are poorly understood. This paper explores how optimization dynamics shape the geometry of the loss landscape and, in turn, merging success.
We show that a single quantity -- the _effective noise scale_ -- unifies the impact of different optimizer components on model merging. Across architectures and datasets, merging success is a non-monotonic function of the effective noise scale, with a distinct optimum. Decomposing this quantity, we find that larger learning rates, stronger weight decay, smaller batch sizes, and data augmentation all independently modulate the effective noise scale and exhibit the same qualitative trend.
Unlike prior work connecting optimizer noise to the flatness or generalization of _individual_ minima, we show that it also affects the _global_ loss landscape, predicting when independently trained solutions can be successfully merged. Our findings broaden the understanding of how optimization shapes the loss landscape geometry and suggest that training dynamics could be further manipulated to improve model merging.
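The two merging operations named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the representation of a model as a dictionary of flat parameter lists, the function names, and the `scale` coefficient on task vectors are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def linear_interpolation(models: List[Weights]) -> Weights:
    """Average the weights of several models, parameter by parameter."""
    n = len(models)
    return {
        name: [sum(vals) / n for vals in zip(*(m[name] for m in models))]
        for name in models[0]
    }


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Task vector: finetuned weights minus base weights."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def task_arithmetic(
    base: Weights, finetuned_models: List[Weights], scale: float = 1.0
) -> Weights:
    """Add the (scaled) task vector of every finetuned model to the base.

    `scale` is an assumed hyperparameter, not taken from the abstract.
    """
    merged = {name: list(vals) for name, vals in base.items()}
    for finetuned in finetuned_models:
        tv = task_vector(finetuned, base)
        for name in merged:
            merged[name] = [w + scale * d for w, d in zip(merged[name], tv[name])]
    return merged


# Toy example with a single two-element parameter.
base = {"w": [0.0, 1.0]}
model_a = {"w": [1.0, 1.0]}
model_b = {"w": [0.0, 3.0]}

averaged = linear_interpolation([model_a, model_b])
combined = task_arithmetic(base, [model_a, model_b])
```

In this toy case the average lies halfway between the two finetuned models, whereas task arithmetic adds both displacements from the base, so the two operations generally yield different merged weights.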
Primary Area: optimization
Submission Number: 11020