Abstract: Model merging has emerged as a practical and cost-effective approach for combining multiple pretrained models into a single model that inherits their capabilities and often achieves improved performance. Its growing popularity has led to the rapid development of numerous merging techniques. However, these methods are typically evaluated in disparate experimental settings and make differing assumptions about model architecture, data availability, and computational budget, making direct comparison difficult. In this work, we systematically characterize the relative strengths and limitations of existing merging methods by evaluating them within a unified experimental framework. Our study focuses on compositional generalization, i.e., whether merging can successfully combine distinct skills to generalize to new settings. We also analyze the computational cost of each method and examine how performance scales as the number of merged models increases. Overall, we evaluate eight merging methods in a novel benchmark spanning three distinct cross-modal settings, resulting in 12,000 unique merge configurations. Our findings reveal the absence of a one-size-fits-all merging strategy and serve both as an outline for the holistic evaluation of future merging methods and as a cookbook for practitioners using model merging.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yaoyao_Liu1
Submission Number: 8017