Abstract: Model merging aims to combine multiple expert models into a more capable single model, offering benefits such as reduced storage and serving costs, improved generalization, and support for decentralized model development. Despite its promise, previous studies have primarily focused on merging a few small models. This leaves many unanswered questions about the effect of scaling model size and how it interplays with other key factors—like the base model quality and the number of expert models—to affect the merged model’s performance. This work systematically evaluates the utility of model merging at scale for transformer-based models to examine the impact of these different factors. We experiment with merging fully fine-tuned models using four popular merging methods—Averaging, Task Arithmetic, DARE-TIES, and TIES-Merging—across model sizes ranging from 1B to 64B parameters and merging up to 8 different expert models. We evaluate the merged models on both held-in tasks, i.e., the experts’ training tasks, and zero-shot generalization to unseen held-out tasks. Our wide range of experiments provides several new insights about merging transformer-based language models at scale and the interplay between different factors. First, we find that merging is more effective when experts are created from strong base models, i.e., models with good zero-shot performance, compared to pre-trained ones. Second, larger models perform better when merged. Third, merging consistently improves generalization capabilities. Notably, when merging eight large expert models, the merged models often generalize better than the multitask-trained models. Fourth, we can merge more expert models effectively when working with larger models. Fifth, different merging methods behave very similarly at larger scales. Overall, our findings shed light on some interesting properties of model merging while also highlighting some limitations.
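To make the merging methods named above concrete, here is a minimal sketch (not the paper's implementation) of two of them, parameter Averaging and Task Arithmetic, operating on plain state dictionaries; the names `base`, `experts`, and the scaling factor `alpha` are illustrative assumptions, not details from the paper.

```python
import torch

def average_merge(experts):
    """Simple parameter averaging: mean of each parameter across expert state dicts."""
    return {
        name: torch.stack([e[name] for e in experts]).mean(dim=0)
        for name in experts[0]
    }

def task_arithmetic_merge(base, experts, alpha=1.0):
    """Task Arithmetic: sum the task vectors (expert - base) and add them back to the base."""
    merged = {}
    for name, base_param in base.items():
        task_vector = sum(e[name] - base_param for e in experts)
        merged[name] = base_param + alpha * task_vector
    return merged
```

TIES-Merging and DARE-TIES build on the same task-vector view but additionally trim, sign-align, or randomly drop and rescale the task-vector entries before combining them.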
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=HW26XyHp3P
Changes Since Last Submission: **The previous submission was rejected because, due to unforeseen personal circumstances, we were delayed in submitting our official rebuttal, even though the rebuttal addressed all the concerns.** We respectfully note the Area Chair's comment acknowledging that these late updates "may strengthen the paper", while stating they had to be evaluated in a new submission because the rebuttal was late and they wanted to ensure fairness in the process.
This revised manuscript formally incorporates those improvements and fully addresses all feedback received. We hope that by formally integrating these changes now, we have satisfied the previous concerns and made the paper suitable for publication in TMLR. The primary concerns previously raised revolved around the limited scope of experimental evidence supporting the generality of our claims, the depth of analysis explaining the observed phenomena, and the quantification of uncertainty. We believe the current version substantially strengthens the paper by directly addressing these points:
1. **Expanded Empirical Evidence and Model Diversity:**
* **Concern:** The central criticism was that our findings were derived solely from the PaLM-2 model family, limiting the generalizability of the conclusions.
* **Action:** We have conducted extensive new experiments using the **Llama-2 model family (7B, 13B, and 70B parameters)**, merging checkpoints fine-tuned on distinct tasks (WizardMath and CodeLlama) using the same four merging techniques evaluated previously. These results, which show consistent trends with our PaLM-2 findings (e.g., regarding the benefits of scale, the convergence of methods, and performance relative to experts), are now fully integrated into the manuscript (Section 4.6). This significantly broadens the empirical basis of our claims beyond a single model family.
* **Action:** We have carefully revised the text throughout the paper (Abstract, Introduction, Experiments, Conclusion) to more precisely define the scope of our claims, focusing on transformer-based language models, while now drawing evidence from both PaLM-2 and Llama-2 architectures.
2. **Quantification of Uncertainty:**
* **Concern:** The lack of error bars or statistical indicators made it difficult to assess the significance of observed differences.
    * **Action:** We have added **standard deviations** to all relevant results presented in the tables (Appendix C) across different experimental conditions (model sizes, number of experts, merging methods). This allows for a quantitative assessment of the variability and significance of our findings. We did not add these standard deviations to the main plots because they become cluttered and the main point is lost.
3. **Clarifications and Presentation Improvements:**
* **Concern:** Various points regarding clarity of methodology, terminology, and presentation were raised.
* **Action:** We have incorporated numerous clarifications based on reviewer feedback:
* The normalization procedure for performance metrics is now more precisely defined (Section 3).
* Vague terminology like "easier" merging has been replaced with precise descriptions (e.g., "performs better," linked to normalized accuracy).
* Experimental details, such as the setup for "trials" when merging different numbers of experts, are further elaborated (Section 4.1).
* The calculation and interpretation of the multitask baseline results, especially regarding normalized scores, have been clarified (Section 3, Appendix).
* Potentially ambiguous terms like "model noise" have been defined in context (Section 4.3).
* Minor writing and formatting suggestions (e.g., section titles, consistent formatting) have been addressed.
4. **Deepened Analysis and Explanations:**
* **Concern:** Reviewers noted that the paper reported correlations (e.g., larger models merge "better") but offered limited insight into the underlying reasons.
* **Action:** We have expanded our discussion sections (especially Section 5: Discussion and Conclusion) to provide deeper intuition and potential explanations for key findings. This includes leveraging concepts like **Linear Mode Connectivity (LMC)** and relating our observations to prior work (e.g., WISE-FT, TIES, DARE) to discuss *why* factors like model scale and instruction tuning influence merging effectiveness, and why merging can sometimes outperform multitask learning.
We are grateful for the detailed feedback received during the previous review cycle. We believe these comprehensive revisions directly address the core concerns raised, significantly strengthening the empirical support for our conclusions within the defined scope and enhancing the overall clarity and depth of the manuscript.
Assigned Action Editor: ~Yen-Chang_Hsu1
Submission Number: 4721