Scalable Model Merging with Progressive Layer-wise Distillation

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose an effective and scalable model merging algorithm based on progressive layer distillation.
Abstract: Model merging offers an effective way to integrate the capabilities of multiple fine-tuned models. However, the performance degradation of the merged model remains a challenge, particularly when little or no data is available. This paper first highlights the necessity of domain-specific data for model merging by proving that data-agnostic algorithms can have arbitrarily bad worst-case performance. Building on this theoretical insight, we explore the relationship between model merging and distillation, introducing a novel few-shot merging algorithm, ProDistill (Progressive Layer-wise Distillation). Contrary to the common belief that layer-wise training hurts performance, we show that layer-wise teacher-student distillation not only enhances scalability but also improves model merging performance. We conduct extensive experiments showing that, compared to existing few-shot merging methods, ProDistill achieves state-of-the-art performance, with up to 6.14\% and 6.61\% improvements on vision and NLU tasks, respectively. Furthermore, we extend the experiments to models with over 10B parameters, showcasing the exceptional scalability of ProDistill.
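To make the high-level description above concrete, here is a minimal PyTorch-style sketch of what layer-wise teacher-student distillation for merging could look like. It is an illustrative assumption, not the authors' implementation (see the linked repository for the real code): the function name `merge_layerwise`, the sequential-layer model structure, the MSE objective, and the optimizer settings are all hypothetical simplifications. The merged model is built layer by layer, with each merged layer trained on a few domain examples to reproduce the outputs of the corresponding fine-tuned teacher layers before the hidden states are advanced to the next layer.

```python
# Minimal sketch of progressive layer-wise distillation for model merging.
# NOT the authors' reference implementation; layer structure, loss, and
# data handling are simplified assumptions for illustration only.
import copy
import torch
import torch.nn as nn


def merge_layerwise(base_layers, teacher_layers_list, few_shot_inputs,
                    steps=100, lr=1e-3):
    """Merge fine-tuned models layer by layer via few-shot distillation.

    base_layers:         list of nn.Module, layers of the pretrained model
                         (used to initialize the merged model).
    teacher_layers_list: list over tasks; each entry is the list of layers
                         of one fine-tuned (teacher) model.
    few_shot_inputs:     a small batch of domain examples fed to layer 0.
    """
    merged_layers = [copy.deepcopy(layer) for layer in base_layers]

    # Hidden states flowing through the merged model and each teacher.
    merged_h = few_shot_inputs.clone()
    teacher_hs = [few_shot_inputs.clone() for _ in teacher_layers_list]

    for i, merged_layer in enumerate(merged_layers):
        # Teacher targets: each teacher processes its own hidden state.
        with torch.no_grad():
            targets = [teacher[i](h)
                       for teacher, h in zip(teacher_layers_list, teacher_hs)]

        # Distill: train this merged layer so its output on the merged
        # hidden state matches every teacher's output.
        opt = torch.optim.Adam(merged_layer.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            out = merged_layer(merged_h)
            loss = sum(nn.functional.mse_loss(out, t) for t in targets)
            loss.backward()
            opt.step()

        # Advance hidden states to the next layer (the progressive part).
        with torch.no_grad():
            merged_h = merged_layer(merged_h)
            teacher_hs = targets

    return nn.Sequential(*merged_layers)
```

Because each layer is optimized independently on cached hidden states, memory use stays roughly at the scale of a single layer rather than the full model, which is one plausible reading of why the approach scales to models beyond 10B parameters.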
Lay Summary: Combining the strengths of different AI models is a promising way to build more capable systems, but doing so often degrades performance, especially when little new data is available to guide the process. Our research shows that merging models without any data can perform arbitrarily badly, and that using at least a small amount of task-specific data is essential for good results. Based on this insight, we introduce a new method called ProDistill, which combines models by having one model teach another, layer by layer, much as a teacher might guide a student through a complex topic step by step. While many researchers believed this approach would hurt performance, we found the opposite: ProDistill not only makes merging more accurate, it also works well even for very large AI models. In tests across image and language tasks, it consistently outperformed other methods, making it a practical tool for building stronger AI systems with minimal data.
Link To Code: https://github.com/JingXuTHU/Scalable_Model_Merging_with_Progressive_Layerwise_Distillation
Primary Area: General Machine Learning->Transfer, Multitask and Meta-learning
Keywords: Model Merging, Task Vector, Distillation
Submission Number: 3718