Keywords: model merging, multitask, efficient neural network, efficient vision model
TL;DR: We present StatsMerging, the first learning-based merging method guided by weight distribution statistics, avoiding costly ground-truth labels by distilling knowledge from pre-trained task-specific models.
Abstract: As large models are increasingly deployed across various tasks, the limited GPU memory available for storing task-specific models presents a growing bottleneck. Model merging has emerged as a promising solution to accommodate multiple large models within constrained memory budgets. While traditional multi-task learning methods attempt to merge shared layers, they require labor-intensive annotated labels and incur significant computational overhead. Recent merging techniques aim to address this issue by combining models at inference time; however, these approaches often rely on simplistic heuristics, ignore weight distribution characteristics, assume identical architectures, or require access to test samples to infer merging coefficients, thereby limiting their generalization capability and scalability. We present **StatsMerging**, a novel lightweight learning-based model merging method guided by weight distribution statistics that requires neither ground-truth training labels nor test samples. StatsMerging offers three key advantages: (1) It uniquely leverages **singular values** from singular value decomposition (SVD) to capture task-specific weight distributions, serving as a proxy for task importance to guide task coefficient learning; (2) It employs a lightweight learner, **StatsMergeLearner**, to model the weight distributions of task-specific pre-trained models, improving generalization and enhancing adaptation to unseen samples; (3) It introduces **Task-Specific Teacher Distillation**, a merging training paradigm that avoids costly ground-truth labels by distilling knowledge from task-specific teacher models. Notably, we present two types of knowledge distillation: (a) distilling knowledge from task-specific models to train StatsMergeLearner; and (b) for the first time, distilling knowledge from models with different architectures prior to merging, following a distill-then-merge paradigm. Extensive experiments across eight tasks demonstrate the effectiveness of StatsMerging. Our results show that StatsMerging outperforms state-of-the-art techniques in terms of overall accuracy, generalization to unseen tasks, and robustness to image quality variations.
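To make the singular-value idea concrete, the following is a minimal illustrative sketch, not the paper's implementation: it treats each task-specific weight matrix's singular-value energy as a crude importance score and normalizes those scores into merging coefficients. In StatsMerging these coefficients are instead produced by the learned StatsMergeLearner; the function name `merge_by_svd_stats` and the energy heuristic are assumptions for illustration only.

```python
# Hypothetical sketch: singular values of task-specific weights as a proxy for
# task importance when computing merging coefficients. Not the paper's method.
import torch

def merge_by_svd_stats(task_weights):
    """Merge a list of same-shaped task-specific weight matrices into one.

    Each matrix's total singular-value energy serves as a simple importance
    score; scores are softmax-normalized into per-task merging coefficients.
    """
    scores = []
    for W in task_weights:
        s = torch.linalg.svdvals(W)      # singular values of the task weight matrix
        scores.append(s.sum())           # spectral energy as an importance proxy
    coeffs = torch.softmax(torch.stack(scores), dim=0)
    merged = sum(c * W for c, W in zip(coeffs, task_weights))
    return merged, coeffs

# Usage example: merge two 256x256 task-specific layers
W_a, W_b = torch.randn(256, 256), torch.randn(256, 256)
merged, coeffs = merge_by_svd_stats([W_a, W_b])
print(coeffs)  # a learned predictor would replace this fixed heuristic
```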
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 640