Abstract: Recent methods aim to merge neural networks (NNs) with identical architectures trained on different tasks into a single multi-task model. While most works focus on the simpler setup of merging NNs initialized from a common pre-trained network, we target the harder problem of merging large transformers trained on different tasks from distinct initializations. We show that traditional merging methods fail catastrophically in this setup, while Knowledge Distillation (KD) achieves much better results, though at a higher cost. Moreover, KD is data-inefficient, as it does not exploit the original models' weights. To address this, we introduce "Foldable SuperNet Merge" (FS-Merge), which trains a SuperNet containing the original models (with frozen weights) using a feature reconstruction objective. After training, the SuperNet is folded back to the size of a single original model. FS-Merge is simple, data-efficient, has a computational cost comparable to KD, and is provably more expressive than traditional merging methods. It achieves SOTA results when tested on MLPs and transformers across various sizes, tasks, modalities, and distribution shifts, especially in low-data scenarios.
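To make the "train a SuperNet, then fold it" idea concrete, here is a minimal, hypothetical sketch for a single linear layer shared by two frozen models. The trainable `merge`/`unmerge` maps, the reconstruction loss, and all dimensions are illustrative assumptions, not the paper's actual FS-Merge architecture (which targets full transformers).

```python
# Hypothetical sketch: wrap two frozen linear layers in a small SuperNet,
# train merge/unmerge maps with a feature-reconstruction loss, then fold
# everything into a single layer of the original size.
import torch
import torch.nn as nn

d = 64                                     # hidden width of each original model
W1, W2 = nn.Linear(d, d), nn.Linear(d, d)  # frozen layers from the two models
for p in (*W1.parameters(), *W2.parameters()):
    p.requires_grad_(False)                # original weights stay frozen

# Trainable merge (2d -> d) and unmerge (d -> 2d) maps around the frozen pair.
merge = nn.Linear(2 * d, d, bias=False)
unmerge = nn.Linear(d, 2 * d, bias=False)
opt = torch.optim.Adam([*merge.parameters(), *unmerge.parameters()], lr=1e-3)

for _ in range(200):                       # feature-reconstruction training
    x = torch.randn(32, d)                 # stand-in for a small unlabeled batch
    feats = torch.cat([W1(x), W2(x)], dim=-1)          # both models' features
    loss = nn.functional.mse_loss(unmerge(merge(feats)), feats)
    opt.zero_grad(); loss.backward(); opt.step()

# Fold: collapse merge + frozen weights into one layer of the original size,
# so that folded(x) == merge(cat(W1(x), W2(x))) exactly.
with torch.no_grad():
    stacked_w = torch.cat([W1.weight, W2.weight], dim=0)   # (2d, d)
    stacked_b = torch.cat([W1.bias, W2.bias], dim=0)       # (2d,)
    folded = nn.Linear(d, d)
    folded.weight.copy_(merge.weight @ stacked_w)
    folded.bias.copy_(merge.weight @ stacked_b)
```

The key point the sketch illustrates is that the extra SuperNet capacity exists only during training; after the linear fold, the merged model has exactly the footprint of one original network.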
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We have refined our work and added new experiments, as requested by the reviewers. The changes are highlighted in blue.
Assigned Action Editor: ~Yannis_Kalantidis2
Submission Number: 4797