ATM: Improving Model Merging by Alternating Tuning and Merging

26 Sept 2024 (modified: 24 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: model merging, task arithmetic, multi-task learning, task vectors
TL;DR: A novel framework for iteratively updating a multi-task model without data exchange between tasks.
Abstract: Model merging has recently emerged as a cost-efficient paradigm for Multi-task Learning (MTL). Among merging solutions, Task Arithmetic \citep{task-vectors} stands out for its simplicity and effectiveness. In this paper, we start by motivating the effectiveness of task vectors through their relation to multi-task gradients. We show that in the single-epoch scenario, task vectors are exactly equivalent to the gradients obtained by performing gradient descent in a multi-task setting, and that they still approximate these gradients over further epochs. We strengthen this explanation by showing that task vectors perform best when this equality holds, and we motivate their effectiveness in the general case by showing that the gradient of the first epoch dominates the total update. Guided by this parallel, we propose viewing model merging as a single step in an iterative process that Alternates between Tuning and Merging (ATM). Acting as a midpoint between model merging and multi-task gradient descent, ATM obtains state-of-the-art results with the same data and compute requirements. We first evaluate our approach extensively across diverse settings, demonstrating state-of-the-art performance and outperforming the best baselines by up to 19\% accuracy in computer vision and 20\% in NLP. We then motivate its effectiveness empirically, showing increased orthogonality between task vectors, and theoretically, proving that it minimizes an upper bound on the loss obtained by finetuning jointly on all tasks.
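To make the procedure described in the abstract concrete: assuming a single full-batch gradient step per task with learning rate $\eta$, each task vector is $\tau_t = \theta_t - \theta_0 = -\eta \nabla L_t(\theta_0)$, so summing task vectors reproduces a multi-task gradient step at $\theta_0$; ATM then repeats this tune-then-merge step from the newly merged model. The sketch below is a minimal Python illustration of such an alternating loop under these assumptions; the helper `finetune_on_task`, the averaging rule, and the scaling coefficient `alpha` are hypothetical and not the authors' implementation.

```python
# Hedged sketch of an Alternating Tuning and Merging (ATM) loop.
# Parameters are represented as a dict of name -> array/tensor.
import copy

def finetune_on_task(model_params, task_data, epochs=1):
    """Placeholder (assumption): finetune a copy of the shared model on one
    task's data, e.g. one epoch of SGD, and return the tuned parameters."""
    raise NotImplementedError

def atm(shared_params, tasks, rounds=5, alpha=1.0):
    """Each round, every task independently finetunes the current shared model;
    the resulting task vectors (tuned minus shared parameters) are then merged
    back into the shared model via task arithmetic."""
    for _ in range(rounds):
        task_vectors = []
        for task_data in tasks:
            tuned = finetune_on_task(copy.deepcopy(shared_params), task_data)
            # Task vector: difference between the tuned and the shared model.
            task_vectors.append({k: tuned[k] - shared_params[k] for k in shared_params})
        # Task-arithmetic merge: add the (scaled) mean task vector to the shared model.
        shared_params = {
            k: shared_params[k]
            + alpha * sum(tv[k] for tv in task_vectors) / len(task_vectors)
            for k in shared_params
        }
    return shared_params
```

With `rounds=1` this reduces to ordinary task-arithmetic merging; larger `rounds` moves the procedure closer to multi-task gradient descent, which is the midpoint behavior the abstract describes.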
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7293