Keywords: model merging, task arithmetic, visual representation learning, human mesh recovery, mast3r, semantic segmentation, mapfree challenge, depth, adamerge, hyperparameter selection
TL;DR: We evaluate model merging on diverse 2D and 3D vision tasks and propose the task alignment score (TAS), which any merging method can use to select its hyperparameters at a fraction of the cost of exhaustive search.
Abstract: Efficiently merging several models that were fine-tuned for different tasks but stem from the same pre-trained base model is of great practical interest. Despite extensive prior work, most evaluations of model merging in computer vision focus on image classification using CLIP, where different sets of labels define different classification tasks. This paper extends model merging to the more challenging setting in which the tasks operate in different output spaces and thus rely on different trainable decoders. This renders exhaustive hyperparameter search impractical. To address this, we introduce the task alignment score and show how it can be used to i) speed up hyperparameter selection by orders of magnitude and ii) generalize proxy-based hyperparameter selection to any task. We also find that, in our setting, several recent models fail, and show that this is largely due to some fine-tuned models being significantly farther from the base model than others, leading to a strong imbalance. More importantly, we demonstrate that, thanks to our contributions, model merging remains effective and can improve the performance of a state-of-the-art multi-task vision model.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 9135