Merging Text Transformer Models from Different Initializations

Published: 16 Jun 2024, Last Modified: 17 Jul 2024 · HiLD at ICML 2024 Poster · CC BY 4.0
Keywords: Mode connectivity, model merging, loss landscapes
TL;DR: Merging Transformer minima with our proposed method yields lower loss barriers than vanilla averaging.
Abstract: Recent work on one-shot permutation-based model merging has shown impressive low- or zero-barrier mode connectivity between models trained from completely different initializations. However, this line of work has not yet been extended to Transformers, despite their dominance in the language domain. In this work, we therefore investigate the extent to which separate Transformer minima learn similar features, and propose a model merging technique to study the relationship between these minima in the loss landscape. Specifics of the architecture, such as its residual connections, multi-headed attention, and discrete, sequential input, require dedicated interventions to compute model permutations that remain within the same functional equivalence class. Merging models with our method, we consistently find lower loss barriers between minima than with vanilla model averaging, for several models trained on a masked-language-modeling task. Our results show that the minima of these models are less sharp and isolated than previously understood, and provide a basis for future work on further understanding Transformer solutions.
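To make the two central notions in the abstract concrete, here is a minimal sketch (an assumption on my part, not the paper's actual procedure, which handles Transformer-specific structure such as residual streams and attention heads): permuting hidden units of a simple two-layer network leaves its function unchanged, and a loss barrier can be measured along the linear interpolation path between two parameter sets. The helper names (`permute_hidden_units`, `loss_barrier`, `loss_fn`) are hypothetical.

```python
# Sketch only: a 2-layer MLP stand-in illustrating permutation equivalence
# and loss-barrier measurement, not the paper's Transformer merging method.
import numpy as np

def permute_hidden_units(W1, b1, W2, perm):
    """Apply a permutation to the hidden units of a 2-layer network.

    Rows of W1 and entries of b1 are permuted, and columns of W2 are permuted
    to match, so the network computes the same function (same functional
    equivalence class).
    """
    return W1[perm, :], b1[perm], W2[:, perm]

def loss_barrier(loss_fn, params_a, params_b, n_points=11):
    """Estimate the loss barrier between two parameter sets.

    Evaluates loss_fn (a user-supplied callable taking a list of arrays) along
    the linear interpolation path and returns the peak loss minus the mean of
    the endpoint losses; zero indicates (linear-mode) connectivity.
    """
    losses = []
    for t in np.linspace(0.0, 1.0, n_points):
        interp = [(1 - t) * pa + t * pb for pa, pb in zip(params_a, params_b)]
        losses.append(loss_fn(interp))
    return max(losses) - 0.5 * (losses[0] + losses[-1])
```

In this reading, permutation-based merging searches for a permutation of one model's units that aligns it with the other before interpolating, which is what can turn a high barrier under vanilla averaging into a low one.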
Student Paper: Yes
Submission Number: 45