Parameter Averaging Laws for Multitask Language Models

Published: 28 Oct 2023, Last Modified: 13 Dec 2023, FL@FM-NeurIPS’23 Poster
Student Author Indication: Yes
Keywords: parameter averaging, multitask language model, federated learning, pretrained models, multilingual language model
TL;DR: We study the conditions for successful parameter-averaging and explore the possibility of partial averaging.
Abstract: Parameter-averaging, a method for combining multiple models into a single one, has emerged as a promising approach to enhance performance without requiring additional space or retraining. Nonetheless, the conditions for successful parameter-averaging remain undefined, calling for further research to characterize them. In this study, we empirically investigate the factors that influence successful parameter-averaging and reveal \emph{positive correlations between representation power and the performance gain of parameter-averaging}. Specifically, we evaluate how computational budget, data diversity, and vocabulary size contribute to representation power, and how they influence the success of parameter-averaging. Our results demonstrate that parameter-averaging improves generalization on both in-domain and out-of-domain data. Additionally, to reduce the computational cost of parameter-averaging, we introduce \textit{partial averaging}, which assumes arbitrary participation of a subset of contributors. We observe that partial averaging outperforms fine-tuning for models with sufficient representation power. Furthermore, we find that the impact of data heterogeneity, which arises from the differing data distributions of contributors, diminishes as the representation power of the model increases. These findings provide valuable insights into the principles governing parameter-averaging and its potential for enhancing model performance.
Submission Number: 17
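The full and partial averaging described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: models are represented as dicts mapping parameter names to lists of floats (stand-ins for weight tensors), and `average_params` / `partial_average` are hypothetical helper names.

```python
# Minimal sketch of parameter averaging. Each "model" is a dict mapping
# parameter names to lists of floats (a stand-in for real weight tensors).

def average_params(models):
    """Element-wise average of the parameters across all models (full averaging)."""
    n = len(models)
    return {
        name: [sum(vals) / n for vals in zip(*(m[name] for m in models))]
        for name in models[0]
    }

def partial_average(models, participant_ids):
    """Average only an arbitrary subset of contributors (partial averaging)."""
    return average_params([models[i] for i in participant_ids])

# Toy usage: three contributor models sharing one parameter vector "w".
contributors = [
    {"w": [1.0, 2.0]},
    {"w": [3.0, 4.0]},
    {"w": [5.0, 6.0]},
]
full = average_params(contributors)           # averages all three contributors
part = partial_average(contributors, [0, 2])  # averages only contributors 0 and 2
```

In practice the same element-wise averaging would be applied to each tensor in a model's state dict; partial averaging simply restricts it to whichever contributors happen to participate.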