Keywords: weight averaging, model averaging, model merging, permutation, communication, distributed, parallel, ensembling
TL;DR: We introduce WASH, a distributed training method that trains a population of models to achieve high performance when averaged, by randomly permuting a small fraction of parameters during training.
Abstract: Ensemble methods improve the performance of deep neural networks by averaging the outputs of several models, at an increased inference cost. Weight averaging methods aim to avoid this cost by merging the models, but naive averaging performs poorly when the models lie in different loss basins. Distributed training methods such as DART and PAPA have been proposed to train several models in parallel in the same basin, but they do so at the cost of reduced ensembling accuracy and significant communication between models. We introduce WASH, a novel distributed method that outperforms previous approaches by randomly shuffling a small percentage of model weights during training, at a much lower communication cost.
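The shuffling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `shuffle_step`, the flat-vector representation of each model, and the single global shuffle probability `p` are assumptions made for illustration, and the abstract does not specify how or when the shuffle is scheduled. The sketch only shows that permuting a small random subset of parameter coordinates across the population leaves the averaged model unchanged while moving only those coordinates between models.

```python
import numpy as np


def shuffle_step(population, p, rng):
    """Permute a random fraction of parameter coordinates across models.

    population: array of shape (n_models, n_params), one flat parameter
                vector per model (hypothetical representation).
    p:          probability that a given coordinate is shuffled (assumed
                to be a single global value here).
    Returns the shuffled population and the boolean mask of shuffled
    coordinates.
    """
    n_models, n_params = population.shape
    shuffled = population.copy()
    # Select roughly a fraction p of the coordinates.
    mask = rng.random(n_params) < p
    for j in np.flatnonzero(mask):
        # Draw an independent permutation of the models for this coordinate.
        shuffled[:, j] = population[rng.permutation(n_models), j]
    return shuffled, mask


rng = np.random.default_rng(0)
population = rng.normal(size=(4, 1000))  # 4 toy "models", 1000 parameters each
shuffled, mask = shuffle_step(population, p=0.05, rng=rng)

# Shuffling reorders values within a coordinate, so the averaged model is
# identical before and after the step.
print(np.allclose(population.mean(axis=0), shuffled.mean(axis=0)))
print(f"coordinates shuffled: {mask.sum()} of {mask.size}")
```

In this toy setup only the masked coordinates would need to be exchanged between workers, which is the intuition behind the lower communication cost claimed above.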
Submission Number: 51