When, Where and Why to Average Weights?

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Poster · License: CC BY 4.0
Abstract: Averaging checkpoints along the training trajectory is a simple yet powerful approach to improve the generalization performance of Machine Learning models and reduce training time. Motivated by these potential gains, and in an effort to fairly and thoroughly benchmark this technique, we present an extensive evaluation of averaging techniques in modern Deep Learning, which we perform using AlgoPerf, a large-scale benchmark for optimization algorithms. We investigate whether weight averaging can reduce training time, improve generalization, and replace learning rate decay, as suggested by recent literature. Our evaluation across seven architectures and datasets reveals that averaging significantly accelerates training and yields considerable efficiency gains across all considered workloads, at the price of a minimal implementation and memory cost, while mildly improving generalization. Finally, we explore the relationship between averaging and learning rate annealing and show that combining the two achieves optimal performance.
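For readers unfamiliar with the technique, the sketch below shows one common form of checkpoint averaging: a uniform running average of the parameters visited during training, in the spirit of Polyak/SWA-style averaging. It is a minimal, hedged illustration under assumed names (`WeightAverager`, `update`, `avg_every`), not the exact procedure evaluated in the paper.

```python
import copy
import torch


class WeightAverager:
    """Keeps a uniform running average of model parameters
    (a minimal sketch of checkpoint averaging, not the paper's
    exact procedure)."""

    def __init__(self, model: torch.nn.Module):
        # Detached copy of the model holds the averaged weights.
        self.avg_model = copy.deepcopy(model)
        self.num_updates = 0

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # Running mean: avg <- avg + (w - avg) / n
        self.num_updates += 1
        for p_avg, p in zip(self.avg_model.parameters(), model.parameters()):
            p_avg.add_(p.detach() - p_avg, alpha=1.0 / self.num_updates)


# Hypothetical usage inside a training loop:
# averager = WeightAverager(model)
# for step, batch in enumerate(loader):
#     loss = loss_fn(model(batch.x), batch.y)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
#     if step % avg_every == 0:
#         averager.update(model)
# # Evaluate averager.avg_model instead of model.
```

Evaluating the averaged copy rather than the last iterate is what yields the reported speed-up: the average typically reaches a given validation target earlier than the raw trajectory, at the cost of storing one extra set of weights.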
Lay Summary: Modern neural networks contain many parameters that evolve throughout training, as the model learns to predict data from a training distribution. Previous works have shown that averaging the parameter values visited during training makes these models more robust and speeds up convergence. This work investigates the idea in detail, considering seven model architectures across six machine learning tasks and benchmarking the effectiveness of averaging for modern deep learning. We show that averaging can indeed speed up training, saving valuable computational resources, and we find that it also brings modest generalization gains. Finally, we highlight the connection between averaging and other important parts of the optimization pipeline.
Primary Area: Optimization->Large Scale, Parallel and Distributed
Keywords: Weight Averaging, Checkpoint Averaging, Optimization, Learning Rate Schedule.
Submission Number: 11761