Pathwise EMA: An Intrinsic Clock for Weight Averaging

Published: 29 May 2026, Last Modified: 29 May 2026, Venue: HiLD at ICML 2026 Poster, License: CC BY 4.0
Keywords: Optimization, Exponential Moving Average, Large Language Models, Adaptive Regularization, Scaling Laws, Training Stability
TL;DR: This paper introduces Pathwise EMA (PEMA), a stabilization method that replaces traditional time-based weight averaging with an adaptive decay rule based on the distance model parameters travel in weight space.
Abstract: Exponential moving averages of model weights are widely used to stabilize deep learning training, but standard EMA introduces decay, offset, and update-frequency hyperparameters that must be retuned across learning-rate schedules, batch sizes, and model scales. We ask whether EMA can instead be made adaptive to the optimization trajectory itself. We propose Pathwise EMA (PEMA), a parameter-free EMA scheme that replaces time-based decay with a decay rule based on the normalized path length traveled by the model parameters in weight space. The central intuition is that path length acts as an intrinsic clock for training: high-velocity or noisy trajectories require stronger smoothing, whereas slower trajectories require less smoothing to avoid lag. Across supervised fine-tuning experiments on SmolLM2, Qwen, and Gemma models, PEMA consistently matches or outperforms the best tuned standard EMA across sweeps over learning rate, minimum learning rate, batch size, and update frequency. These results suggest that path-based averaging can provide a simple, robust stabilizer for language model fine-tuning while substantially reducing the hyperparameter tuning burden of EMA.
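To make the contrast in the abstract concrete, the sketch below places a standard time-based EMA (with the decay, offset, and update-frequency hyperparameters named above) next to a path-driven variant. The abstract does not state PEMA's actual decay rule, so the rule used here is purely an assumption for illustration: the decay is set to one minus the ratio of net displacement to accumulated path length, so a trajectory that wanders without making progress is smoothed more strongly and a direct trajectory is smoothed less. The class names `StandardEMA` and `PathwiseEMASketch`, the helper `l2`, and this normalization are all hypothetical and should not be read as the authors' method.

```python
import math


def l2(a, b):
    """Euclidean distance between two flat parameter vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class StandardEMA:
    """Time-based EMA with the three hyperparameters named in the abstract."""

    def __init__(self, params, decay=0.999, offset=0, update_every=1):
        self.avg = list(params)
        self.decay = decay
        self.offset = offset
        self.update_every = update_every
        self.step = 0

    def update(self, params):
        self.step += 1
        if self.step <= self.offset:
            # Before the offset, the average simply tracks the raw weights.
            self.avg = list(params)
        elif self.step % self.update_every == 0:
            d = self.decay
            self.avg = [d * a + (1 - d) * p for a, p in zip(self.avg, params)]


class PathwiseEMASketch:
    """HYPOTHETICAL path-driven EMA. The decay rule is an assumption,
    not the rule from the paper."""

    def __init__(self, params):
        self.start = list(params)  # weights at the start of averaging
        self.prev = list(params)   # weights at the previous step
        self.avg = list(params)
        self.path_len = 0.0        # accumulated distance traveled

    def update(self, params):
        self.path_len += l2(params, self.prev)
        self.prev = list(params)
        if self.path_len == 0.0:
            return 0.0  # no movement yet, nothing to average
        # Assumed normalization: net displacement / path length, in [0, 1].
        efficiency = min(1.0, l2(params, self.start) / self.path_len)
        decay = 1.0 - efficiency   # wandering path -> decay closer to 1
        self.avg = [decay * a + (1 - decay) * p
                    for a, p in zip(self.avg, params)]
        return decay
```

Under this assumed rule, a straight-line trajectory yields a decay of 0 (the average follows the weights with no lag), while a trajectory that oscillates back to its starting point yields a decay of 1 at that step (the average is held fixed). No decay, offset, or update-frequency value has to be chosen by hand, which is the property the abstract attributes to PEMA.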
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 46