The Butterfly Effect: Neural Network Training Trajectories Are Highly Sensitive to Initial Conditions
TL;DR: Tiny perturbations cause neural network training to diverge to distinct loss basins
Abstract: Neural network training is inherently sensitive to initialization and the randomness induced by stochastic gradient descent. However, it is unclear to what extent such effects lead to meaningfully different networks, either in terms of the models' weights or the underlying functions they learn. In this work, we show that during the initial "chaotic" phase of training, even extremely small perturbations reliably cause otherwise identical training trajectories to diverge, an effect that diminishes rapidly over training time. We quantify this divergence through (i) the $L^2$ distance between parameters, (ii) the loss barrier when interpolating between networks, (iii) the $L^2$ distance and loss barrier between parameters after permutation alignment, and (iv) representational similarity between intermediate activations, revealing how perturbations across different hyperparameter and fine-tuning settings drive training trajectories toward distinct loss minima. Our findings provide insights into the stability of neural network training, with practical implications for fine-tuning, model merging, and the diversity of model ensembles.
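As a rough illustration of the first two metrics, the following PyTorch sketch computes the $L^2$ distance between two networks' parameters and the loss barrier along the linear interpolation between them. This is a minimal sketch under assumed names (`model_a`, `model_b`, `data_loader` are placeholders), not the authors' released implementation, which is available at the linked repository.

```python
# Sketch of two divergence metrics: parameter L2 distance and the loss barrier
# along the linear interpolation between two networks with identical architectures.
import copy
import torch
import torch.nn.functional as F


def l2_distance(model_a, model_b):
    """L2 distance between the flattened parameter vectors of two models."""
    diffs = [
        (pa - pb).flatten()
        for pa, pb in zip(model_a.parameters(), model_b.parameters())
    ]
    return torch.cat(diffs).norm(p=2).item()


@torch.no_grad()
def interpolate(model_a, model_b, alpha):
    """Return a copy of model_a with weights (1 - alpha) * theta_a + alpha * theta_b."""
    model = copy.deepcopy(model_a)
    for p, pa, pb in zip(model.parameters(), model_a.parameters(), model_b.parameters()):
        p.copy_((1 - alpha) * pa + alpha * pb)
    return model


@torch.no_grad()
def loss_barrier(model_a, model_b, data_loader, device="cpu", num_points=11):
    """Max excess of the interpolated loss over the linear interpolation of endpoint losses."""
    def avg_loss(model):
        model.eval().to(device)
        total, n = 0.0, 0
        for x, y in data_loader:
            x, y = x.to(device), y.to(device)
            total += F.cross_entropy(model(x), y, reduction="sum").item()
            n += y.numel()
        return total / n

    alphas = [a.item() for a in torch.linspace(0.0, 1.0, num_points)]
    losses = [avg_loss(interpolate(model_a, model_b, a)) for a in alphas]
    baselines = [(1 - a) * losses[0] + a * losses[-1] for a in alphas]
    return max(l - b for l, b in zip(losses, baselines))
```

The permutation-aligned variants (metric iii) would apply a neuron-permutation matching step to `model_b` before calling these same functions.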
Lay Summary: Due to noise, two neural networks trained from the same random starting point can learn one of many different solutions to the same problem, whereas pre-trained networks tend to learn the same solution. What we don't know is when and how networks switch from learning different solutions to learning the same solution. To answer this question, we train twin copies of neural networks in exactly the same way, but add a tiny change (perturbation) to one of the copies during training. We find that for networks at random starting points, even the tiniest change (far smaller than typical random effects) causes training to learn different solutions, whereas pre-trained networks only learn different solutions when changes much larger than random effects are applied. Our findings are significant because we often need to retrain and combine knowledge from several huge networks (such as large language models). Since some methods work better with similar solutions and others with different solutions, we can tailor our retraining or model-combining methods to best target each case.
Link To Code: https://github.com/gsaltintas/lmc
Primary Area: Deep Learning->Theory
Keywords: early training, linear mode connectivity, loss landscape, perturbation sensitivity, permutation symmetry
Submission Number: 14603