The Dynamics of Functional Diversity throughout Neural Network Training

28 Sept 2023, OpenReview Archive Direct Upload
Abstract: Deep ensembles offer reduced generalization error and improved predictive uncertainty estimates. These performance gains are attributed to functional diversity among the component models that make up the ensembles: ensemble performance increases with the diversity of the components. A standard way to generate a diversity of components is to train multiple networks on the same data, using different minibatch orders, augmentations, etc. In this work, we focus on how and when this type of diversity in the learned predictor decreases throughout training. In order to study the diversity of networks still accessible via SGD after t iterations, we first train a single network for t iterations, then duplicate the state of the optimizer and finish the remainder of training k times, with independent randomness (minibatches, augmentations, etc.) for each duplicated network. The result is k distinct networks whose training has been coupled for t iterations. We use this methodology, recently exploited for k = 2 to study linear mode connectivity, to construct a novel probe for studying diversity. We find that even a few epochs of coupling severely restrict the diversity of functions accessible by SGD, as measured by the KL divergence between the predicted label distributions as well as by the calibration and test error of k-ensembles. We also find that the number of forgetting events [1] drops off rapidly. However, the amount of independent training time decreases with the coupling time t. To control for this confounder, we extend high-learning-rate optimization for an additional t iterations after coupling, and find that this does not restore functional diversity. We also study how functional diversity is affected by retraining after reinitializing the weights in some layers. We recover significantly more diversity by reinitializing layers closer to the input than by reinitializing layers closer to the output; in this case, reinitialization disrupts linear mode connectivity. This observation agrees with the performance improvements seen in architectures that share the core of a network but train multiple instantiations of the input layers [2].
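The coupled-training probe described in the abstract can be summarized as a short procedure. The following is a minimal sketch, not the paper's actual implementation: the two-layer classifier, the synthetic data, and the hyperparameters (T_TOTAL, T_COUPLE, K, learning rate) are placeholders, and plain SGD is used so that duplicating the optimizer state reduces to copying the weights (with momentum, one would also copy the optimizer's state_dict).

```python
# Sketch: couple training for t iterations, branch k copies with independent
# randomness, then measure pairwise KL divergence between their predictions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def make_model():
    # Stand-in classifier; the architecture here is an assumption.
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

def train(model, loader, steps, lr=0.1, seed=0):
    # The seed controls the minibatch order, i.e. the branch's independent randomness.
    torch.manual_seed(seed)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    it = iter(loader)
    for _ in range(steps):
        try:
            x, y = next(it)
        except StopIteration:
            it = iter(loader)
            x, y = next(it)
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
    return model

# Toy data standing in for a real dataset.
X, Y = torch.randn(512, 32), torch.randint(0, 10, (512,))
loader = DataLoader(TensorDataset(X, Y), batch_size=64, shuffle=True)

T_TOTAL, T_COUPLE, K = 200, 50, 3

# 1) Shared phase: train a single network for t = T_COUPLE iterations.
parent = train(make_model(), loader, T_COUPLE, seed=0)

# 2) Branch: duplicate the state and finish training K times, each branch
#    with a different seed and hence a different minibatch order.
children = []
for k in range(K):
    child = copy.deepcopy(parent)
    children.append(train(child, loader, T_TOTAL - T_COUPLE, seed=k + 1))

# 3) Probe functional diversity: mean pairwise KL divergence between the
#    branches' predicted label distributions.
with torch.no_grad():
    probs = [F.softmax(m(X), dim=-1) for m in children]
kls = []
for i in range(K):
    for j in range(K):
        if i != j:
            kls.append(F.kl_div(probs[j].log(), probs[i], reduction="batchmean"))
print("mean pairwise KL:", torch.stack(kls).mean().item())
```

The mean pairwise KL computed at the end corresponds to one of the diversity measures mentioned in the abstract; the calibration and test error of the k-ensemble would be computed from the same branched networks.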