Abstract: Deep ensembles offer reduced generalization error and improved predictive uncertainty estimates.
These performance gains are attributed to functional diversity among the component models that
make up the ensembles: ensemble performance increases with the diversity of the components. A
standard way to generate a diversity of components is to train multiple networks on the same data,
using different minibatch orders, augmentations, etc. In this work, we focus on how and when this
type of diversity among the learned predictors decreases throughout training.
In order to study the diversity of networks still accessible via SGD after t iterations, we first train a
single network for t iterations, then duplicate the state of the optimizer and finish the remainder of
training k times, with independent randomness (minibatches, augmentations, etc.) for each duplicated
network. The result is k distinct networks whose training has been coupled for t iterations. We use
this methodology—recently exploited for k = 2 to study linear mode connectivity—to construct a
novel probe for studying diversity.
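A minimal sketch of this coupling procedure, assuming a PyTorch-style training loop (the model and optimizer constructors, data loader, and seeding convention are placeholders rather than the paper's actual setup):

```python
import copy
import torch
import torch.nn.functional as F

def train_steps(model, optimizer, loader, num_steps, seed):
    # Run num_steps of SGD with its own source of minibatch/augmentation randomness.
    torch.manual_seed(seed)
    step = 0
    while step < num_steps:
        for x, y in loader:
            if step >= num_steps:
                break
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()
            step += 1
    return model

def coupled_ensemble(make_model, make_optimizer, loader, t, total_iters, k):
    # Phase 1: train a single "parent" network for t iterations.
    parent = make_model()
    parent_opt = make_optimizer(parent)
    train_steps(parent, parent_opt, loader, t, seed=0)

    # Phase 2: duplicate the network and optimizer state, then finish training
    # k times with independent randomness (minibatch order, augmentations, ...).
    children = []
    for i in range(k):
        child = copy.deepcopy(parent)
        child_opt = make_optimizer(child)
        child_opt.load_state_dict(copy.deepcopy(parent_opt.state_dict()))
        train_steps(child, child_opt, loader, total_iters - t, seed=i + 1)
        children.append(child)
    return children  # k networks whose training was coupled for t iterations
```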
We find that coupling the k networks for even a few epochs severely restricts the diversity of functions accessible
by SGD, as measured by the KL divergence between the predicted label distributions, as well as by the
calibration and test error of k-ensembles. We also find that the number of forgetting events [1] drops
off rapidly.
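As an illustration, the diversity measure could be computed as the average pairwise KL divergence between the networks' predicted label distributions on held-out data, roughly as below (a sketch; the exact averaging convention used in the paper may differ):

```python
import torch
import torch.nn.functional as F

def pairwise_kl(models, x):
    # Predicted label distributions for each of the k networks on a batch x.
    with torch.no_grad():
        probs = [F.softmax(m(x), dim=-1) for m in models]
    # Average KL(p_i || p_j) over all ordered pairs i != j.
    k = len(probs)
    total, pairs = 0.0, 0
    for i in range(k):
        for j in range(k):
            if i == j:
                continue
            # F.kl_div expects log-probabilities as input and probabilities as target,
            # and computes KL(target || input).
            total += F.kl_div(probs[j].log(), probs[i], reduction="batchmean").item()
            pairs += 1
    return total / pairs
```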
The amount of independent training time, however, decreases as the coupling time t increases. To control
for this confounder, we extend high-learning-rate optimization by an additional t iterations post-coupling.
We find that this does not restore functional diversity.
We also study how functional diversity is affected by retraining after reinitializing the weights in some
layers. We find that we recover significantly more diversity by reinitializing layers closer to the input
layer, compared to reinitializing layers closer to the output. In this case, we see that reinitialization
upsets linear mode connectivity. This observation agrees with the performance improvements seen in
architectures that share the core of a network but train multiple instantiations of the input layers [2].
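A sketch of the reinitialization probe, assuming a PyTorch model whose submodules expose reset_parameters(); the layer names in the usage comment are illustrative, not the architecture studied here:

```python
import copy
import torch.nn as nn

def reinit_layers(model, layer_names):
    # Re-draw fresh random weights for the named layers, keeping all other weights.
    probe = copy.deepcopy(model)
    for name, module in probe.named_modules():
        if name in layer_names and hasattr(module, "reset_parameters"):
            module.reset_parameters()
    return probe

# Example (hypothetical ResNet-style names): reinitialize input-side layers,
# which recovers more diversity on retraining than output-side layers.
# probe_in  = reinit_layers(trained_model, {"conv1", "layer1.0.conv1"})
# probe_out = reinit_layers(trained_model, {"fc"})
```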