Keywords: Ensemble learning, Transfer learning
Abstract: In this work, we present a simple algorithm for ensembling checkpoints from a single training trajectory (trajectory ensembling), which yields significant gains on several fine-tuning tasks. We compare against classical ensembles and perform ablation studies showing that the important checkpoints are not necessarily the best-performing models in terms of accuracy. Rather, relatively poor models with low loss are vital for the observed performance gains. We also investigate various mixtures of checkpoints from several independent training trajectories and make the surprising observation that this leads to only marginal gains in this setup. Finally, we study how calibrating the constituent models with simple temperature scaling affects the results, and find that the most important region of training is still that of lowest loss, despite potentially poor accuracy.
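As a rough illustration of the idea, the sketch below combines the predictions of several checkpoints from one training run. The abstract does not say how checkpoint outputs are combined, so averaging softmax probabilities is an assumption here, as are the function names `softmax` and `trajectory_ensemble`, the optional per-checkpoint temperatures, and the toy logits.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def trajectory_ensemble(checkpoint_logits, temperatures=None):
    """Average class probabilities over checkpoints of one training trajectory.

    checkpoint_logits: list of arrays, each of shape (n_examples, n_classes).
    temperatures: optional list with one temperature per checkpoint
                  (assumed to be fitted separately, e.g. on validation data).
    """
    if temperatures is None:
        temperatures = [1.0] * len(checkpoint_logits)
    probs = [softmax(l, t) for l, t in zip(checkpoint_logits, temperatures)]
    return np.mean(probs, axis=0)

# Hypothetical logits from two checkpoints on two examples (illustrative only).
ckpt_a = np.array([[2.0, 0.0], [0.0, 1.0]])
ckpt_b = np.array([[1.0, 0.0], [0.0, 3.0]])

ensemble_probs = trajectory_ensemble([ckpt_a, ckpt_b])
predictions = ensemble_probs.argmax(axis=-1)
```

Passing a `temperatures` list rescales each checkpoint's logits before averaging, which is one simple way the calibration step mentioned above could enter the ensemble.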