Abstract: Training Deep Neural Networks (DNNs) places immense compute requirements on the underlying hardware platforms, expending large amounts of time and energy. An important factor contributing to the long training times is the increasing dataset complexity required to reach state-of-the-art performance in real-world applications. To address this challenge, we explore the use of input mixing, where multiple inputs are combined into a single composite input with an associated composite label for training. The goal is for training on the mixed input to achieve a similar effect as training separately on each of the constituent inputs that it represents. This results in a lower number of inputs (or mini-batches) to be processed in each epoch, proportionally reducing training time.
We find that naive input mixing leads to a considerable drop in learning performance and model accuracy due to interference between the forward/backward propagation of the mixed inputs. We propose two strategies to address this challenge and realize training speedups from input mixing with minimal impact on accuracy. First, we reduce the impact of inter-input interference by exploiting the spatial separation between the features of the constituent inputs in the network’s intermediate representations. We also adaptively vary the mixing ratio of constituent inputs based on their loss in previous epochs. Second, we propose heuristics to automatically identify the subset of the training dataset that is subject to mixing in each epoch. For ResNets of varying depth and MobileNetV2, we obtain up to 1.6x and 1.8x speedups in training for the ImageNet and CIFAR-10 datasets, respectively, on an NVIDIA RTX 2080Ti GPU, with negligible loss in classification accuracy.
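To make the core idea concrete, below is a minimal sketch of input mixing as a linear combination of two inputs and their labels, in the spirit of mixup-style augmentation. The function name `mix_inputs` and the fixed ratio `lam` are illustrative assumptions, not the paper's API; the proposed approach additionally adapts the mixing ratio per input based on prior-epoch loss and selects which subset of the dataset to mix.

```python
import numpy as np

def mix_inputs(x1, y1, x2, y2, lam=0.5):
    """Combine two inputs into one composite input with a composite label.

    lam is the mixing ratio. Here it is a fixed constant for illustration;
    the paper varies it adaptively based on each input's loss in
    previous epochs.
    """
    x_mix = lam * x1 + (1.0 - lam) * x2  # composite input
    y_mix = lam * y1 + (1.0 - lam) * y2  # composite (soft) label
    return x_mix, y_mix

# Two toy "images" and one-hot labels; mixing them yields a single
# training sample, so an epoch processes half as many mini-batches.
x1, x2 = np.ones((3, 3)), np.zeros((3, 3))
y1 = np.array([1.0, 0.0])  # class 0
y2 = np.array([0.0, 1.0])  # class 1
x_mix, y_mix = mix_inputs(x1, y1, x2, y2, lam=0.7)
```

Training then proceeds on `(x_mix, y_mix)` with a loss that accepts soft labels (e.g., cross-entropy against the mixed label distribution).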
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
- Corrected typographical errors pointed out by reviewers
- Elaborated discussion on obtaining loss of constituent inputs in previous epochs, during adaptive mixing (Sec. 3.1, page 5)
- Extended experiments on applying MixTrain with different optimizers (Table 5, Sec. 7.2)
- Added experiments on vision transformers (Table 2, Sec. 4.1)
- Added experiments, comparing proposed approach against existing efforts (Table 3, Sec. 4.3)
- Added experiments on applying proposed approach with different optimizers (Table 4, Sec. 7.3)
- Added training runtime details for baseline and proposed approach (Table 5, Sec. 7.4)
- Added impact of proposed approach on accuracy and speed-up when mixing more than 2 inputs (Table 6, Sec. 7.5)
- Added discussion on applicability of proposed approach to tasks other than image classification (Sec. 7.8)
Assigned Action Editor: ~Neil_Houlsby1
Submission Number: 1061