$k$-Mixup Regularization for Deep Learning via Optimal Transport
Abstract: Mixup is a popular regularization technique for training deep neural networks that improves generalization and increases robustness to certain distribution shifts. It perturbs input training data in the direction of other randomly-chosen instances in the training set. To better leverage the structure of the data, we extend mixup in a simple, broadly applicable way to $k$-mixup, which perturbs $k$-batches of training points in the direction of other $k$-batches. The perturbation is done with displacement interpolation, i.e. interpolation under the Wasserstein metric. We demonstrate theoretically and in simulations that $k$-mixup preserves cluster and manifold structures, and we extend theory studying the efficacy of standard mixup to the $k$-mixup case. Our empirical results show that training with $k$-mixup further improves generalization and robustness across several network architectures and benchmark datasets of differing modalities. For the wide variety of real datasets considered, the performance gains of $k$-mixup over standard mixup are similar to or larger than the gains of mixup itself over standard ERM after hyperparameter optimization. In several instances, in fact, $k$-mixup achieves gains in settings where standard mixup has negligible to zero improvement over ERM.
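The $k$-mixup idea sketched in the abstract (match two randomly drawn $k$-batches under the Wasserstein metric, then displacement-interpolate matched pairs) can be illustrated with a minimal sketch. This is not the authors' implementation; the squared-Euclidean ground cost, one-hot labels, and the Beta$(\alpha,\alpha)$ interpolation weight are assumptions carried over from standard mixup:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def k_mixup(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Mix two k-batches along an optimal-transport matching (illustrative sketch).

    x1, x2: feature arrays of shape (k, d); y1, y2: one-hot labels of shape (k, c).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Pairwise squared-Euclidean costs between the two k-batches.
    cost = ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(axis=-1)
    # OT between two equal-weight empirical measures of k points reduces
    # to a linear assignment problem on the cost matrix.
    rows, cols = linear_sum_assignment(cost)
    # Single interpolation weight per batch pair, as in standard mixup.
    lam = rng.beta(alpha, alpha)
    # Displacement interpolation: move each point toward its OT-matched partner.
    x_mix = lam * x1[rows] + (1 - lam) * x2[cols]
    y_mix = lam * y1[rows] + (1 - lam) * y2[cols]
    return x_mix, y_mix
```

With $k = 1$ this reduces to standard mixup, since the assignment between single points is trivial; larger $k$ lets the matching respect cluster and manifold structure in the batch.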
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=4Cx6GHd99J
Changes Since Last Submission:
* Reframed the adversarial robustness experiments as evidence of robustness to certain forms of distribution shift. In this vein, experiments demonstrating superior performance in the presence of added Gaussian noise have been added (Sec 5).
* Clarified language in the abstract on the improvements of $k$-mixup over 1-mixup and ERM: "For the wide variety of real datasets considered, the performance gains of k-mixup over standard mixup are similar to or larger than the gains of mixup itself over standard ERM after hyperparameter optimization. In several instances, in fact, $k$-mixup achieves gains in settings where standard mixup has negligible to zero improvement over ERM." Further clarifications within the main text were also added.
* Added new commentary on the superiority of our method over a $k$-NN based matching strategy in supplement Section J and Figure 7 (illustrations of the poorer matching distributions of $k$-NN).
* Added Fig. 4 to clarify the notion of injectivity radius, along with a reference for homotopy equivalence (Hatcher) in case a full definition is desired.
* Moved the toy dataset experiments to the appendix, as they are not real datasets; they are included for completeness.
* Added further error bar discussion in footnote 6, clarifying our previously reported intervals.
Assigned Action Editor: ~Yann_Dauphin1
Submission Number: 1084