Selective Mixup Helps with Distribution Shifts, But Not (Only) because of Mixup

Damien Teney; Jindong Wang; Ehsan Abbasnejad

16 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.

Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning

Keywords: mixup, distribution shifts, OOD generalization, weighted training

TL;DR: Selective mixup (a family of methods very successful at improving out-of-distribution generalization) is sometimes equivalent to weighted sampling, a classical baseline for handling covariate and label shift.

Abstract:

Mixup is a highly successful technique for improving the generalization of neural networks by augmenting the training data with combinations of random pairs. Selective mixup is a family of methods that apply mixup only to specific pairs, e.g. combining examples only across classes or across domains. These methods have claimed remarkable improvements on benchmarks with distribution shifts, but their mechanisms and limitations remain poorly understood.

We examine an overlooked aspect of selective mixup that explains its success in a completely new light. We find that the non-random selection of pairs affects the training distribution and improves generalization by means completely unrelated to the mixing. For example, in binary classification, mixup across classes implicitly resamples the data toward a uniform class distribution, which is a classical solution to label shift. We show empirically that this implicit resampling explains much of the improvement reported in prior work. Theoretically, these results rely on a "regression toward the mean", an accidental property that we identify in several datasets.
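
The implicit resampling in the binary case can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `cross_class_pairs` and `class_frequencies` and the 900/100 class split are assumptions chosen for illustration. The sketch pairs every example with a random partner from the other class, then measures the class distribution over all examples that enter a pair, before any mixing takes place.

```python
import random
from collections import Counter


def cross_class_pairs(labels, rng):
    """Pair each example with a partner drawn uniformly from the other class.

    Hypothetical helper: one plausible form of 'mixup across classes'
    for binary labels in {0, 1}.
    """
    by_class = {0: [], 1: []}
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    return [(i, rng.choice(by_class[1 - y])) for i, y in enumerate(labels)]


def class_frequencies(labels, pairs):
    """Class distribution over all examples that take part in a pair."""
    counts = Counter()
    for i, j in pairs:
        counts[labels[i]] += 1
        counts[labels[j]] += 1
    total = sum(counts.values())
    return {c: counts[c] / total for c in sorted(counts)}


rng = random.Random(0)
labels = [0] * 900 + [1] * 100  # illustrative 90/10 class imbalance
pairs = cross_class_pairs(labels, rng)
freqs = class_frequencies(labels, pairs)
print(freqs)
```

Because every pair contains exactly one example of each class, the paired data is class-balanced regardless of the original imbalance. Under these assumptions, the pair selection alone has the same effect on the class distribution as class-balanced sampling.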

Takeaways: We have found a new equivalence between two successful methods: selective mixup and resampling. We identify the limits of the former, confirm the effectiveness of the latter, and find better combinations of their respective benefits.

Submission Number: 498
