Distributionally Robust Coreset Selection under Covariate Shift

Published: 14 Jun 2025, Last Modified: 14 Jun 2025, Accepted by TMLR, CC BY 4.0
Abstract: Coreset selection, which selects a small subset of an existing training dataset, is an approach to reducing the amount of training data, and various methods have been proposed for it. In practice, the data distribution often differs between the development phase and the deployment phase, and the deployment distribution is unknown. It is therefore challenging to select a subset of training data that performs well across all deployment scenarios. We propose Distributionally Robust Coreset Selection (DRCS). DRCS theoretically derives an estimate of an upper bound on the worst-case test error, assuming that the future covariate distribution may deviate from the training distribution within a defined range. By selecting instances so as to suppress this estimated upper bound, DRCS achieves distributionally robust training instance selection. The method is primarily applicable to convex training problems, but we show that it can also be applied to deep learning under appropriate approximations. In this paper, we focus on covariate shift, a type of data distribution shift, and demonstrate the effectiveness of DRCS through experiments.
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: Major updates in the camera-ready version relative to the accepted version are as follows:
- In Expression (11) and the expressions that follow, we made explicit that both $\boldsymbol w$ (weights on training instances) and $\boldsymbol w^\prime$ (weights on validation instances) must be optimized.
- In the formulations of the optimization problems, we rewrote $\max_{\boldsymbol w} \min_{\boldsymbol\beta}$ as $\min_{\boldsymbol\beta} \max_{\boldsymbol w}$.
- After the paper was accepted, we uploaded the code (provided as supplementary material during review) to GitHub. The camera-ready version therefore provides the code URL, and the supplementary material has been removed.
Code: https://github.com/takeuchi-lab/DRCS
Assigned Action Editor: ~Jeff_Phillips1
Submission Number: 4043