BOSS: Diversity-Difficulty Balanced One-Shot Subset Selection for Data-Efficient Deep Learning

22 Sept 2023 (modified: 11 Feb 2024), Submitted to ICLR 2024
Primary Area: general machine learning (i.e., none of the above)
Keywords: subset selection, data-efficient deep learning
TL;DR: We propose diversity-difficulty Balanced One-shot Subset Selection (BOSS), which aims to construct an optimal subset for data-efficient deep learning.
Abstract: Subset (core-set) selection offers a data-efficient way to train deep learning models: it identifies important data samples so that a model trained on the selected subset performs similarly to one trained on the full set. However, most existing methods tend to choose either diverse or difficult samples, which is likely to yield a suboptimal subset and a model with compromised generalization performance. A key limitation is a misalignment with the underlying goal of subset selection: an optimal subset should faithfully represent the joint data distribution, which comprises both feature and label information. To this end, we propose diversity-difficulty Balanced One-shot Subset Selection (BOSS), which aims to construct an optimal subset for data-efficient deep learning. Samples are selected so that a novel balanced core-set loss bound is minimized, which theoretically justifies the need to consider diversity and difficulty simultaneously when forming an optimal subset. The loss bound also reveals the key relationship between the type of data samples to include in the subset and the subset size. This relationship in turn inspires the design of an expressive importance function that balances diversity and difficulty according to the subset size, giving fine-grained control over sample importance. A comprehensive experimental study on both synthetic and real datasets verifies the important theoretical properties and demonstrates the superior performance of BOSS compared with competitive baselines.
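
To make the idea of trading off diversity against difficulty concrete, the following is a minimal sketch and not the BOSS algorithm itself: the abstract does not give the importance function or the loss bound, so this swaps in a plain farthest-point (k-center-style) greedy rule combined linearly with a per-sample difficulty score. The function name `select_subset`, the parameter `difficulty_weight`, the mean-based seeding, and the linear scoring rule are all assumptions made for illustration.

```python
import numpy as np


def select_subset(features, difficulty, budget, difficulty_weight):
    """Greedy one-shot selection (illustrative only, not the paper's method).

    Each step picks the unselected sample maximizing
        (1 - w) * diversity + w * difficulty,
    where diversity is the normalized distance to the nearest selected
    sample and w = difficulty_weight lies in [0, 1].
    """
    features = np.asarray(features, dtype=float)
    difficulty = np.asarray(difficulty, dtype=float)
    n = features.shape[0]
    if not 0 < budget <= n:
        raise ValueError("budget must be between 1 and the number of samples")
    if not 0.0 <= difficulty_weight <= 1.0:
        raise ValueError("difficulty_weight must be in [0, 1]")

    # Hypothetical seeding choice: start from the sample nearest the feature mean.
    center = features.mean(axis=0)
    seed = int(np.argmin(np.linalg.norm(features - center, axis=1)))

    selected = [seed]
    # Distance from every sample to its nearest selected sample.
    min_dist = np.linalg.norm(features - features[seed], axis=1)

    while len(selected) < budget:
        # Normalize so diversity and difficulty are on comparable scales.
        diversity = min_dist / (min_dist.max() + 1e-12)
        score = (1.0 - difficulty_weight) * diversity + difficulty_weight * difficulty
        score[selected] = -np.inf  # never pick the same sample twice
        idx = int(np.argmax(score))
        selected.append(idx)
        new_dist = np.linalg.norm(features - features[idx], axis=1)
        min_dist = np.minimum(min_dist, new_dist)

    return selected
```

In this sketch `difficulty_weight` is passed in by the caller. The abstract states that the right balance depends on the subset size, but it does not give the functional form, so any schedule tying the weight to `budget` would be a further assumption.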
Submission Number: 4411