The Impact of Coreset Selection on Spurious Correlations and Group Robustness

Published: 18 Sept 2025, Last Modified: 30 Oct 2025
Venue: NeurIPS 2025 Datasets and Benchmarks Track poster
License: CC BY 4.0
Keywords: Coreset Selection, Data Selection, Data Pruning, Group Robustness, Spurious Correlations, Bias, Spurious Bias
TL;DR: A comprehensive empirical study on how coreset selection methods impact bias and group robustness of downstream models.
Abstract: Coreset selection methods have shown promise in reducing the training data size while maintaining model performance for data-efficient machine learning. However, many large real-world datasets suffer from unknown spurious correlations and hidden biases. Therefore, it is crucial to understand how such biases affect downstream tasks via the selected coresets. In this work, we conduct the first comprehensive analysis of the implications of data selection on the bias levels of the selected coresets and the robustness of downstream models trained on them. We use an extensive experimental setting spanning ten different spurious correlation benchmarks, five score metrics to characterize sample importance/difficulty, and five data selection policies across a broad range of coreset sizes to identify important patterns and derive insights. This analysis uncovers a series of nontrivial nuances in the well-known interactions between sample difficulty and bias alignment, as well as between dataset bias and the resulting model robustness. For example, we show that embedding-based sample characterizations run a lower risk of inadvertently exacerbating bias when used to select coresets than characterizations based on learning dynamics. Our analysis also reveals that the lower bias levels achieved by coresets of difficult samples do not reliably guarantee downstream robustness. Most importantly, we show that special considerations need to be made when the coreset size is very small, since there is a unique risk of highly prototypical coresets reaching high average performance while obscuring their low group robustness.
Code URL: https://github.com/theamaya/Robustness-impacts-of-coreset-selection
Supplementary Material: pdf
Primary Area: Evaluation (e.g., data collection methodology, data processing methodology, data analysis methodology, meta studies on data sources, extracting signals from data, replicability of data collection and data analysis and validity of metrics, validity of data collection experiments, human-in-the-loop for data collection, human-in-the-loop for data evaluation)
Submission Number: 1911
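Illustrative note: the abstract's closing point, that high average performance can obscure low group robustness, can be made concrete with a small sketch. The code below is not taken from the paper or its repository. The function names, the group labels ("bias-aligned", "bias-conflicting"), and the toy predictions are all hypothetical, and worst-group accuracy is used here as an assumed stand-in for group robustness.

```python
from collections import defaultdict


def group_accuracies(y_true, y_pred, groups):
    """Return a dict mapping each group label to its accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}


def average_and_worst_group_accuracy(y_true, y_pred, groups):
    """Return (average accuracy, worst-group accuracy, per-group accuracies)."""
    per_group = group_accuracies(y_true, y_pred, groups)
    average = sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)
    worst = min(per_group.values())
    return average, worst, per_group


# Hypothetical toy evaluation set: 90 bias-aligned and 10 bias-conflicting samples.
y_true = [1] * 100
groups = ["bias-aligned"] * 90 + ["bias-conflicting"] * 10

# Hypothetical model that relies on the spurious feature: it is correct on all
# bias-aligned samples but on only 2 of the 10 bias-conflicting samples.
y_pred = [1] * 90 + [1] * 2 + [0] * 8

average, worst, per_group = average_and_worst_group_accuracy(y_true, y_pred, groups)
print(f"average accuracy:     {average:.2f}")
print(f"worst-group accuracy: {worst:.2f}")
```

In this toy setting the average accuracy is dominated by the majority group, so reporting it alone would hide the much lower accuracy on the minority group. Reporting both numbers side by side is one simple way to expose that gap.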