[Short] Exploration into gradient-based coreset methods for targeted subset selection

Evelyn Zhu; Neha Hulkund; Sara Beery

Published: 02 Mar 2026, Last Modified: 02 Apr 2026

Venue: ICLR 2026 Workshop DATA-FM

Readers: Everyone

License: CC BY 4.0

Keywords: foundation models, targeted subset selection, coreset selection, data curation

Abstract: In real-world machine learning applications (e.g. detecting broken bones in x-rays), models are deployed in specific settings (e.g. a particular hospital) rather than across the domain broadly. Discrepancies between training and deployment distributions lead to suboptimal performance, highlighting the need to curate training data for $\text{\textit{fine-tuned foundation models for specific deployment needs}}$. In this work, we propose a novel algorithm called $\texttt{Grad-Match-ACF}$ and evaluate it against traditional coreset methods on targeted data subset selection for fine-tuning specialized vision models. While traditional coreset methods aim to approximate the training distribution, $\texttt{Grad-Match-ACF}$ reformulates coreset selection for out-of-distribution targets by explicitly aligning the coreset budget with the label class frequencies of the deployment. We demonstrate that $\texttt{Grad-Match-ACF}$ performs best across most deployments on the DataS$^3$ benchmark. Beyond better aligning with the objective of targeted subset selection, $\texttt{Grad-Match-ACF}$ achieves up to an 18x speed-up over state-of-the-art gradient-matching coreset methods for real-world, long-tailed deployments. While $\texttt{Grad-Match-ACF}$ is more reliable than traditional gradient-matching coreset methods when selecting from large datasets, there is still room to improve the scalability of these methods for targeted subset selection.
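The abstract describes aligning the coreset budget with the deployment's label class frequencies but does not spell out the procedure. The following is a minimal sketch of one way such a per-class budget split could look; it is not the authors' implementation and it omits the gradient-matching step entirely. The function name `allocate_class_budget`, the largest-remainder rounding, and the assumption that deployment class frequencies are estimated from a small labeled deployment sample are all hypothetical choices made for illustration.

```python
from collections import Counter


def allocate_class_budget(deployment_labels, pool_labels, budget):
    """Split `budget` across classes in proportion to deployment label counts.

    Hypothetical helper: illustrates class-frequency-aligned budgeting only.
    A per-class selection method would then pick `alloc[c]` items of class c.
    """
    deploy_counts = Counter(deployment_labels)
    pool_counts = Counter(pool_labels)
    total = sum(deploy_counts.values())

    # Ideal fractional share of the budget for each deployment class.
    shares = {c: budget * n / total for c, n in deploy_counts.items()}

    # Round down, and never request more items than the training pool holds.
    alloc = {c: min(int(s), pool_counts.get(c, 0)) for c, s in shares.items()}

    # Hand out the remaining slots, largest fractional remainder first,
    # skipping classes whose pool is already exhausted.
    leftover = budget - sum(alloc.values())
    order = sorted(shares, key=lambda c: shares[c] - int(shares[c]), reverse=True)
    while leftover > 0:
        progressed = False
        for c in order:
            if leftover == 0:
                break
            if alloc[c] < pool_counts.get(c, 0):
                alloc[c] += 1
                leftover -= 1
                progressed = True
        if not progressed:
            # The pool cannot fill the budget; return a smaller allocation.
            break
    return alloc


# Toy long-tailed deployment: class "a" dominates, class "c" is rare.
deployment = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
pool = ["a"] * 100 + ["b"] * 100 + ["c"] * 100
per_class_budget = allocate_class_budget(deployment, pool, budget=10)
```

In this toy case the budget of 10 follows the 6:3:1 deployment ratio. Budgeting per class also means any subsequent selection can run on each class's pool separately rather than on the full training set, though whether that accounts for the reported speed-up is not stated in the abstract.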

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 72
