Keywords: foundation models, targeted subset selection, coreset selection, data curation
Abstract: In real-world machine learning applications (e.g., detecting broken bones in x-rays), models are deployed in specific settings (e.g., a particular hospital) rather than across the domain broadly. Discrepancies between training and deployment distributions lead to suboptimal performance, highlighting the need to curate training data for $\text{\textit{fine-tuned foundation models for specific deployment needs}}$. In this work, we propose a novel algorithm called $\texttt{Grad-Match-ACF}$ and evaluate it against traditional coreset methods on targeted data subset selection for fine-tuning specialized vision models. While traditional coreset methods aim to approximate the training distribution, $\texttt{Grad-Match-ACF}$ reformulates coreset selection for out-of-distribution targets by explicitly aligning the coreset budget with the label class frequencies of the deployment. We demonstrate that $\texttt{Grad-Match-ACF}$ performs best across most deployments on the DataS$^3$ benchmark. Beyond better aligning with the objective of targeted subset selection, $\texttt{Grad-Match-ACF}$ achieves up to an 18x speed-up over state-of-the-art gradient-matching coreset methods for real-world, long-tailed deployments. Although $\texttt{Grad-Match-ACF}$ is more reliable than traditional gradient-matching coreset methods when selecting from large datasets, there is still room to improve the scalability of these methods for targeted subset selection.
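The idea of aligning the coreset budget with the deployment's label class frequencies can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `allocate_class_budget`, the proportional split, and the largest-remainder rounding are all assumptions made for illustration. A per-class selection step (not shown) would then pick that many examples from each class.

```python
from collections import Counter


def allocate_class_budget(deployment_labels, budget):
    """Split a coreset budget across classes in proportion to the
    label frequencies observed in the deployment (hypothetical helper).

    Largest-remainder rounding keeps the per-class budgets summing
    exactly to `budget`.
    """
    if not deployment_labels:
        raise ValueError("deployment_labels must be non-empty")

    counts = Counter(deployment_labels)
    total = sum(counts.values())

    # Ideal fractional share of the budget for each class.
    shares = {c: budget * n / total for c, n in counts.items()}

    # Start from the floor of each share.
    alloc = {c: int(s) for c, s in shares.items()}
    leftover = budget - sum(alloc.values())

    # Give the remaining slots to the classes with the largest remainders.
    by_remainder = sorted(shares, key=lambda c: shares[c] - alloc[c], reverse=True)
    for c in by_remainder[:leftover]:
        alloc[c] += 1
    return alloc


# Toy deployment: 20% "fracture", 80% "normal" (made-up labels).
deployment_labels = ["fracture"] * 2 + ["normal"] * 8
per_class = allocate_class_budget(deployment_labels, budget=5)
```

Under this assumed scheme, a long-tailed deployment spends most of the budget on its frequent classes, regardless of how those classes are distributed in the training pool.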
Submission Number: 72