Keywords: data scarcity; supervised learning; data augmentation
TL;DR: Proposes a method for model trainers to acquire task-relevant data from external data providers
Abstract: Machine learning (ML) models often require large amounts of data to perform well.
When the available data is limited, model trainers may need to acquire more data from external sources.
Often, useful data is held by private entities who are hesitant to share it due to proprietary and privacy concerns.
This makes it challenging and expensive for model trainers to acquire the data they need to improve model performance.
To address this challenge, we propose $\texttt{Mycroft}$, a
data-efficient method that enables model trainers to evaluate the relative
utility of different data sources while working with a constrained data-sharing
budget. By leveraging feature space distances and gradient matching, $\texttt{Mycroft}$
identifies small but informative data subsets from each owner, allowing model
trainers to maximize performance with minimal data exposure. Experimental
results across four tasks in two domains show that $\texttt{Mycroft}$ converges rapidly
to the performance of the full-information baseline, where all data is shared.
Moreover, $\texttt{Mycroft}$ is robust to noise and can effectively rank data owners by
utility. $\texttt{Mycroft}$ can pave the way for democratized training of high-performance ML models.
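The abstract describes selecting small, informative subsets from each data owner via feature-space distances. A minimal sketch of that idea, assuming the trainer shares feature embeddings of its hard examples and each owner's samples are scored by nearest-neighbor distance in that feature space (the function name and scoring rule here are illustrative, not the paper's exact algorithm):

```python
import numpy as np

def select_subset_by_feature_distance(owner_feats, trainer_feats, k):
    """Hypothetical sketch: score each owner sample by its minimum
    Euclidean distance in feature space to the trainer's hard examples,
    then return the indices of the k closest owner samples."""
    # Pairwise distances, shape (n_owner, n_trainer).
    dists = np.linalg.norm(
        owner_feats[:, None, :] - trainer_feats[None, :, :], axis=-1
    )
    # Each owner sample is scored by its nearest trainer example.
    scores = dists.min(axis=1)
    return np.argsort(scores)[:k]

# Toy usage with random features (illustrative only).
rng = np.random.default_rng(0)
owner_feats = rng.normal(size=(100, 16))    # 100 owner samples, 16-d features
trainer_feats = rng.normal(size=(10, 16))   # 10 hard examples from the trainer
subset = select_subset_by_feature_distance(owner_feats, trainer_feats, k=5)
```

The same ranking could then be aggregated per owner (e.g., mean score of the selected subset) to compare data sources under a fixed sharing budget.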
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7326