Mycroft: Towards Effective and Efficient External Data Augmentation

Zain Sarwar; Van Tran; Arjun Nitin Bhagoji; Nick Feamster; Ben Y. Zhao; Supriyo Chakraborty

26 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: data scarcity; supervised learning; data augmentation

TL;DR: Proposes a method for model trainers to acquire task-relevant data from external data providers

Abstract: Machine learning (ML) models often require large amounts of data to perform well. When the available data is limited, model trainers may need to acquire more data from external sources. Often, useful data is held by private entities that are hesitant to share it due to proprietary and privacy concerns. This makes it challenging and expensive for model trainers to acquire the data they need to improve model performance. To address this challenge, we propose $\texttt{Mycroft}$, a data-efficient method that enables model trainers to evaluate the relative utility of different data sources while working with a constrained data-sharing budget. By leveraging feature space distances and gradient matching, $\texttt{Mycroft}$ identifies small but informative data subsets from each owner, allowing model trainers to maximize performance with minimal data exposure. Experimental results across four tasks in two domains show that $\texttt{Mycroft}$ converges rapidly to the performance of the full-information baseline, where all data is shared. Moreover, $\texttt{Mycroft}$ is robust to noise and can effectively rank data owners by utility. $\texttt{Mycroft}$ can pave the way for democratized training of high-performance ML models.
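
As a rough illustration of the feature-space-distance idea mentioned in the abstract, the sketch below picks, under a fixed sharing budget, the data-owner samples whose feature vectors lie closest to the samples the trainer's model struggles with. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the distance metric, the feature extractor, or the selection rule, so the Euclidean metric, the nearest-neighbor scoring, and the names `select_nearest_subset`, `owner_feats`, `hard_feats`, and `budget` are all hypothetical. The gradient-matching component is not covered here.

```python
from math import dist


def select_nearest_subset(owner_feats, hard_feats, budget):
    """Return indices of up to `budget` owner samples nearest to any hard sample.

    All names and the Euclidean nearest-neighbor rule are illustrative
    assumptions; the abstract does not fix these details.
    """
    # Score each owner sample by its distance to the closest hard sample.
    scores = [min(dist(x, h) for h in hard_feats) for x in owner_feats]
    # Rank owner samples from closest to farthest (sorted() is stable on ties).
    ranked = sorted(range(len(owner_feats)), key=lambda i: scores[i])
    # Share only as many samples as the budget allows.
    return ranked[:budget]


# Toy feature vectors: four samples held by a data owner, and two samples
# on which the trainer's current model performs poorly.
owner_feats = [[0.0, 0.0], [5.0, 5.0], [1.0, 1.0], [10.0, 10.0]]
hard_feats = [[1.0, 1.0], [0.9, 1.2]]

selected = select_nearest_subset(owner_feats, hard_feats, budget=2)
print(selected)  # [2, 0]
```

In this toy setting the owner would share only samples 2 and 0, which keeps data exposure within the budget while still targeting the region of feature space where the trainer's model is weakest.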

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 7326
