Hide-and-Seek: Data Sharing with Customizable Machine Learnability and Privacy

Hairuo Xu, Tao Shu

Published: 01 Jan 2024, Last Modified: 15 May 2025, ICCCN 2024, CC BY-SA 4.0

Abstract: With the immense amount of publicly available data online, many companies and research institutes can download data for free and train machine learning models that ultimately lead to products enhancing our everyday lives. While enjoying the advantages of such a large amount of free data, people (data providers or data owners) are concerned that their personal data may be crawled without their consent. This exposes an underlying issue in machine learning: in the current literature and applications, dataset owners (also referred to as "dataset providers" in the following text) can only choose between two extremes, either sharing their data entirely or not sharing any of it. Put another way, either the privacy of the dataset is completely revealed through full disclosure, which benefits the potential consumers of the dataset (referred to as dataset users/buyers in the following text), or the dataset is not shared at all, which preserves privacy but impedes the development of new technologies. In this paper, we propose the novel Hide-and-Seek data sharing framework, which serves as a middle ground between the "share or no share" extremes. It provides a "partial share" option based on the consumers’ needs, and hence protects part of the dataset providers’ privacy while sharing enough data for users to train their models to a desired accuracy. Extensive experiments have been conducted on the CIFAR-10, Street View House Numbers (SVHN), and CIFAR-100 datasets. Our experimental results verify the effectiveness of the proposed Hide-and-Seek framework.
We also show in the experiments that our framework is able to protect the data provider’s privacy without changing the visual patterns of the dataset, and therefore does not affect the regular usage of the data (such as using it as a profile photo).