Machine Learning with Feature Differential Privacy
Keywords: Differential Privacy
TL;DR: We propose an algorithm to optimize a loss function while preserving the privacy of some of the features and leaking others.
Abstract: Machine learning applications incorporating differential privacy frequently face significant utility degradation. One prevalent solution involves enhancing utility through the use of publicly accessible information. Public data points, well-known for their utility-enhancing capabilities in private training, have received considerable attention. However, it is worth noting that these public sources can vary substantially in their nature. In this work, we explore the feasibility of leveraging public features from the private dataset. For instance, consider a tabular dataset in which some features are publicly accessible while others need to be kept private. We delve into this scenario, defining a concept we refer to as feature DP. We examine feature DP in the context of private optimization, and propose a solution based on the widely used DP-SGD framework. Notably, our framework maintains the advantage of privacy amplification through sub-sampling, even while some features are disclosed. We analyze our algorithm for Lipschitz and convex loss functions and establish privacy and excess empirical risk bounds. Importantly, because our strategy harnesses privacy amplification via sub-sampling, our excess risk bounds converge to zero as the number of data points increases. This enables us to improve upon previously known excess risk bounds for label differential privacy, and provides a response to an open question posed by Ghazi et al. (2021). We applied our methodology to the Purchase100 dataset, finding that the public features accommodated by our framework can indeed improve the balance between utility and privacy.
Submission Number: 94
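To make the abstract's setting concrete, here is a minimal sketch of a DP-SGD step of the kind the framework builds on: per-example gradients are clipped and Gaussian noise is added before the update. This is a generic DP-SGD illustration, not the authors' feature-DP algorithm; in their setting, the noise would only need to mask the contribution of the private features, while the public features are disclosed. All function and parameter names here are our own illustrative choices.

```python
import numpy as np

def dp_sgd_step(w, per_example_grads, clip_norm=1.0, noise_mult=1.0,
                lr=0.1, rng=None):
    """One (generic) DP-SGD update: clip, average, noise, step.

    per_example_grads: array of shape (batch_size, dim), one gradient per example.
    In a feature-DP variant (per the paper's high-level description), noise
    would target only the gradient components that depend on private features.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Clip each per-example gradient to L2 norm at most clip_norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)
    # Average over the (sub-sampled) batch; sub-sampling is what enables
    # the privacy amplification the abstract refers to.
    g = clipped.mean(axis=0)
    # Add Gaussian noise scaled to the clipping norm and batch size.
    sigma = noise_mult * clip_norm / len(per_example_grads)
    g = g + rng.normal(0.0, sigma, size=g.shape)
    return w - lr * g
```

With `noise_mult=0.0` this reduces to ordinary clipped SGD, which makes the clipping behavior easy to check in isolation.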