Differential Privacy Under Class Imbalance: Methods and Empirical Insights

Published: 01 May 2025, Last Modified: 18 Jun 2025. ICML 2025 poster. License: CC BY 4.0
TL;DR: We consider imbalanced learning under differential privacy, presenting negative results on some existing natural approaches and positive results on modified algorithms, along with empirical evaluations.
Abstract: Imbalanced learning occurs in classification settings where the distribution of class labels is highly skewed in the training data, such as when predicting rare diseases or in fraud detection. This class imbalance presents a significant algorithmic challenge, which can be further exacerbated when privacy-preserving techniques such as differential privacy (DP) are applied to protect sensitive training data. Our work formalizes these challenges and provides a number of algorithmic solutions. We consider DP variants of pre-processing methods that privately augment the original dataset to reduce the class imbalance, alongside DP variants of in-processing techniques, which adjust the learning algorithm to account for the imbalance. For each method, we either adapt an existing imbalanced learning technique to the private setting or demonstrate its incompatibility with differential privacy. Finally, we empirically evaluate these privacy-preserving imbalanced learning methods under various data and distributional settings. We find that private synthetic data methods perform well as a data pre-processing step, while class-weighted ERMs are an alternative in higher-dimensional settings where private synthetic data suffers from the curse of dimensionality.
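
As a rough illustration of the in-processing idea described in the abstract, the sketch below trains a class-weighted logistic regression with a DP-SGD-style update: per-example gradients are re-weighted by class, clipped to a fixed norm, and perturbed with Gaussian noise before each step. This is a minimal sketch, not the paper's implementation; the function name dp_weighted_logreg, the inverse-frequency weighting, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of class-weighted
# logistic regression trained with a DP-SGD-style update.
import numpy as np

def dp_weighted_logreg(X, y, epochs=20, lr=0.1, clip=1.0,
                       noise_mult=1.0, batch_size=64, rng=None):
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    # Assumed weighting scheme: up-weight each class by inverse frequency.
    pos_frac = y.mean()
    class_wts = {1: 0.5 / pos_frac, 0: 0.5 / (1.0 - pos_frac)}
    for _ in range(epochs):
        idx = rng.choice(n, size=min(batch_size, n), replace=False)
        grads = []
        for i in idx:
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))          # sigmoid prediction
            g = class_wts[int(y[i])] * (p - y[i]) * X[i]  # weighted per-example gradient
            g *= min(1.0, clip / (np.linalg.norm(g) + 1e-12))  # clip to norm <= clip
            grads.append(g)
        noise = rng.normal(0.0, noise_mult * clip, size=d)     # Gaussian mechanism
        w -= lr * (np.sum(grads, axis=0) + noise) / len(idx)
    return w
```

The sketch omits the privacy accounting that maps noise_mult, batch size, and the number of steps to an (epsilon, delta) guarantee; in practice that bookkeeping is essential and would accompany any such training loop.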
Lay Summary: Data with rare (but potentially important) events, such as fraudulent transactions or uncommon diseases, are hard for machine learning models to classify accurately, especially when we add privacy restrictions that protect people's data. We found that popular quick fixes for this "class imbalance," like copying rare examples, can actually violate the strong privacy standard, known as differential privacy, to which we try to adhere. We investigated two ways to address this limitation: (i) creating realistic, differentially private synthetic data that boosts the rare class without exposing anyone's records, and (ii) a training approach that lets the model pay extra attention to scarce cases while still respecting privacy limits. We then evaluated these methods across eight real-world, class-imbalanced datasets and discussed what worked well and what didn't.
Primary Area: Social Aspects->Privacy
Keywords: differential privacy, imbalanced learning, synthetic data, binary classification
Submission Number: 6420