Keywords: ideal data, fairness, bayes optimal classifier, affirmative action
TL;DR: Conditions under which a data distribution is tradeoff-free between fairness and accuracy.
Abstract: To fix the ‘bias in, bias out’ issue in fair machine learning, it is essential to obtain ideal training and validation data. Collecting ideal real-world data or generating ideal synthetic data requires a formal specification of an ideal distribution that guarantees fair outcomes by downstream models. Previous work on fair pre-processing does not address this gap and would benefit significantly from resolving it. We call a distribution ideal if the minimizer of any cost-sensitive risk on it is guaranteed to satisfy exact fairness (e.g., demographic parity, equal opportunity). Given any data distribution for fair classification, we formulate an optimization program to find its nearest ideal distribution in KL-divergence. This optimization is intractable as stated, but we show how it can be solved efficiently when the distributions come from well-known parametric families (e.g., normal, log-normal). We empirically show on synthetic datasets that our ideal distributions are close to the given distributions, and that they can often suggest directions in which to steer the original distribution so as to improve both accuracy and fairness simultaneously.
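To make the abstract's idea concrete, here is a minimal sketch (not the paper's actual formulation; all parameter values are illustrative) of projecting a data distribution onto a fairness constraint in KL-divergence, for the simple case of two group-conditional univariate normal score distributions and a fixed threshold classifier, with demographic parity as the exact-fairness criterion:

```python
# Hedged sketch: find the nearest (in KL-divergence) pair of
# group-conditional normals under which a fixed threshold classifier
# satisfies demographic parity (equal acceptance rates across groups).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def kl_normal(mu1, s1, mu0, s0):
    """KL( N(mu1, s1^2) || N(mu0, s0^2) ) for univariate normals."""
    return np.log(s0 / s1) + (s1**2 + (mu1 - mu0) ** 2) / (2 * s0**2) - 0.5


# Original group-conditional score distributions (illustrative values):
# group A scores ~ N(0.5, 1), group B scores ~ N(-0.5, 1); threshold t = 0.
mu_a0, mu_b0, sigma, t = 0.5, -0.5, 1.0, 0.0


def objective(mus):
    # Total KL from the candidate distribution to the original one.
    mu_a, mu_b = mus
    return kl_normal(mu_a, sigma, mu_a0, sigma) + kl_normal(mu_b, sigma, mu_b0, sigma)


def parity_gap(mus):
    # Difference in acceptance rates P(X > t) of the threshold classifier.
    mu_a, mu_b = mus
    return norm.sf(t, mu_a, sigma) - norm.sf(t, mu_b, sigma)


res = minimize(objective, x0=[mu_a0, mu_b0],
               constraints={"type": "eq", "fun": parity_gap})
mu_a, mu_b = res.x
```

With equal variances, equal acceptance rates force equal means, so the KL-nearest ideal distribution here simply pulls the two group means together; the direction of that shift is the kind of "steering" signal the abstract refers to.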
Submission Number: 73