%\documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
%\documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools,appendix} % amsmath with fixes and additions
\usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{subfigure,algorithm,algorithmic}
\usepackage{amsfonts,amsmath,amsthm}
\usepackage{multirow}
\usepackage{color}
\usepackage{makecell}

\newtheorem{assumption}{Assumption}
\newtheorem{remark}{Remark}
\newtheorem{lemma}{Lemma}
\newtheorem{definition}{Definition}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}{Corollary}
\newtheorem{proposition}{Proposition}
%\allowdisplaybreaks
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{On the Inductive Biases of Demographic Parity-based Fair Learning Algorithms}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<hylei22@cse.cuhk.edu.hk>?Subject=Your UAI 2024 paper}{Haoyu~Lei}{}}
\author[2]{\href{mailto:<agohari@ie.cuhk.edu.hk>?Subject=Your UAI 2024 paper}{Amin~Gohari}}
\author[1]{\href{mailto:<farnia@cse.cuhk.edu.hk>?Subject=Your UAI 2024 paper}{Farzan~Farnia}}
% Add affiliations after the authors
\affil[1]{%
    Department of Computer Science and Engineering\\
    The Chinese University of Hong Kong
}
\affil[2]{%
    Department of Information Engineering\\
    The Chinese University of Hong Kong
}
  
\begin{document}
\maketitle

\begin{abstract}

Fair supervised learning algorithms assigning labels with little dependence on a sensitive attribute have attracted great attention in the machine learning community. While the demographic parity (DP)  notion has been frequently used to measure a model's fairness in training fair classifiers, several studies in the literature suggest potential impacts of enforcing DP in fair learning algorithms. In this work, we analytically study the effect of standard DP-based regularization methods on the conditional distribution of the predicted label given the sensitive attribute. Our analysis shows that an imbalanced training dataset with a non-uniform distribution of the sensitive attribute could lead to a classification rule biased toward the sensitive attribute outcome holding the majority of training data. To control such inductive biases in DP-based fair learning, we propose a sensitive attribute-based distributionally robust optimization (SA-DRO) method improving robustness against the marginal distribution of the sensitive attribute. Finally, we present several numerical results on the application of DP-based learning methods to standard centralized and distributed learning problems. The empirical findings support our theoretical results on the inductive biases in DP-based fair learning algorithms and the debiasing effects of the proposed SA-DRO method. The project code is available at \url{github.com/lh218/Fairness-IB.git}.


\end{abstract}

\section{Introduction}
A responsible deployment of modern machine learning frameworks in high-stakes decision-making tasks requires mechanisms for controlling the dependence of their output on sensitive attributes such as gender and ethnicity.  
A supervised learning framework with no control over the dependence of the prediction on the input features could lead to discriminatory decisions that significantly correlate with the sensitive attributes. Due to the critical importance of fairness in several machine learning applications, the study and development of fair statistical learning algorithms have received great attention in the literature.



A widely used approach to fair supervised learning is to include a fairness regularization penalty term in the learning objective that quantifies the level of fairness violation according to a fairness notion. A standard fairness notion is \emph{demographic parity (DP)}, which aims for a prediction variable $\widehat{Y}$ that is statistically independent of a sensitive attribute $S$.
Therefore, a DP-based fairness regularization metric should be a measure of the dependence of the prediction $\widehat{Y}$ on the sensitive attribute $S$. In the literature, several dependence measures from statistics and information theory have been used to develop DP-based fair learning methodologies \citep{zafar2017fairness,mary2019fairness, baharlouei2019r,rezaei2020fairness,cho2020afair,cho2020bfair,lowy2022stochastic}.

In practice, standard DP-based fair classification methods usually succeed in significantly reducing the DP fairness violation while mostly preserving the model's original accuracy on test data. Therefore, an accuracy-based evaluation of the DP-based trained models often suggests that the improvement in the DP fairness metric can be significantly higher than the loss in the model's prediction accuracy. On the other hand, well-known studies including \citet{dwork2012fairness} and \citet{hardt2016equality} have raised concerns about the potential impacts of DP-based fairness evaluation, which can disproportionately increase the inaccuracy rate among minority subgroups. To address these concerns, \citet{hardt2016equality} propose and promote a different fairness notion, \emph{equalized odds (EO)}, where the goal is a prediction variable $\widehat{Y}$ that is conditionally independent of the sensitive attribute $S$ given the true label $Y$. 

In this work, we study and analyze the inductive biases of DP-based fair learning algorithms. We aim to theoretically and empirically demonstrate the biases induced by a DP-based learning framework toward the majority sensitive attribute outcome under an imbalanced distribution of the sensitive attribute over the target population. To this end, we provide theoretical results indicating the biases of DP-based fair decision rules toward the label distribution conditioned on the sensitive attribute-based majority subgroup with an occurrence probability greater than $\frac12$. 
We show the existence of such a prediction distribution in a DP-based fair learning algorithm formulated by constraining the difference of demographic parity (DDP).
 

To reduce the biases of DP-based learning algorithms, we propose a \emph{sensitive attribute-based distributionally robust optimization (SA-DRO)} method where the fair learner minimizes the worst-case DP-regularized loss over a set of sensitive attribute marginal distributions centered around the data-based marginal distribution. As a result, the SA-DRO approach can account for different frequencies of the sensitive attribute outcomes and thus offer a robust behavior to the changes in the sensitive attribute's majority outcome.  

We present the results of several numerical experiments on the potential biases of DP-based fair classification methodologies toward the sensitive attribute outcome holding the majority in the dataset. Our empirical findings are consistent with the theoretical results, suggesting the inductive biases of DP-based fair classification rules toward the sensitive attribute-based majority group. On the other hand, our results indicate that the SA-DRO-based fair learning method results in fair classification rules with a lower bias toward the label distribution under the majority sensitive attribute. The following is a summary of this work's main contributions:
\begin{itemize}[leftmargin=*]
    \item Analytically studying the biases of DP-based fair learning toward the majority sensitive attribute, 
    \item Proposing a distributionally robust optimization method to lower the biases of DP-based fair classification,
    \item Providing numerical results on the biases of DP-based fair learning in centralized and federated learning scenarios.  
\end{itemize}

\section{Related Works}
\textbf{Fairness Violation Metrics}.
In this work, we focus on the learning frameworks aiming toward demographic parity (DP). Since enforcing DP to strictly hold could be costly and damaging to the learner's performance, the machine learning literature has proposed applying several metrics assessing the dependence between random variables, including
the mutual information \citep{kamishima2011fairness,rezaei2020fairness,zhang2018mitigating,cho2020afair,roh2020fr}, Pearson correlation \citep{zafar2017fairness,beutel2019putting}, kernel-based maximum mean discrepancy \citep{prost2019toward}, kernel density estimation of the difference of demographic parity (DDP) measures \citep{cho2020bfair}, the maximal correlation \citep{mary2019fairness,baharlouei2019r,grari2019fairness,grari2021learning}, and the exponential Rényi mutual information \citep{lowy2022stochastic}. In our analysis, we mostly focus on a DDP-based fair regularization scheme, while we show that weaker versions of the inductive biases could also hold in the case of mutual information and maximal correlation-based fair learning algorithms. 

In addition to DP, the notions of equalized odds and equal opportunity \citep{hardt2016equality} are standard fairness notions in the literature, where the learner aims for a decision variable $\widehat{Y}$ that is conditionally independent of the sensitive attribute $S$ given the label $Y$. We note that the mentioned frameworks based on dependence measures can also be aimed at equalized odds, in which case the dependence measure should be conditioned on the label $Y$. However, our findings do not apply to the equalized odds fairness notion or to the extensions of the dependence measure-based learning algorithms aiming for equalized odds.  

\textbf{Fair Classification Algorithms.}
Fair machine learning algorithms can be classified into three main categories: pre-processing, post-processing, and in-processing. Pre-processing algorithms \citep{feldman2015certifying,zemel2013learning,calmon2017optimized} transform biased data features into a new space where labels and sensitive attributes are statistically independent. 
Post-processing methods such as \citep{hardt2016equality,pleiss2017fairness} aim to alleviate the discriminatory impact of a classifier by modifying its ultimate decision. The focus of our work is only on in-processing approaches regularizing the training process toward DP-based fair models. Also, \citet{hashimoto2018fairness,wang2020robust,lahoti2020fairness} propose distributionally robust optimization (DRO) for fair classification; however, unlike our method, these works do not apply DRO on the sensitive attribute distribution to reduce the biases. 

\textbf{Fairness-aware Imbalanced Learning.} 
To address the challenges of generalization in machine learning models, particularly when handling highly imbalanced classes and limited samples within each class, well-known imbalanced learning methods such as those of \citet{lin2017focal} and \citet{cao2019learning} have been proposed. More specifically, several works \citep{iosifidis2020online,subramanian2021fairness,deng2022fifa,tarzanagh2023fairness} extend these methods to fairness-aware imbalanced learning, dealing with imbalanced subgroups based on sensitive attributes. Compared to those methods, our SA-DRO method has more flexibility in exploring the trade-off between accuracy and inductive bias, which is controlled by varying the coefficient of the regularization term.


\section{Preliminaries}
\subsection{Fair Supervised Learning}

To achieve fairness in supervised learning, the decision-making process should not unfairly advantage or disadvantage any particular group of people based on their demographic characteristics such as race, gender, or age, which we refer to as the sensitive attribute in this paper. In this setting, we suppose the learner has access to labeled training data $(\mathbf{x}_i,y_i,s_i)_{i=1}^n$ independently drawn from the underlying distribution $P_{\mathbf{X},Y,S}$. Here, $\mathbf{X}\in\mathcal{X}\subseteq\mathbb{R}^d$ is the $d$-dimensional feature vector, $Y\in\mathcal{Y}$ denotes the label variable, and $S\in\mathcal{S}$ denotes the sensitive attribute, which we suppose is provided for the training data. 

In the supervised learning problem, the learner selects a function $f\in \mathcal{F}$, where $\mathcal{F}$ is the set of prediction functions mapping the observed $(\mathbf{X},S)$ to the label space $\mathcal{Y}$. We use a loss function $\ell:\mathcal{Y}\times\mathcal{Y}\rightarrow \mathbb{R}$ to quantify the loss $\ell(\widehat{y},y)$ when predicting $\widehat{y}$ under a true label $y$. Specifically, we consider the 0/1 loss $\ell_{0/1}(\widehat{y},y)=\mathbf{1}(\widehat{y}\neq y)$, where $\mathbf{1}(\cdot)$ denotes the indicator function. The primary goal of the fair supervised learner is to find prediction rules $f\in\mathcal{F}$ achieving smaller values of the risk function $\mathbb{E}_{(\mathbf{X},Y,S)\sim P}\bigl[\ell(f(\mathbf{X},S),Y)\bigr]$ while having little dependence on $S$ according to the factors explained in the next subsections.

\subsection{Fairness Criteria}
In a fair supervised learning algorithm, the learned prediction rule is expected to meet a fairness criterion. Here, we review two standard fairness criteria in the literature:

\begin{itemize}[leftmargin=*]
    \item \textbf{Demographic parity (DP)} is a fairness condition that requires the prediction $\widehat{Y}$ to be statistically independent of the sensitive attribute, $S$, i.e., for every $\widehat{y}\in \mathcal{Y}, s \in \mathcal{S}$
\begin{equation*}
\centering
 P\bigl(\widehat{Y}=\widehat{y}\, \big\vert\, S=s\bigr) = P\bigl(\widehat{Y}=\widehat{y}\bigr)
\end{equation*}
where $\widehat{Y}=f(\mathbf{X},S)$ represents the predicted label. A standard quantification of the violation of DP is the Difference of Demographic Parity (DDP):
\begin{align*}
\centering
\mathrm{DDP}(\widehat{Y},S) =\hspace{-2.5mm} \sum_{y\in\mathcal{Y} , s \in \mathcal{S}} \Bigl\vert P(\widehat{Y}=y\vert S=s) - P(\widehat{Y}=y) \Bigr\vert
\end{align*}

\item \textbf{Equalized Odds (EO)} \citep{hardt2016equality} is a fairness condition requiring the predicted label $\widehat{Y}$ to be conditionally independent of the sensitive attribute $S$ given the actual label $Y$, i.e., for every $s \in \mathcal{S}, y,\widehat{y} \in\mathcal{Y}$
\begin{align*}
\centering
 P\bigl(\widehat{Y}=\widehat{y}\, \big\vert \, Y=y , S=s\bigr) = P\bigl(\widehat{Y}=\widehat{y}\, \big\vert \, Y=y\bigr). 
\end{align*}
A sensible measurement of the lack of EO is the Difference of Equalized Odds (DEO):
\begin{align*}
\centering
\mathrm{DEO}(\widehat{Y},S|Y) = \hspace{-2.5mm}\sum_{ s \in \mathcal{S},y,\hat{y}\in\mathcal{Y} } \Bigl\vert\,  &P(\widehat{Y}=\widehat{y}\, \big\vert\, Y=y, S=s)  \\
& - P\bigl(\widehat{Y}=\widehat{y}\,\big\vert\, Y=y\bigr) \Bigr\vert
\end{align*}
\end{itemize}
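
As a hedged numerical illustration of the DDP measure (the numbers are hypothetical, chosen only for illustration, and are not taken from our experiments), consider binary $\widehat{Y},S\in\{0,1\}$ with $P(S=0)=0.75$, $P(\widehat{Y}=1\vert S=0)=0.6$, and $P(\widehat{Y}=1\vert S=1)=0.3$. The marginal is $P(\widehat{Y}=1)=0.75\cdot 0.6+0.25\cdot 0.3=0.525$, and since for a binary prediction the terms for $y=0$ and $y=1$ coincide,
\begin{align*}
\mathrm{DDP}(\widehat{Y},S) &= 2\,\vert 0.6-0.525\vert + 2\,\vert 0.3-0.525\vert \\
&= 0.15 + 0.45 \,=\, 0.6.
\end{align*}
In this example, the minority outcome $S=1$ contributes the larger term, because the marginal $P_{\widehat{Y}}$ lies closer to the conditional distribution of the majority outcome $S=0$.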

\subsection{Dependence Measures for Fair Supervised Learning}
To measure the DP-based fairness violation, the machine learning literature has proposed the application of several dependence measures which we analyze in the paper. In the following, we review some of the applied dependence metrics:
\begin{itemize}[leftmargin=*]
    \item \textbf{Mutual Information (MI)}: 
    Mutual information $I(Y;S)$ is a standard measure of the dependence between random variables $Y$ and $S$ used for developing fair learning methods \citep{cho2020afair}.
    The mutual information $I(Y;S)$ is defined as
    \begin{equation*}
       I(Y;S) := \sum_{y\in \mathcal{Y}, s\in\mathcal{S}} P_{Y,S}(y,s) \log \frac{P_{Y,S}(y,s)}{P_{Y}(y)P_S(s)} 
    \end{equation*}
    It can be seen that $I(Y;S)=\mathrm{D}_{\mathrm{KL}}(P_{Y,S} ; P_{Y} P_S )$ is the KL-divergence between the joint distribution $P_{Y,S}$ and the product of marginal distributions $P_{Y}\times P_S$, implying $I(Y;S)=0$ if and only if $Y$ and $S$ are statistically independent, i.e., $Y\bot S$. Note that KL-divergence is a special case of $f$-divergence $d_f(P,Q)=\mathbb{E}_{ Q} [f(P(x)/Q(x))]$  with $f(t)=t\log t$. \vspace{2mm}  
    \item  \textbf{Maximal Correlation (MC)}:
    The maximal correlation $\rho_m(Y,S)$ is the maximum Pearson correlation $\rho_P\bigl(f(Y) ,g(S) \bigr) = \frac{\mathrm{Cov}(f(Y),g(S))}{\sqrt{\mathrm{Var}(f(Y))\mathrm{Var}(g(S))}}$ between $f(Y)$ and $g(S)$ over all functions $f,\, g$. The maximal correlation can be simplified to the optimal value of the following optimization:
    \begin{align*}
        \rho_m(Y,S) := \hspace{-3mm}\sup_{\substack{f,g:\; \mathbb{E}[f(Y)]=\mathbb{E}[g(S)]=0 \\ \qquad\,\mathbb{E}[f^2(Y)]=\mathbb{E}[g^2(S)]=1 }} \hspace{-1.5mm}\mathbb{E}\bigl[f(Y)g(S) \bigr]
    \end{align*}
    Maximal correlation has been utilized as a measure of demographic parity in the literature on fair learning algorithms \citep{mary2019fairness,baharlouei2019r}. 
    \item  \textbf{Exponential Rényi Mutual Information (ERMI)}: The ERMI between random variables $Y$ and $S$, which is considered by \cite{lowy2022stochastic} as the dependence measure of the fairness penalty, is $$\rho_E(Y,S) := \chi^2(P_{Y,S} ; P_{Y}\times P_S ),$$ i.e., the $\chi^2$-divergence between the joint distribution $P_{Y,S}$ and the product of marginal distributions $P_{Y}\times P_S$. Similar to KL-divergence, $\chi^2$-divergence is an $f$-divergence $d_f(P,Q)$  with $f(t)=(t-1)^2$. Similar to the previous two dependence measures, $\rho_E(Y,S)=0$ if and only if $Y,\, S$ are independent.  
\end{itemize}
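
As a rough numerical sketch of these measures (with a hypothetical joint distribution chosen only for illustration), let $Y,S\in\{0,1\}$ with $P(S=0)=0.75$, $P(Y=1\vert S=0)=0.6$, and $P(Y=1\vert S=1)=0.3$, so that $P(Y=1)=0.525$. Every entry of the joint distribution deviates from the product of marginals by the same magnitude $d=\vert P(Y=1,S=0)-P(Y=1)P(S=0)\vert = \vert 0.45-0.39375\vert = 0.05625$, which gives
\begin{align*}
\rho_E(Y,S) \,&=\, \frac{d^2}{P_Y(0)\,P_Y(1)\,P_S(0)\,P_S(1)} \\
&=\, \frac{0.05625^2}{0.475\cdot 0.525\cdot 0.75\cdot 0.25} \,\approx\, 0.068.
\end{align*}
For binary $Y$ and $S$, the zero-mean, unit-variance functions $f,g$ are unique up to sign, so the maximal correlation reduces to the absolute Pearson correlation, and $\rho_m(Y,S)=\sqrt{\rho_E(Y,S)}\approx 0.26$ in this example.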


\section{Inductive Biases of DP-based Fair Supervised Learning}
As discussed earlier, fair learning based on the demographic parity (DP) notion requires a bounded dependence between the classifier's output $\widehat{Y}$ and the sensitive attribute $S$. A standard approach to DP-based fair classification, widely used in the literature, is to target the following optimization problem for a dependence measure $\rho (\widehat{Y} , S)$ between $S$ and the predicted variable $\widehat{Y} = f(\mathbf{X},S)$ given a randomized prediction rule $f\in\mathcal{F}$, where $\mathcal{F}$ is a set of functions mapping $\mathbf{x}\in\mathcal{X},s\in\mathcal{S}$ to a random $\widehat{Y}\in\mathcal{Y}$ with a conditional distribution $P_{\widehat{Y}|\mathbf{X},S}$:
\begin{align}\label{Eq: Fair Classification 1}
   &\min_{f\in\mathcal{F}
   }\qquad\;\; \mathbb{E}_{p_{\mathbf{X},Y,S}}\Bigl[ \ell_{0/1}\bigl( \widehat{Y}, Y\bigr)\Bigr] \\
    &\text{\rm subject to}\;\;\; \rho\bigl(\widehat{Y} , S\bigr) \le \epsilon \nonumber
\end{align}

Our first theorem shows that if one chooses DDP as the dependence measure $\rho$ and $Y$ is a deterministic function of $(\mathbf{X},S)$, then for the optimal solution $\widehat{Y}= f^*(\mathbf{X},S)$ to the above problem, the conditional distribution $P_{\widehat{Y}|S=s}$ for every $s$ will be close to the conditional distribution $P_{{Y}|S=s_{\max}}$ of $Y$ conditioned on the majority sensitive attribute $s_{\max}=\arg\!\max_{s\in\mathcal{S}} P_S(s)$. In the theorem, we use $\mathrm{TV}$ to denote the total variation distance between distributions $P_Y$ and $Q_Y$, defined as
$$ \mathrm{TV}(P_Y , Q_Y) \, := \, \frac{1}{2}\sum_{y\in\mathcal{Y}}\, \bigl\vert P_Y(y) - Q_Y(y)\bigr\vert. $$
\begin{theorem}\label{Thm: Theorem 1 for DDP}
Consider the fair learning problem \eqref{Eq: Fair Classification 1} where $\rho$ is the DDP function and $\mathcal{F}$ is the space of all randomized maps generating all conditional distributions $P_{\widehat{Y}|\mathbf{X},S}$. Suppose that $Y= h(\mathbf{X},S)$ is a deterministic function $h$ of $\mathbf{X},S$. Then, if the majority sensitive attribute $s_{\max}$ satisfies $P(S=s_{\max})=\frac{1}{2}+\delta$ for some $\delta>0$, the following bound holds for the optimal predicted variable $\widehat{Y}=f^*(\mathbf{X},S)$, where $f^*$ is the optimal solution to \eqref{Eq: Fair Classification 1}:
\begin{equation*}
    \forall s\in\mathcal{S}:\quad \mathrm{TV}\Bigl( P_{\,\widehat{Y}|S=s} , P_{\,{Y}|S=s_{\max}}\Bigr) \,\le\, \bigl(\frac{1}{2}+\frac{1}{4\delta}\bigr)\epsilon
\end{equation*}
 \end{theorem}
\begin{proof}
We defer the proof to the Appendix.
\end{proof}
\begin{corollary}\label{corollary: independence}
In the setting of Theorem~\ref{Thm: Theorem 1 for DDP}, if $\epsilon = 0 $, i.e., $\widehat{Y}$ and $S$ are constrained to be statistically independent, then $P(S=s_{\max})>\frac{1}{2}$ results in the following for the optimal predicted variable $\widehat{Y}=f^*(\mathbf{X},S)$:
\begin{equation*}
    \forall s\in\mathcal{S}:\;\; P_{\widehat{Y}|S=s} \, =\,  P_{{Y}|S=s_{\max}}
\end{equation*}
\end{corollary}
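
To illustrate Corollary~\ref{corollary: independence} with a minimal hypothetical example (an illustration under simplifying assumptions, not part of the formal results), let $S,Y\in\{0,1\}$ with $P(S=0)=0.75$, i.e., $s_{\max}=0$ and $\delta=0.25$, and suppose $Y=1$ whenever $S=0$ and $Y=0$ whenever $S=1$. Under $\epsilon=0$, the prediction must satisfy $P(\widehat{Y}=1\vert S=s)=a$ for both $s$ and some $a\in[0,1]$. The smallest 0/1 loss attainable is then $1-a$ within $S=0$ and $a$ within $S=1$, so the risk is
\begin{equation*}
0.75\,(1-a) + 0.25\,a \,=\, 0.75 - 0.5\,a,
\end{equation*}
which is minimized at $a=1$. The optimal fair rule therefore always predicts $\widehat{Y}=1$, so $P_{\widehat{Y}|S=s}=P_{Y|S=0}$ for both $s$, and the entire error of $0.25$ is incurred on the minority outcome $S=1$. For $\epsilon>0$, the bound in Theorem~\ref{Thm: Theorem 1 for DDP} evaluates to $\bigl(\frac{1}{2}+\frac{1}{4\cdot 0.25}\bigr)\epsilon = 1.5\,\epsilon$ in this example.
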
\iffalse
\begin{remark}
   Consider Corollary~\ref{corollary: independence}'s setting with $\epsilon=0$. Then, in a general case where $P(S=s_{\max})\not\ge \frac{1}{2}$, it can be seen that the optimal conditional distribution $P_{\widehat{Y}|S=s}$ will be inductively biased toward the \emph{geometric median} $\mathrm{Geomteric}\text{-}\mathrm{Median}_{s\sim P_S}\bigl[P_{\widehat{Y}|S=s}\bigr]$. Note that when $P(S=s_{\max}){>} \frac{1}{2}$, the geometric median reduces to $P_{Y|S=s_{\max}}$. In this statement, the geometric median is defined based on the total variation metric distance. 
\end{remark}
\fi

The above results show that given a sensitive attribute $s_{\max}$ holding more than half of the training data, the optimal DDP-fair prediction $\widehat{Y}$ will possess a conditional distribution $P_{\widehat{Y}|S=s}$ which, for every $s$, is at a bounded TV-distance from the majority $s_{\max}$-based conditional distribution $P_{Y|S=s_{\max}}$. Therefore, the results indicate the inductive bias of DDP-based fair learning toward the majority sensitive attribute. Next, we show that a weaker version of the DDP-based bias could also hold for the mutual information, ERMI, and maximal correlation-based fair learning.

\begin{theorem}\label{Theorem 2: }
 Consider the fair learning setting in Theorem~\ref{Thm: Theorem 1 for DDP} with a different selection of dependence measure $\rho$. Then,
 \begin{itemize}[leftmargin=*]
     \item assuming $\rho(\widehat{Y},S)$ is the mutual information $I(\widehat{Y};S)$:
\begin{equation*}
    \mathbb{E}_{s\sim P_S}\Bigl[\mathrm{TV}\Bigl( P_{\widehat{Y}|S=s} , P_{{Y}|S=s_{\max}}\Bigr) \Bigr] \le \bigl(\frac{1}{2}+\frac{1}{4\delta}\bigr)\sqrt{\frac{2\epsilon}{\log e}}
\end{equation*}
 \item assuming $\rho(\widehat{Y},S)$ is the ERMI $\rho_E(\widehat{Y},S)$ and defining $u(\epsilon)=\max\bigl\{\epsilon,\sqrt{\epsilon}\bigr\}$:
\begin{equation*}
    \mathbb{E}_{s\sim P_S}\Bigl[\mathrm{TV}\Bigl( P_{\widehat{Y}|S=s} , P_{{Y}|S=s_{\max}}\Bigr) \Bigr]\le \bigl(\frac{1}{2}+\frac{1}{4\delta}\bigr)u(\epsilon)
\end{equation*}
\item assuming $\rho(\widehat{Y},S)$ is maximal correlation $\rho_m(\widehat{Y},S)$ and $r=\min\bigl\{\vert \mathcal{S}\vert , \vert \mathcal{Y}\vert\bigr\} - 1$ ($\vert \cdot\vert$ denotes a set's cardinality):
\begin{equation*}
    \mathbb{E}_{s\sim P_S}\Bigl[\mathrm{TV}\Bigl( P_{\widehat{Y}|S=s} , P_{{Y}|S=s_{\max}}\Bigr) \Bigr] \le \bigl(\frac{1}{2}+\frac{1}{4\delta}\bigr)u(r\epsilon)
\end{equation*}
 \end{itemize}
\end{theorem}
\begin{proof}
    We defer the proof to the Appendix.
\end{proof}
We note the difference between the bias levels shown for the DDP case in Theorem~\ref{Thm: Theorem 1 for DDP} and the other dependence metrics in Theorem~\ref{Theorem 2: }. The bias level for a DDP-based fair learner could be considerably stronger than that of mutual information, ERMI, and maximal correlation-based fair learners, as the worst case of the total variations over $s\in\mathcal{S}$ in Theorem~\ref{Thm: Theorem 1 for DDP} is replaced by their expectation according to $P_S$ in Theorem~\ref{Theorem 2: }.    
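
As a hedged numerical reading of the bounds in Theorem~\ref{Theorem 2: } (the values below are hypothetical and only instantiate the stated inequalities), take $\delta=0.25$, so that $\frac{1}{2}+\frac{1}{4\delta}=1.5$. With mutual information measured in nats (an assumption under which $\log e=1$) and $\epsilon=0.02$, the mutual information bound equals $1.5\sqrt{0.04}=0.3$. With $\epsilon=0.04$, the ERMI bound equals $1.5\,u(0.04)=1.5\cdot 0.2=0.3$, and for binary $Y$ or $S$, where $r=1$, the maximal correlation bound takes the same value. Since the total variation distance never exceeds $1$, the bounds are informative only when their right-hand sides are below $1$, which requires a small $\epsilon$ when $\delta$ is close to $0$.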
%The above theorem shows that the inductive biases discussed under DDP-based fair learning could similarly hold under other dependence measures used in the fairness literature. However, we note that the equalized-odds-based formulations using the conditional version of the dependence measures do not lead to such inductive biases, as the formulation does not require an independent prediction of the sensitive attribute.


\subsection{Extending the Theoretical Results to Randomized Prediction Rules}


Here, we consider the possibility of a randomized mapping from $(\mathbf{X},S)$ to $Y$. Such a possibility needs to be considered when the actual label $Y$ may not be deterministically determined by $\mathbf{X},S$. Therefore, we formulate and analyze the following generalization of the problem formulation in \eqref{Eq: Fair Classification 1}, where we attempt to estimate the conditional distribution $P_{Y|\mathbf{X},S}$: 
\begin{align}\label{Eq: Fair Classification 2}
    &\min_{Q_{\widehat{Y}|\mathbf{X},S}\in \mathcal{Q}} \mathbb{E}_{ P_{\mathbf{X},S}}\Bigl[ \ell_{TV}\bigl( Q_{\widehat{Y}|\mathbf{X}=\mathbf{x},S=s} , P_{Y|\mathbf{X}=\mathbf{x},S=s}\bigr)\Bigr] \\
    &\text{\rm subject to}\;\;\; \rho\bigl(\widehat{Y} , S\bigr) \le \epsilon\nonumber
\end{align}
In this formulation, we aim to find an accurate estimate $Q_{\widehat{Y}|\mathbf{X},S}$ of the conditional distribution $P_{Y|\mathbf{X},S}$ from a feasible set $\mathcal{Q}$, which corresponds to the function set $\mathcal{F}$ in \eqref{Eq: Fair Classification 1}. We measure the learning performance under every outcome $\mathbf{x},s \sim P_{\mathbf{X},S}$ using the total variation loss $\ell_{TV}(P,Q)= \mathrm{TV}(P,Q)$. %Note that the total variation between distributions $p$ and $q$ is defined as
%$$ \ell_{TV}(p , q) \, := \, \frac{1}{2}\sum_{y\in\mathcal{Y}} \bigl\vert p(y) - q(y)\bigr\vert. $$
Note that the total variation loss generalizes the 0/1 loss to the space of probability measures, since it
is the minimum expected 0/1 loss under the optimal coupling between the marginal distributions: 
\begin{equation*}
    \ell_{\mathrm{TV}}(P,Q) \, = \, \min_{ \substack{M_{\widehat{Y},Y}:\: M_{\widehat{Y}}=P\\ \qquad\;\;\, M_{Y}=Q}} \; \; \mathbb{E}_M\Bigl[ \ell_{0/1}\bigl(\widehat{Y}, Y\bigr)\Bigr].
\end{equation*}
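For instance, for binary distributions with $P(1)=0.3$ and $Q(1)=0.6$ (hypothetical values chosen only for illustration), we have $\mathrm{TV}(P,Q)=\frac{1}{2}\bigl(\vert 0.7-0.4\vert+\vert 0.3-0.6\vert\bigr)=0.3$. The coupling with $M(0,0)=0.4$, $M(1,1)=0.3$, $M(0,1)=0.3$, and $M(1,0)=0$, where the first argument is $\widehat{Y}$ and the second is $Y$, has marginals $M_{\widehat{Y}}=P$ and $M_Y=Q$ and attains the expected 0/1 loss $M(0,1)+M(1,0)=0.3$, matching the total variation distance.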
Therefore, if under both $Q$ and $P$, $Y$ is determined deterministically by $\mathbf{X},S$, the above TV-loss will be the same as the expected 0/1 loss of the deterministic classification rule following such $Q_{\widehat{Y}|\mathbf{X},S}$. In the following theorem, we attempt to relax the assumptions in Theorems \ref{Thm: Theorem 1 for DDP}-\ref{Theorem 2: } to apply them to learning settings where $Y$ may not be completely determined by $\mathbf{X},S$.%, and instead we use the following assumption on the distribution $p_{\mathbf{X},Y,S}$.

\begin{theorem}\label{Thm: Non-deterministic}
    Consider the settings in Theorem~\ref{Thm: Theorem 1 for DDP} and Theorem~\ref{Theorem 2: } where we instead consider the generalized formulation \eqref{Eq: Fair Classification 2} and do not require that $Y$ is a function of $\mathbf{X},S$. Suppose a function $\phi:\mathcal{X}\times \mathcal{Y}\rightarrow \mathbb{R}$ exists such that the underlying distribution $P_{\mathbf{X},Y,S}$ satisfies the following property on the ratio between conditional distributions $P_{\mathbf{X}|Y,S}$ and $P_{\mathbf{X}|S}$:
\begin{equation}\label{Eq: Assumption s-independent ratio}
    \forall \mathbf{x}\in\mathcal{X},y\in \mathcal{Y},s\in\mathcal{S}:\quad \frac{P\bigl(\mathbf{x}\, \big\vert\, y,s\bigr)}{P\bigl(\mathbf{x}\,\big\vert\, s\bigr)} \, = \, \phi\bigl(\mathbf{x},y\bigr).
\end{equation}
 Then, the conclusions in Theorems~\ref{Thm: Theorem 1 for DDP} and~\ref{Theorem 2: } will remain valid.  
\end{theorem}
\begin{proof}
    We defer the proof to the Appendix.
\end{proof}

\begin{remark}\label{Assumption: distribution}
Note that the assumption in the above theorem is equivalent to an $s$-independent ratio $\frac{P(\mathbf{x}|y,s)}{P(\mathbf{x}|s)}$. In particular, this assumption will hold if
the random vector $\mathbf{X}$ can be decomposed into $\bigl[g(S) , \widetilde{\mathbf{X}}\bigr]$, where $g$ is a deterministic function, and under the true distribution $p_{\mathbf{X},Y,S}$, $\widetilde{\mathbf{X}}$ satisfies $\widetilde{\mathbf{X}}\bot S$, i.e., is independent of $S$, and $\widetilde{\mathbf{X}}\bot S \,\big\vert\, Y$, i.e., $\widetilde{\mathbf{X}}$ is conditionally independent of $S$ given $Y$.   
\end{remark}
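
The sufficiency of this decomposition can be checked with a short sketch, written for discrete $\mathbf{X}$ (an assumption made here only for simplicity). For every $\mathbf{x}=\bigl[g(s),\widetilde{\mathbf{x}}\bigr]$, the first block of $\mathbf{X}$ is fixed given $S=s$, and hence
\begin{align*}
\frac{P(\mathbf{x}\vert y,s)}{P(\mathbf{x}\vert s)} \,=\, \frac{P(\widetilde{\mathbf{x}}\vert y,s)}{P(\widetilde{\mathbf{x}}\vert s)} \,=\, \frac{P(\widetilde{\mathbf{x}}\vert y)}{P(\widetilde{\mathbf{x}})},
\end{align*}
where the last equality uses $\widetilde{\mathbf{X}}\bot S \,\vert\, Y$ in the numerator and $\widetilde{\mathbf{X}}\bot S$ in the denominator. The right-hand side depends only on $(\mathbf{x},y)$ and can therefore serve as $\phi(\mathbf{x},y)$.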

Finally, we attempt to further relax the assumption in Theorem~\ref{Thm: Non-deterministic} to the case where the distribution ratio $P(\mathbf{x}|y,s)/P(\mathbf{x}|s)$ may not be completely independent of the outcome $S=s$. The next theorem quantifies the deviation from the assumption and how much it can impact the result.

\begin{theorem}\label{Theorem: Extended of Theorem 3}
Consider the setting of Theorem~\ref{Thm: Non-deterministic} and the formulation \eqref{Eq: Fair Classification 2}. We consider the TV-based dependence $\rho_{TV}(Y,S):=\mathbb{E}_{s\sim P_S}\bigl[\mathrm{TV}(P_{Y|S=s}, P_Y)\bigr] $ in the problem.
 Suppose for functions $\phi_L,\,\phi_U :\,\mathcal{X}\times\mathcal{Y} \rightarrow \mathbb{R}$, the following holds for every $\mathbf{x}\in\mathcal{X} , y\in\mathcal{Y}, s\in\mathcal{S}$:
\begin{equation*}
    \phi_L(\mathbf{x},y) \, \le  \, \frac{P\bigl(\mathbf{x} | y,s\bigr)}{P\bigl(\mathbf{x} | s\bigr)}  \, \le \, \phi_U(\mathbf{x},y).
\end{equation*}
Define $\Delta(\mathbf{x},y)=\phi_U(\mathbf{x},y)- \phi_L(\mathbf{x},y)$. Then, if $ \frac{\epsilon}{2} \ge \mathbb{E}_{P_X P_{Y|S=s_{\max}}}\bigl[\Delta(\mathbf{x},y) \bigr]$, %the conclusion of Theorem 1 will remain valid if we substitute $\epsilon$ in the theorem's bound with $\epsilon + 2\mathbb{E}_{P_X P_{Y|S=s_{\max}}}\bigl[\Delta(\mathbf{x},y) \bigr]$ or its upper-bound $2\epsilon$.
for the optimal $Q^*_{\widehat{Y}|\mathbf{X},S}$, $P_{\widehat{Y},\mathbf{X},S}=Q^*_{\widehat{Y}|\mathbf{X},S}\cdot P_{\mathbf{X},S}$ satisfies
\begin{align*}
    &\mathbb{E}_{s\sim P_S}\Bigl[\mathrm{TV}\Bigl( P_{\widehat{Y}|S=s} , P_{{Y}|S=s_{\max}}\Bigr) \Bigr] \, \le \, 2\epsilon\bigl(1+\frac{1}{2\delta}\bigr) %\\
    %\le\: &\bigl(1+\frac{1}{2\delta}\bigr)\Bigl(\epsilon + 2\mathbb{E}_{P_X \cdot P_{Y|S=s_{\max}}}\bigl[\Delta(\mathbf{x},y) \bigr]\Bigr)
\end{align*}
\end{theorem}
\begin{proof}
    We defer the proof to the Appendix.
\end{proof}


\section{A Distributionally Robust Optimization Approach to DP-based Fair Learning}
\begin{algorithm}[t]
    \caption{Sensitive Attribute-based Distributionally Robust Optimization (SA-DRO) Fair Learning Algorithm} 
    \label{algo:DROFair}
    \begin{algorithmic}[1]
    \STATE \textbf{Input:} Training data $\{(\mathbf{x}_i,y_i,s_i)_{i=1}^n\}$,   parameters $\lambda , \delta \ge 0$, divergence $d$, dependence measure $\rho$, stepsizes $\alpha_w,\alpha_q > 0$, running iterations $T>0$\vspace{1mm}
    \STATE \textbf{Initialize} classifier weight $\mathbf{w}$ and distribution $\mathbf{q}= \mathbf{p}_{s}$ \vspace{1mm}
        \FOR{$\text{t} \in \{ 1, \ldots , T \}$}\vspace{1mm} 
            \STATE Compute weight gradient of the classifier $f_{\mathbf{w}}$:\vspace{-2mm} 
            $$\mathbf{g}_{\mathbf{w}} = \sum_{i=1}^n\bigl[ \frac{q_{s_i}}{n}\nabla_{\mathbf{w}}\ell\bigl( f_{\mathbf{w}}(\mathbf{x}_i), y_i\bigr)\bigr] + \lambda \nabla_{\mathbf{w}}\rho\bigl(f_{\mathbf{w}}(\mathbf{X}),S\bigr)$$ \vspace{-3mm}
            \STATE Update ${\mathbf{w}}$ with gradient descent: ${\mathbf{w}} \leftarrow {\mathbf{w}} - \alpha_{w} \mathbf{g}_{\mathbf{w}}$ \vspace{1mm} 
            \STATE Compute the gradient of $q_{s}$ for every $s\in\mathcal{S}$:\vspace{-2mm} 
            $${g_{\mathbf{q}}}_s = \frac{1}{n}\sum_{i: s_i =s} \bigl[\ell\bigl( f_{\mathbf{w}}(\mathbf{x}_i), y_i\bigr)\bigr] + \lambda \frac{\partial \rho\bigl(f_{\mathbf{w}}(\mathbf{x}_{1:n}),s_{1:n}\bigr)}{\partial q_s}$$\vspace{-3mm} 
            \STATE Update $\mathbf{q}$ with projected gradient ascent:\\ $\mathbf{q} \leftarrow  \Pi_{\{\mathbf{q}:\: d(\mathbf{q},\mathbf{p}_s)\le\delta\} }\bigl(\mathbf{q} + \alpha_q  g_\mathbf{q}\bigr)$ \vspace{1mm}
        \ENDFOR
    \end{algorithmic}
\end{algorithm}

In this section, we propose a distributionally robust optimization method to reduce the biases of DP-based fair learning algorithms toward the majority sensitive attribute. As discussed before, optimizing the original risk function under the true distribution $p_{\mathbf{X},Y,S}$ leads to biases if a sensitive attribute outcome $s_{\max}$ occurs in considerably more than half of the samples. To shield the learning algorithm against such biases, we propose applying distributionally robust optimization (DRO) and consider the worst-case expected 0/1 loss over a distribution ball around the sensitive attribute distribution $p_S$ as the target in the learning problem. This approach leads to the \emph{sensitive attribute-based distributionally robust optimization (SA-DRO)} algorithm, which solves the following formulation of the fair learning problem with dependence metric $\rho(\widehat{Y},S)$:
\begin{align}\label{Eq: DRO Fair Classification}
   &\min_{f\in\mathcal{F}} \max_{Q_S: d(Q_S,P_S)\le \delta} \mathbb{E}_{P_{\mathbf{X},Y|S}\cdot Q_S}\Bigl[ \ell_{0/1}\bigl( \widehat{Y}, Y\bigr)\Bigr] + \lambda \rho\bigl(\widehat{Y},S\bigr)
\end{align}
According to this formulation, we solve the Lagrangian version of optimization problem \eqref{Eq: Fair Classification 1} when $S$'s marginal distribution $q_S$ leads to the worst-case fair-regularized risk function in a distribution ball $\bigl\{q_S:\, d(q_S,p_S)\le \delta \bigr\}$, where $d$ is a distance measure between probability distributions. In this formulation, we consider assigning different weights to samples with different sensitive attributes, which may result in different majority sensitive attributes. Since we are optimizing the worst-case performance over the distribution ball of radius $\delta$, the inductive biases discussed in the previous sections would become less effective under a greater $\delta$.



The proposed SA-DRO formulation results in Algorithm~\ref{algo:DROFair} which applies projected gradient descent ascent (GDA) to solve the minimax optimization problem in \eqref{Eq: DRO Fair Classification}. Here, we use a parameterized classifier $f_\mathbf{w}$ to apply a gradient-based training algorithm.
Also, the distance $d$ can be chosen as any standard $f$-divergence. In our experiments, we used the $\chi^2$-divergence, which has been well-explored in the literature \citep{namkoong2016stochastic,bertsimas2019adaptive,rahimian2019distributionally}. Furthermore, a Lagrangian form of the SA-DRO problem \eqref{Eq: DRO Fair Classification} can be considered where the DRO constraint on $Q_S$ is transferred to the inner maximization objective function as $-\zeta d(Q_S,P_S)$ for a Lagrangian coefficient $\zeta>0$.
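
For concreteness, the following displays sketch how the formulation and its Lagrangian form fit together. They are an illustration in the notation of \eqref{Eq: DRO Fair Classification}, and the precise form used in a given implementation may differ. Since only the marginal distribution of $S$ is perturbed, the worst-case risk is a reweighting of the group-conditional risks:
\begin{equation*}
    \mathbb{E}_{P_{\mathbf{X},Y|S}\cdot Q_S}\Bigl[ \ell_{0/1}\bigl( \widehat{Y}, Y\bigr)\Bigr] \, = \, \sum_{s\in\mathcal{S}} Q_S(s)\, \mathbb{E}_{P_{\mathbf{X},Y|S=s}}\Bigl[ \ell_{0/1}\bigl( \widehat{Y}, Y\bigr)\Bigr].
\end{equation*}
Moving the ball constraint into the objective then gives the Lagrangian form
\begin{equation*}
    \min_{f\in\mathcal{F}}\, \max_{Q_S}\;\; \mathbb{E}_{P_{\mathbf{X},Y|S}\cdot Q_S}\Bigl[ \ell_{0/1}\bigl( \widehat{Y}, Y\bigr)\Bigr] + \lambda \rho\bigl(\widehat{Y},S\bigr) - \zeta\, d(Q_S,P_S),
\end{equation*}
where, for the $\chi^2$-divergence, one standard choice is $d(Q_S,P_S) = \sum_{s\in\mathcal{S}} \bigl(Q_S(s)-P_S(s)\bigr)^2 / P_S(s)$.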


\section{Numerical Results}
\subsection{Experimental Setup}

\textbf{Datasets.} In our experiments, we used the following standard datasets in the machine learning literature: 
\begin{enumerate}[leftmargin=*]
   \item \textit{COMPAS} dataset with 12 features and a binary label on whether a subject recidivates within two years, where the sensitive attribute is the binary race feature\footnote{
   https://github.com/propublica/compas-analysis}. To simulate a setting with an imbalanced sensitive attribute distribution, we considered 2500 training and 750 test samples, in both of which 80\% of the samples are from $S=0$ ``non-Caucasian'' and 20\% of the samples are from $S=1$ ``Caucasian''.
   \item \textit{Adult} dataset with 64 binary features and a binary label indicating whether a person has more than 50K annual income. In this case, gender is considered as the sensitive attribute\footnote{https://archive.ics.uci.edu/dataset/2/adult}. In our experiments, we used 15k training and 5k test samples, where, to simulate an imbalanced distribution on the sensitive attribute, 80\% of the samples are male and 20\% of the samples are female.
   \item \textit{CelebA} dataset, proposed by \citet{liu2018large}, containing pictures of celebrities with 40 attribute annotations, where we considered ``gender'' as the binary label, and the sensitive attribute is the binary variable on blond/non-blond hair. In the experiments, we used 5k training samples and 2k test samples. To simulate an imbalanced sensitive attribute distribution, 80\% of both training and test samples are marked with blond hair and 20\% of the samples are marked with non-blond hair.
\end{enumerate}

\textbf{DP-based Learning Methods}: We performed the experiments using the following DP-based fair classification methods: 1) the DDP-based KDE method \citep{cho2020afair} and FACL \citep{mary2019fairness}, 2) the mutual information-based fair classifier \citep{cho2020bfair}, 3) the maximal correlation-based RFI classifier \citep{baharlouei2019r}, to learn binary classification models on the COMPAS and Adult datasets. For the CelebA experiments, we used the following two DP-based fair classification methods: the KDE method \citep{cho2020afair} and the mutual information (MI) fair classifier \citep{cho2020bfair}. 

In the experiments, we used both a logistic regression classifier with a linear prediction model and a neural net classifier. The neural net architecture was 1) for the COMPAS case, a multi-layer perceptron (MLP) with 2 hidden layers with 128 neurons per layer, 2) for the Adult case, an MLP with 4 hidden layers with 512 neurons per layer, 3) for the CelebA case, the ResNet-18 \citep{he2016deep} architecture suited for the image input in the experiments.

\textbf{Evaluation criteria}: To evaluate the trained models, we used the averaged accuracy rate (Acc) as the classification performance metric and the Difference of Demographic Parity (DDP) as the fairness metric. Moreover, to quantify the bias effects of fair learners, we measured the \emph{negative rate (NR) conditioned on a sensitive attribute}, defined as $\mathrm{NR}(s) := P\bigl(\widehat{Y}=0\mid S=s\bigr)$. This metric quantifies the variations in prediction outcomes across subgroups with different sensitive attribute values.
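
As an illustration of how this quantity can be estimated, consider a test set $\{(\mathbf{x}_i,y_i,s_i)\}_{i=1}^m$ with predictions $\widehat{y}_i$, where the symbols $m$ and $m_s$ are introduced here only for this sketch. Assuming empirical frequencies are used, a natural plug-in estimate is
\begin{equation*}
    \widehat{\mathrm{NR}}(s) \, = \, \frac{1}{m_s}\sum_{i:\, s_i=s} \mathbf{1}\bigl[\widehat{y}_i = 0\bigr], \qquad m_s = \bigl|\{i:\, s_i=s\}\bigr|.
\end{equation*}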


\subsection{Inductive Biases of Models trained in DP-based Fair Learning}

To numerically analyze the effects of DP-based fair classification algorithms, we varied the regularization penalty coefficient $\lambda$ over the range $[0,1]$. Note that $\lambda=0$ means an ERM setting with no fairness constraint, while $\lambda=1$ is the strongest fairness regularization coefficient over the range $[0,1]$.  

\begin{figure*}[htbp] 
\centering    
\includegraphics[width=2.0\columnwidth]{NN.png}
\caption{The first two columns show the trade-off between accuracy and DDP on the COMPAS and Adult datasets when applying NN-based fair classification methods, while the third column shows that the $\mathrm{NR}(s)$ of each subgroup $s\in\{0,1\}$ converges toward that of the majority sensitive attribute.} 
\label{NN}
\end{figure*}

\begin{figure*}[htbp] 
\centering    
\includegraphics[width=2.0\columnwidth]{LR.png}
\caption{The first two columns show the trade-off between accuracy and DDP on the COMPAS and Adult datasets when applying LR-based fair classification methods, while the third column shows that the $\mathrm{NR}(s)$ of each subgroup $s\in\{0,1\}$ converges toward that of the majority sensitive attribute.} 
\label{LR}
\end{figure*}

\begin{figure*}[htbp]
\centering
\includegraphics[width=1.8\columnwidth]{Celeba.png}
\caption{Both (a) and (b) show the trade-off between accuracy and DDP on the imbalanced CelebA dataset when applying the MI fair classification method, while (c) shows that the $\mathrm{NR}(s)$ of each subgroup converges to that of the majority, thus causing more discrimination against the minority group.}\vspace{-3mm}
\label{Resnet}
\end{figure*}

\begin{figure*}
\centering
\subfigure[ERM Classifier for CelebA] {    
\includegraphics[width=0.9\columnwidth]{CelebA_ERM.png}     
}
\subfigure[Fair Classifier for CelebA] { 
\includegraphics[width=0.9\columnwidth]{CelebA_Fair.png}     
} 
\caption{Blond hair samples (majority, upper) and non-blond hair samples (minority, lower) in the CelebA dataset, as predicted by ERM(NN) and MI, respectively. The results show negative rates of 57.3\% and 98.8\%, i.e., the fair model prefers to predict all minority samples as female, while maintaining almost the same level of accuracy over the whole population.}
\label{image}\vspace{-1mm}
\end{figure*}

\begin{figure*}
\centering  
\includegraphics[width=1.8\columnwidth]{DRO.png}
\caption{Accuracy, DDP, and $\mathrm{NR}(s)$ values attained by SA-DRO while varying the Lagrangian coefficient of the DRO regularization term on COMPAS (upper) and Adult (lower) datasets.}
\label{DRO}
\end{figure*}

As the evaluated accuracy and DDP values in 
Figure~\ref{NN} indicate, the DP-based fair learning algorithms managed to significantly reduce the DDP fairness violation while compromising less than 2\% in accuracy. On the other hand, the negative rates ($\widehat{Y}=0$ prediction rates) for the two outcomes $S=0,\, S=1$ of the sensitive attribute tended toward that of the majority sensitive attribute as the DP-based fairness regularization became stronger, suggesting that the conditional distribution of the prediction $\widehat{Y}$ given each sensitive attribute outcome $S=s$ moved closer to that of the majority sensitive attribute. The observed behavior held similarly for the linear logistic regression model in Figure~\ref{LR} and the ResNet-18 model on CelebA in Figure~\ref{Resnet}, and Figure~\ref{image} shows how the inductive biases lead to misclassification on the CelebA dataset.

\textbf{DRO-based Fair Learning.} We tested Algorithm~\ref{algo:DROFair}, which applies sensitive attribute-based distributionally robust optimization (SA-DRO) to DP-based fair learning algorithms. In our experiments, we applied the SA-DRO algorithm to the DDP-based KDE fair learning algorithm proposed by \citet{cho2020afair} and to RFI proposed by \citet{baharlouei2019r}. We fixed the fairness regularization penalty coefficient at $\lambda=0.9$. Following the commonly-used implementation of DRO, we used a Lagrangian penalty term $-\zeta d(Q_S , P_S)$ in the inner maximization problem to perform DRO. Therefore, the DRO regularization coefficient, i.e., the Lagrangian multiplier $\zeta$, can take any value in $\left[0, +\infty \right)$. In Table~\ref{DRO-table}, we set $\zeta=0.9$ for the SA-DRO case. The visualized results for various DRO regularization coefficients can be found in the Appendix.

\renewcommand{\arraystretch}{1.125}
\begin{table}
    \centering
    \caption{Numerical Results on COMPAS and Adult, non-DRO vs SA-DRO implementations. %for Neural-network based and Logistic-regression based method}
    }
    \resizebox{0.5\textwidth}{!}{
    \tabcolsep=0.16cm
    \begin{tabular}{clrrrr}
    \cmidrule[1.0pt]{2-6}

    &Method    & Acc($\uparrow$) & DDP($\downarrow$) & $\mathrm{NR}(s=0)$  & $\mathrm{NR}(s=1)$\\

    \cmidrule[1.0pt]{2-6}

    \parbox[t]{2mm}{\multirow{6}{*}{\rotatebox[origin=c]{90}{\text{\small{COMPAS}}}}}
    &ERM(NN) & 68.0\% & 0.287 & 46.0\% & 74.7\% \\
    &KDE & 66.8\% & 0.027 & 46.3\% & 49.0\% \\
    &KDE (SA-DRO) & 66.0\% & 0.009 & 61.6\% & 62.5\% \\
    \cmidrule[0.5pt]{2-6}
    &ERM(LR) & 67.5\% & 0.287 & 47.0\% & 74.5\% \\
    &RFI & 66.4\% & 0.021 & 48.1\% & 50.2\% \\
    &RFI (SA-DRO) & 65.4\% & 0.017 & 59.3\% & 61.0\% \\

    \cmidrule[1.0pt]{2-6}
    \parbox[t]{2mm}{\multirow{6}{*}{\rotatebox[origin=c]{90}
    {\text{\small{Adult}}}}}
        &ERM(NN)  & 85.1\% & 0.183 & 92.3\% & 74.0\%\\
        &KDE & 83.2\% & 0.023 & 77.3\% & 75.0\% \\
        &KDE (SA-DRO) & 82.5\% & 0.012 & 84.6\% & 83.4\% \\
        \cmidrule[0.5pt]{2-6}
        &ERM(LR) & 82.0\% & 0.189 & 88.1\% & 67.5\% \\
        &RFI & 80.6\% & 0.019 & 69.8\% & 67.9\% \\
        &RFI (SA-DRO) & 80.1\% & 0.021 & 78.3\% & 76.2\% \\
    
    \cmidrule[1.0pt]{2-6}
    \end{tabular}}
    % \vspace{-0.5cm}
    \label{DRO-table}
     \vspace{-3mm}
\end{table}

As Table \ref{DRO-table} shows, the proposed SA-DRO reduces the tendency of the fair learning algorithm toward the majority sensitive attribute, and the resulting negative prediction rates conditioned on the sensitive attribute outcomes became closer to the midpoint between the majority and minority negative rates of the ERM classifier. On the other hand, the SA-DRO-based algorithms still achieve a low DDP value, while the accuracy drop relative to the corresponding non-DRO fair classifier is at most 1\%.
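
As a quick arithmetic check on the NN rows of Table~\ref{DRO-table}, taking the midpoint between the two ERM negative rates (our reading of the reference point), we obtain
\begin{equation*}
    \tfrac{1}{2}\bigl(46.0\% + 74.7\%\bigr) = 60.35\% \;\; \text{(COMPAS)}, \qquad \tfrac{1}{2}\bigl(92.3\% + 74.0\%\bigr) = 83.15\% \;\; \text{(Adult)}.
\end{equation*}
On COMPAS, KDE (SA-DRO) attains negative rates of 61.6\% and 62.5\%, near the 60.35\% midpoint, whereas KDE without SA-DRO attains 46.3\% and 49.0\%, near the 46.0\% rate of the majority group $S=0$. On Adult, KDE (SA-DRO) attains 84.6\% and 83.4\%, near the 83.15\% midpoint, whereas KDE without SA-DRO attains 77.3\% and 75.0\%, near the ERM rate of 74.0\%.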

In Figure~\ref{DRO}, we visualize the results of applying the SA-DRO algorithm to the DP-based KDE method of \citet{cho2020afair} and RFI of \citet{baharlouei2019r} for various DRO coefficients. We fixed the fairness regularization penalty coefficient at $\lambda=0.9$, and the DRO regularization coefficient ranged over $\left[0,1\right]$. Figure~\ref{DRO} shows that the accuracy and DDP, over the whole population and within the subgroups, are only slightly affected, while the $\mathrm{NR}(s)$ values of the subgroups shift from that of the majority group toward the midpoint between the minority and majority groups, which reduces the inductive biases.

\subsection{DP-based Fair Classification in Heterogeneous Federated Learning}

\begin{figure*}[t] 
\centering    
\subfigure{  
\includegraphics[width=0.83\columnwidth]{figure1_1.png}
}
\centering
\subfigure{     
\includegraphics[width=1.16\columnwidth]{figure1_2.png}     
}
\caption{Biases of DP-based learning algorithms in federated learning with heterogeneous sensitive attribute distributions: 80\% of the training data in Client 1 comes from the minority subgroup (female) of the entire network, while the other clients have 20\% of their data from the minority subgroup. The DP-based KDE fair federated learning algorithm led to a significantly lower accuracy for Client 1 compared to the test accuracy of Client 1's locally (non-federated) trained model.}
\label{Fig: Intro-fig}       
\end{figure*}

To numerically show the implications of the inductive biases of DP-based fair learning algorithms, we simulated a heterogeneous federated learning setting with multiple clients where the sensitive attribute has different distributions across clients. To do this, we split the Adult dataset into 4 subsets of 3k samples, distributed among 4 clients in the federated learning setting. While 80\% of the training data of Client 1 (the minority subgroup in the network) were female samples, only 20\% of the training data of Clients 2-4 were female samples. We used the same male/female proportions to assign 750 test samples to the clients. 

For the baseline federated learning method with no fairness regularization, we utilized the FedAvg algorithm \citep{mcmahan2017communication}. For the DP-based fair federated learning algorithms, we used the DDP-based KDE and FACL algorithms, which result in single-level optimization problems and hence can be optimized in a distributed learning setting by averaging as in FedAvg. We refer to the extended federated learning versions of these algorithms as FedKDE and FedFACL. We also tested our SA-DRO implementations of FedKDE and FedFACL, as well as the localized ERM, KDE, and FACL models, where each client trained a separate model only on her own data. 

To show the impacts of such inductive biases in practice, we focused on a setting with heterogeneous sensitive attribute distributions across clients where the clients' majority sensitive attribute outcome may not agree. Figure~\ref{Fig: Intro-fig} illustrates such a federated learning scenario over the Adult dataset, where Client 1's majority sensitive attribute (female samples) is different from the network's majority group (male samples). In the experiment, Client 1's test accuracy with a DP-based fair federated learning was significantly lower than the test accuracy of a locally-trained fair model learned only on Client 1's data. Such numerical results suggest the possibility of the minority clients' lack of incentive to participate in the fair federated learning process. 

To test our proposed DRO approach, we applied the SA-DRO method. As our numerical results in Table \ref{table-Adult} indicate, the inductive biases of DP-based federated learning could considerably lower the accuracy of Client~1, whose majority sensitive attribute differs from that of the other clients. The resulting accuracy was lower than that of Client 1's locally trained fair model, obtained without any collaboration with the other clients, which may affect the client's incentive to participate in the federated learning process. On the other hand, the SA-DRO implementations of the KDE and FACL methods achieved an accuracy for Client~1 on par with or better than that of Client 1's local fair model, while largely preserving the accuracy for the majority clients and keeping the DDP below 0.05. We found similar results in the CelebA federated learning experiments, as in Table~\ref{table-celeba} in the Appendix.
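
To make the comparison concrete, the Client~1 accuracy gaps relative to the locally trained fair models can be read off Table~\ref{table-Adult} as follows (simple differences of the reported entries, in percentage points):
\begin{align*}
    \text{FedKDE} - \text{KDE(Local)} &= 74.8 - 79.0 = -4.2, & \text{SA-DRO-FedKDE} - \text{KDE(Local)} &= 79.3 - 79.0 = +0.3, \\
    \text{FedFACL} - \text{FACL(Local)} &= 74.5 - 79.1 = -4.6, & \text{SA-DRO-FedFACL} - \text{FACL(Local)} &= 79.0 - 79.1 = -0.1.
\end{align*}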


\begin{table}[t]
\centering
    \caption{Accuracy and DDP on Adult dataset}
    \resizebox{0.5\textwidth}{!}{
    \begin{tabular}{@{}lcccc@{}}
        \toprule
        & \multicolumn{2}{c}{\textbf{Client 1 (Minority)}} & \multicolumn{2}{c}{\textbf{Clients 2-4 (Majority)}}\\
        \cmidrule(l){2-3}
        \cmidrule(l){4-5} 
        & Acc($\uparrow$) & DDP($\downarrow$) & Acc($\uparrow$) & DDP($\downarrow$)\\
        \midrule
        FedAvg & 82.5\% & 0.208 & 90.3\% & 0.206 \\
        ERM(Local) & 81.6\% & 0.203 & 89.0\% & 0.246 \\
        \midrule
        FedKDE & 74.8\% & 0.022 & 89.9\% & 0.029 \\
        FedFACL & 74.5\% & 0.014 & 89.7\% & 0.031 \\
        \midrule
        \textbf{SA-DRO-FedKDE} & \textbf{79.3\%} & 0.041 & 89.6\% & 0.042\\
        \textbf{SA-DRO-FedFACL} & \textbf{79.0\%} & 0.049 & 89.1\% & 0.036\\
        \midrule
        KDE(Local) & 79.0\% & 0.032 & 88.2\% & 0.014 \\
        FACL(Local) & 79.1\% & 0.025 & 88.6\% & 0.017  \\
        \bottomrule
    \end{tabular}
    }
    \label{table-Adult}
\end{table}



\section{Conclusion}
In this work, we attempted to demonstrate the inductive biases of in-processing fair learning algorithms aiming to achieve demographic parity (DP). We also proposed a distributionally robust optimization scheme to reduce the biases toward the majority sensitive attribute. An interesting future direction to our work is to search for similar biases in pre-processing and post-processing fair learning methods. Also, the theoretical comparison between different dependence measures such as mutual information, Pearson correlation, and the maximal correlation on the inductive bias levels will be an interesting topic for future exploration. Finally, characterizing the trade-off between accuracy, fairness violation, and biases toward the majority subgroups will help to better understand the costs of DP-based fair learning.

\section*{Limitations and Broader Impact}
Our theoretical analysis focuses on the total variation loss, which can limit its application to other popular loss functions in statistical learning, e.g. the cross entropy loss. Extending the analytical findings on the inductive biases of fair learning algorithms to other loss functions will be a future direction. Also, we clarify that due to the relatively high dimensions of the datasets in our numerical experiments, we were unable to validate the assumption in Theorems \ref{Thm: Non-deterministic}, \ref{Theorem: Extended of Theorem 3} in the experiments. However, we empirically observed the inductive bias effects as explained in the text.  

Finally, in our numerical analysis of fair learning algorithms, we utilized well-known datasets in the fairness literature, including Adult, COMPAS, and CelebA. We note that our numerical analysis only concerned the characteristics of fair learning algorithms and did not attempt to draw any conclusions about the nature of data distribution in these datasets. The COMPAS dataset has been critically analyzed in the machine learning literature \citep{washington2018argue,bao2021s}, and the connections between the specific dataset and inductive biases of fair learning algorithms will be interesting for future studies.

\section*{Acknowledgments}

The work of Farzan Farnia is partially supported by a grant from the
Research Grants Council of the Hong Kong Special Administrative Region, China, Project 14209920, and is partially supported by a CUHK Direct Research Grant with CUHK Project No. 4055164. The work of Amin Gohari is supported by the CUHK Direct Research Grant No. 4055193. Finally, the authors would like to thank the anonymous reviewers for their constructive feedback and helpful suggestions.

\bibliography{uai2024-template}

\clearpage
\allowdisplaybreaks

\onecolumn
\section{Appendix}
\subsection{Proofs}
\subsubsection{Proof of Theorem \ref{Thm: Theorem 1 for DDP}}
First, we note the following optimal transport-based formulation of the total variation distance between two distributions $P$ and $Q$:
\begin{equation*}
    \mathrm{TV}(P,Q) \, = \, \inf_{ \substack{M_{\hat{Y},Y}:\: M_{\hat{Y}}=P\\ \qquad\;\;\, M_{Y}=Q}} \; \; \mathbb{E}_M\Bigl[ \ell_{0/1}\bigl(\widehat{Y}, Y\bigr)\Bigr].
\end{equation*}
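As an illustrative special case, which is not needed for the proof, consider binary $\mathcal{Y}=\{0,1\}$ with $P(1)=p$ and $Q(1)=q$. Then $\mathrm{TV}(P,Q)=|p-q|$, and the infimum is attained by the coupling that places as much mass as possible on the diagonal:
\begin{equation*}
    M(1,1) = \min\{p,q\}, \qquad M(0,0) = 1-\max\{p,q\}, \qquad M\bigl(\widehat{Y}\neq Y\bigr) = 1 - \min\{p,q\} - \bigl(1-\max\{p,q\}\bigr) = |p-q|,
\end{equation*}
so that $\mathbb{E}_M\bigl[ \ell_{0/1}(\widehat{Y}, Y)\bigr] = |p-q|$ under this coupling.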
Therefore, for the objective function in Equation \eqref{Eq: Fair Classification 1}, we can write the following:
\begin{align*}
    \mathbb{E}_{P_{\mathbf{X,Y,S,\widehat{Y}}}}\Bigl[\ell_{0/1}\bigl(\widehat{Y},Y \bigr)\Bigr] \, &\stackrel{(a)}{=} \, \mathbb{E}_{P_{\mathbf{S}}}\Bigl[\mathbb{E}_{P_{\widehat{Y},Y,\mathbf{X}|S}}\Bigl[\ell_{0/1}\bigl(\widehat{Y},Y \bigr)|S=s\Bigr]\Bigr] \\
    &= \, \mathbb{E}_{P_{\mathbf{S}}}\Bigl[\mathbb{E}_{P_{\widehat{Y},Y|S}}\Bigl[\ell_{0/1}\bigl(\widehat{Y},Y \bigr)|S=s\Bigr]\Bigr] \\
    &\stackrel{(b)}{\ge} \, \mathbb{E}_{P_{\mathbf{S}}}\Bigl[\mathrm{TV}\Bigl( P_{\widehat{Y}|S=s} , P_{Y|S=s}\Bigr)\Bigr].
\end{align*}
Here, (a) follows from the tower property of expectation. Also, (b) is a corollary of the optimal transport formulation of the TV-distance. 
On the other hand, the constraint in \eqref{Eq: Fair Classification 1} states that $\mathrm{DDP}(\widehat{Y},S) \le \epsilon$, implying
\begin{align*}
    \mathbb{E}_{P_{\mathbf{S}}}\Bigl[\mathrm{TV}\Bigl( P_{\widehat{Y}|S=s} , P_{\widehat{Y}}\Bigr)\Bigr] \, &=\, \sum_{s\in\mathcal{S}} P_S(s)\mathrm{TV}\Bigl( P_{\widehat{Y}|S=s} , P_{\widehat{Y}}\Bigr) \\
    &\le\, \sum_{s\in\mathcal{S}} \mathrm{TV}\Bigl( P_{\widehat{Y}|S=s} , P_{\widehat{Y}}\Bigr) \\
    &=\, \frac{1}{2}\mathrm{DDP}(\widehat{Y},S) \\
    &\le \, \frac{\epsilon}{2}.
\end{align*}
Knowing that $\mathrm{TV}$ is a metric distance satisfying the triangle inequality, the above equations show that
\begin{align*}   \mathbb{E}_{P_{\mathbf{X,Y,S,\widehat{Y}}}}\Bigl[\ell_{0/1}\bigl(\widehat{Y},Y \bigr)\Bigr] &\ge \mathbb{E}_{P_{\mathbf{S}}}\Bigl[\mathrm{TV}\Bigl( P_{\widehat{Y}|S=s} , P_{Y|S=s}\Bigr)\Bigr] \\
    &\stackrel{(c)}{\ge} \mathbb{E}_{P_{\mathbf{S}}}\Bigl[ \mathrm{TV}\Bigl( P_{Y|S=s}, P_{\widehat{Y}} \Bigr) - \mathrm{TV}\Bigl( P_{\widehat{Y}|S=s} , P_{\widehat{Y}}\Bigr)  \Bigr] \\
    &= \mathbb{E}_{P_{\mathbf{S}}}\Bigl[ \mathrm{TV}\Bigl( P_{Y|S=s}, P_{\widehat{Y}} \Bigr)\Bigr]  - \mathbb{E}_{P_{\mathbf{S}}}\Bigl[\mathrm{TV}\Bigl( P_{\widehat{Y}|S=s} , P_{\widehat{Y}}\Bigr)  \Bigr]
    \\ &\ge \mathbb{E}_{P_{\mathbf{S}}}\Bigl[ \mathrm{TV}\Bigl( P_{Y|S=s}, P_{\widehat{Y}} \Bigr)\Bigr] - \frac{\epsilon}{2},
\end{align*}
where (c) follows from the triangle inequality for TV-distance. 
Considering the above inequality which holds for every feasible distribution $P_{\widehat{Y}|Y,S}$ satisfying the DDP constraint, we focus on the following specific selection of $Q_{\widehat{Y} | \mathbf{X},Y,S}$. Here we suppose $Q_{\widehat{Y} |S=s} = P_{Y|S=s_{\max}}$ for every $s\in\mathcal{S}$. To find the joint distribution $Q^*_{\widehat{Y},Y |S=s}$ we consider the optimal solution to the following TV-based optimal transport problem for every $s\in\mathcal{S}$
\begin{equation*}
    Q^*_{\widehat{Y},Y |S=s}\, := \, \underset{ \substack{M_{\hat{Y},Y}:\: M_{\hat{Y}}=P_{Y|S=s_{\max}}\\ \qquad\;\;\, M_{Y}=P_{Y|S=s}}}{\arg\!\min} \; \; \mathbb{E}_M\Bigl[ \ell_{0/1}\bigl(\widehat{Y}, Y\bigr)\Bigr].
\end{equation*}
Note that given the above selection of $Q^*_{\widehat{Y},Y |S}$ and $Q^*_{\widehat{Y}|Y,S} = Q^*_{\widehat{Y},Y |S} / P_{Y|S}$, we can define the joint distribution $Q_{\mathbf{X},Y,S,\widehat{Y}} := P_{Y,S}\cdot P_{\mathbf{X}|Y,S} Q^*_{\widehat{Y}|Y,S}$ under which $\mathbf{X} \bot \widehat{Y} | Y,S$. Also, under the defined distribution $Q$, $\widehat{Y}$ and $S$ are independent, and we have
\begin{equation*}
    \mathbb{E}_{Q_{\widehat{Y},Y ,S}}\Bigl[\ell_{0/1}\bigl(\widehat{Y},Y\bigr)\Bigr] = \mathbb{E}_{P_{\mathbf{S}}}\Bigl[ \mathrm{TV}\Bigl( P_{Y|S=s}, P_{Y|S=s_{\max}}\Bigr)\Bigr].
\end{equation*}
Therefore, since $\mathbf{X} \bot \widehat{Y} | Y,S$ and $Y=h(\mathbf{X},S)$ is supposed to be a function of $(\mathbf{X},S)$, we will further have
\begin{equation*}
    \mathbb{E}_{Q_{\widehat{Y},\mathbf{X} ,S}}\Bigl[\ell_{0/1}\bigl(\widehat{Y},Y\bigr)\Bigr] = \mathbb{E}_{P_{\mathbf{S}}}\Bigl[ \mathrm{TV}\Bigl( P_{Y|S=s}, P_{Y|S=s_{\max}}\Bigr)\Bigr].
\end{equation*}
Since $Q_{\widehat{Y}|\mathbf{X},S}$ is a feasible conditional distribution in the optimization problem \eqref{Eq: Fair Classification 1}, we will have
\begin{equation*}
    \mathbb{E}_{P_{\mathbf{S}}}\Bigl[ \mathrm{TV}\Bigl( P_{Y|S=s}, P_{\widehat{Y}} \Bigr)\Bigr] - \frac{\epsilon}{2} \, \le \,\mathbb{E}_{Q_{\widehat{Y},\mathbf{X} ,S}}\Bigl[\ell_{0/1}\bigl(\widehat{Y},Y\bigr)\Bigr] \, = \, \mathbb{E}_{P_{\mathbf{S}}}\Bigl[ \mathrm{TV}\Bigl( P_{Y|S=s}, P_{Y|S=s_{\max}}\Bigr)\Bigr].
\end{equation*}
Therefore,
\begin{align*}
    \frac{\epsilon}{2} \, &\ge \, \mathbb{E}_{P_{\mathbf{S}}}\Bigl[ \mathrm{TV}\Bigl( P_{Y|S=s}, P_{\widehat{Y}} \Bigr)\Bigr] - \mathbb{E}_{P_{\mathbf{S}}}\Bigl[ \mathrm{TV}\Bigl( P_{Y|S=s}, P_{Y|S=s_{\max}}\Bigr)\Bigr] \\
    & = \, \sum_{s\in\mathcal{S}} P_S(s)\biggl(\mathrm{TV}\Bigl( P_{Y|S=s}, P_{\widehat{Y}} \Bigr) - \mathrm{TV}\Bigl( P_{Y|S=s}, P_{Y|S=s_{\max}}\Bigr) \biggr) \\
    & = \, P_S(s_{\max})\biggl(\mathrm{TV}\Bigl( P_{Y|S=s_{\max}}, P_{\widehat{Y}}\Bigr)-\mathrm{TV}\Bigl( P_{Y|S=s_{\max}}, P_{Y|S=s_{\max}}\Bigr) \biggr) \\
    &\quad + \sum_{s\neq s_{\max} } P_S(s)\biggl(\mathrm{TV}\Bigl( P_{Y|S=s}, P_{\widehat{Y}} \Bigr) - \mathrm{TV}\Bigl( P_{Y|S=s}, P_{Y|S=s_{\max}}\Bigr) \biggr) \\
    &  = \, (\frac{1}{2}+\delta)\mathrm{TV}\Bigl( P_{Y|S=s_{\max}}, P_{\widehat{Y}}\Bigr) \\
    &\quad + \sum_{s\neq s_{\max} } P_S(s)\biggl(\mathrm{TV}\Bigl( P_{Y|S=s}, P_{\widehat{Y}} \Bigr) - \mathrm{TV}\Bigl( P_{Y|S=s}, P_{Y|S=s_{\max}}\Bigr) \biggr) \\
    & \stackrel{(d)}{\ge} \, (\frac{1}{2}+\delta)\mathrm{TV}\Bigl( P_{Y|S=s_{\max}}, P_{\widehat{Y}}\Bigr) - \sum_{s\neq s_{\max} } P_S(s)\mathrm{TV}\Bigl( P_{Y|S=s_{\max}}, P_{\widehat{Y}} \Bigr) \\
    & \stackrel{(e)}{=} \, (\frac{1}{2}+\delta)\mathrm{TV}\Bigl( P_{Y|S=s_{\max}}, P_{\widehat{Y}}\Bigr) - (\frac{1}{2}-\delta)\mathrm{TV}\Bigl( P_{Y|S=s_{\max}}, P_{\widehat{Y}} \Bigr) \\
    & = 2\delta\mathrm{TV}\Bigl( P_{Y|S=s_{\max}}, P_{\widehat{Y}}\Bigr).
\end{align*}
In the above, (d) comes from the triangle inequality for TV-distance, and $(e)$ holds because $\sum_{s\neq s_{\max} } P_S(s) = 1 -P_S(s_{\max})= \frac{1}{2}-\delta$.  
The above inequality shows that $\mathrm{TV}\bigl( P_{Y|S=s_{\max}}, P_{\widehat{Y}}\bigr)\le \frac{\epsilon}{4\delta}$. We combine this inequality with the DDP constraint, which shows for every $s\in\mathcal{S}$
\begin{align*}
    \mathrm{TV}\bigl( P_{Y|S=s_{\max}}, P_{\widehat{Y}|S=s}\bigr) \, &\le \, \mathrm{TV}\bigl( P_{Y|S=s_{\max}},P_{\widehat{Y}}\bigr) + \mathrm{TV}\bigl( P_{\widehat{Y}},P_{\widehat{Y}|S=s}\bigr) \\
    &\stackrel{(f)}{\le} \frac{\epsilon}{4\delta} + \frac{\epsilon}{2} \\
    &= \epsilon \bigl(\frac{1}{2} +\frac{1}{4\delta} \bigr).
\end{align*}
In the above, note that $(f)$ holds because  $\mathrm{TV}\bigl( P_{\widehat{Y}},P_{\widehat{Y}|S=s}\bigr)\le \frac{1}{2}\mathrm{DDP}(\widehat{Y},S) \le \frac{\epsilon}{2}$ according to the optimization constraint. Therefore, the proof is complete.
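
As a numerical illustration with hypothetical values, which are not taken from the experiments, suppose $P_S(s_{\max}) = 0.8$, i.e., $\delta = 0.3$, and $\epsilon = 0.05$. Then
\begin{equation*}
    \frac{\epsilon}{4\delta} = \frac{0.05}{1.2} \approx 0.0417, \qquad \epsilon \bigl(\frac{1}{2} +\frac{1}{4\delta} \bigr) \approx 0.025 + 0.0417 = 0.0667,
\end{equation*}
so that $\mathrm{TV}\bigl( P_{Y|S=s_{\max}}, P_{\widehat{Y}|S=s}\bigr) \le 0.067$ for every $s\in\mathcal{S}$. Since the bound scales as $\frac{\epsilon}{4\delta}$ for small $\delta$, the guarantee weakens as the majority margin $\delta$ shrinks.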

\subsubsection{Proof of Theorem \ref{Theorem 2: }}
We first review the implications of Pinsker's inequality in the cases of mutual information and $\chi^2$-divergence.
\begin{lemma}[Pinsker's inequality for mutual information]\label{lemma: mutual information}
For every pair of random variables $Y,S$, we have
\begin{equation*}
    I(Y;S) \ge 2\log(e) \mathbb{E}_{S}\Bigl[ \mathrm{TV}\bigl(P_{Y|S=s} , P_Y \bigr) \Bigr]^2.
\end{equation*}
\end{lemma}
\begin{proof}
Note that Pinsker's inequality implies that for every outcome $s\in\mathcal{S}$, we have
\begin{equation*}
    2\log(e)\,\mathrm{TV}\bigl(P_{Y|S=s} , P_Y \bigr)^2  \le D_{KL}\bigl(P_{Y|S=s} , P_Y \bigr)
\end{equation*}
Since $g(t) = 2\log(e)\,  t^2$ is a convex function, Jensen's inequality implies that
\begin{align*}
    2\log(e) \mathbb{E}_{S}\Bigl[ \mathrm{TV}\bigl(P_{Y|S=s} , P_Y \bigr) \Bigr]^2 \, &\le \, \mathbb{E}_{S}\Bigl[ 2\log(e)\mathrm{TV}^2\bigl(P_{Y|S=s} , P_Y \bigr) \Bigr] \\
    &\le \, \mathbb{E}_{S}\Bigl[ D_{KL}\bigl(P_{Y|S=s} , P_Y \bigr) \Bigr] \\
    &= \, I(Y;S).
 \end{align*}
 Therefore, the proof is complete.
\end{proof}

\begin{lemma}[Pinsker's inequality for $\chi^2$-divergence-based $f$-mutual-information]\label{lemma: chi-sqaured mutual information}
For every pair of random variables $Y,S$, we have the following for function $h(t) = t^2$ where $|t|\le 1$ and $h(t)=2t-1$ where $t\ge 1$.
\begin{equation*}
    \chi^2 \bigl(P_{Y,S}, P_Y\times P_S\bigr) \ge h\Bigl(2 \mathbb{E}_{S}\Bigl[ \mathrm{TV}\bigl(P_{Y|S=s} , P_Y \bigr) \Bigr]\Bigr)
\end{equation*}
\end{lemma}
\begin{proof}
Note that Pinsker's inequality for the $\chi^2$-divergence \citep{gilardoni2006minimum} implies that for every outcome $s\in\mathcal{S}$, we have
\begin{equation*}
    h\Bigl(2\mathrm{TV}\bigl(P_{Y|S=s} , P_Y \bigr)\Bigr)  \le \chi^2\bigl(P_{Y|S=s} , P_Y \bigr).
\end{equation*}
Since $h$ is a convex function, Jensen's inequality implies that
\begin{align*}
    h\Bigl( 2\mathbb{E}_{S}\Bigl[ \mathrm{TV}\bigl(P_{Y|S=s} , P_Y \bigr) \Bigr]\Bigr) \, &\le \, \mathbb{E}_{S}\Bigl[ h\Bigl(2\mathrm{TV}\bigl(P_{Y|S=s} , P_Y \bigr)\Bigr) \Bigr] \\
    &\le \, \mathbb{E}_{S}\Bigl[ \chi^2\bigl(P_{Y|S=s} , P_Y \bigr) \Bigr] \\
    &= \, \chi^2 \bigl(P_{Y,S}, P_Y\times P_S\bigr).
 \end{align*}
 Hence, the proof is complete.
\end{proof}
\begin{lemma}\label{lemma: maximal correlation}
For every pair of random variables $Y,S$, we have the following for function $h(t) = t^2$ where $|t|\le 1$ and $h(t)=2t-1$ where $t\ge 1$, and constant $r=\min\{\vert \mathcal{S}\vert, \vert \mathcal{Y}\vert \} - 1$: 
\begin{equation*}
    r\rho_m \bigl(Y, S\bigr) \ge h\Bigl(2 \mathbb{E}_{S}\Bigl[ \mathrm{TV}\bigl(P_{Y|S=s} , P_Y \bigr) \Bigr]\Bigr)
\end{equation*}
\end{lemma}
\begin{proof}
The proof follows directly from Lemma \ref{lemma: chi-sqaured mutual information}, noting the following relationship between the maximal correlation $\rho_m \bigl(Y, S\bigr)$ and the Pearson $\chi^2$-divergence-based $f$-mutual information \citep{asoodeh2015maximal}:
\begin{equation*}
   r\rho_m \bigl(Y, S\bigr) \ge  \chi^2 \bigl(P_{Y,S}, P_Y\times P_S\bigr).
\end{equation*}
\end{proof}

\textbf{Proof for the mutual information case.} Given the mutual information constraint $I(\widehat{Y};S) \le \epsilon$ in \eqref{Eq: Fair Classification 1}, we can apply Lemma~\ref{lemma: mutual information} to the pair $(\widehat{Y},S)$, which shows
\begin{align*}
    &2\log(e)\mathbb{E}_{S}\Bigl[ \mathrm{TV}\bigl(P_{\widehat{Y}|S=s} , P_{\widehat{Y}} \bigr) \Bigr]^2 \le \epsilon \\
    \Rightarrow \quad & \mathbb{E}_{S}\Bigl[ \mathrm{TV}\bigl(P_{\widehat{Y}|S=s} , P_{\widehat{Y}} \bigr) \Bigr] \le \sqrt{\frac{\epsilon}{2\log(e)}}.
\end{align*}
Note that we can follow the same proof of Theorem~\ref{Thm: Theorem 1 for DDP}, which remains valid if we change $\mathrm{DDP}(\widehat{Y},S)$ to $\rho_{TV}(\widehat{Y},S):=\mathbb{E}_{S}\bigl[ \mathrm{TV}\bigl(P_{\widehat{Y}|S=s} , P_{\widehat{Y}} \bigr) \bigr] $, to prove the following statement:
\begin{equation*}
    \mathbb{E}_{S}\Bigl[\mathrm{TV}\bigl( P_{Y|S=s_{\max}}, P_{\widehat{Y}|S=s}\bigr)\Bigr] \, \le \, \sqrt{\frac{\epsilon}{2\log(e)}}\bigl( 1 + \frac{1}{2\delta}\bigr) \, =\, \sqrt{\frac{2\epsilon}{\log(e)}}\bigl( \frac{1}{2} + \frac{1}{4\delta}\bigr). 
\end{equation*}

\textbf{Proof for the ERMI case $\rho_E$.} Given the constraint $\rho_E(\widehat{Y},S) \le \epsilon$ in \eqref{Eq: Fair Classification 1}, we can apply Lemma~\ref{lemma: chi-sqaured mutual information} to the pair $(\widehat{Y},S)$, which shows
\begin{align*}
   &h\Bigl( 2\mathbb{E}_{S}\Bigl[ \mathrm{TV}\bigl(P_{\widehat{Y}|S=s} , P_{\widehat{Y}} \bigr)\Bigr] \Bigr) \le \epsilon \\
    \Rightarrow \quad & 2\mathbb{E}_{S}\Bigl[ \mathrm{TV}\bigl(P_{\widehat{Y}|S=s} , P_{\widehat{Y}} \bigr) \Bigr] \le \max\{\epsilon, \sqrt{\epsilon}\}.
\end{align*}
In the above, we use the fact that the inverse function of $h(t)$ over $t\ge 0$ satisfies $h^{-1}(t) \le \max\{t,\sqrt{t}\}$ which is a strictly increasing function. As a result, we can follow the same proof of Theorem~\ref{Thm: Theorem 1 for DDP}, which remains valid if we change $\mathrm{DDP}(\widehat{Y},S)$ to $\rho_{TV}(\widehat{Y},S):=\mathbb{E}_{S}\bigl[ \mathrm{TV}\bigl(P_{\widehat{Y}|S=s} , P_{\widehat{Y}} \bigr) \bigr] $, to show the following:
\begin{equation*}
    \mathbb{E}_{S}\Bigl[\mathrm{TV}\bigl( P_{Y|S=s_{\max}}, P_{\widehat{Y}|S=s}\bigr)\Bigr] \, \le \, \max\{\epsilon,\sqrt{\epsilon}\}\bigl( \frac{1}{2} + \frac{1}{4\delta}\bigr). 
\end{equation*}
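The bound on $h^{-1}$ used above can be verified case by case; the following short derivation is included as a sketch. Since $h$ is continuous and strictly increasing on $[0,\infty)$ with $h(1)=1$, its inverse on $[0,\infty)$ is
\begin{equation*}
    h^{-1}(u) \, = \, \begin{cases} \sqrt{u}, & 0\le u\le 1, \\ \dfrac{u+1}{2}, & u\ge 1. \end{cases}
\end{equation*}
For $0\le u\le 1$ we have $\sqrt{u}\ge u$, so $h^{-1}(u)=\sqrt{u}=\max\{u,\sqrt{u}\}$. For $u\ge 1$ we have $\frac{u+1}{2}\le u=\max\{u,\sqrt{u}\}$. In both cases, $h^{-1}(u)\le\max\{u,\sqrt{u}\}$.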

\textbf{Proof for the maximal correlation case $\rho_m$.} Assuming the constraint $\rho_m(\widehat{Y},S) \le \epsilon$ in \eqref{Eq: Fair Classification 1}, we can apply Lemma~\ref{lemma: maximal correlation} to the pair $(\widehat{Y},S)$, showing that
\begin{align*}
   &h\Bigl( 2\mathbb{E}_{S}\Bigl[ \mathrm{TV}\bigl(P_{\widehat{Y}|S=s} , P_{\widehat{Y}} \bigr) \Bigr]\Bigr) \le r\epsilon \\
    \Rightarrow \quad & 2\mathbb{E}_{S}\Bigl[ \mathrm{TV}\bigl(P_{\widehat{Y}|S=s} , P_{\widehat{Y}} \bigr) \Bigr] \le \max\{r\epsilon, \sqrt{r\epsilon}\}.
\end{align*}
As a result, we use the same proof of Theorem~\ref{Thm: Theorem 1 for DDP}, which remains valid if we change $\mathrm{DDP}(\widehat{Y},S)$ to $\rho_{TV}(\widehat{Y},S):=\mathbb{E}_{S}\bigl[ \mathrm{TV}\bigl(P_{\widehat{Y}|S=s} , P_{\widehat{Y}} \bigr) \bigr] $, to show
\begin{equation*}
    \mathbb{E}_{S}\Bigl[\mathrm{TV}\bigl( P_{Y|S=s_{\max}}, P_{\widehat{Y}|S=s}\bigr)\Bigr] \, \le \, \max\{r\epsilon,\sqrt{r\epsilon}\}\bigl( \frac{1}{2} + \frac{1}{4\delta}\bigr). 
\end{equation*}
The proof is therefore complete.
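
As an illustration of the role of the constant $r$, consider the case where the smaller of the two alphabets $\mathcal{S},\mathcal{Y}$ is binary. Then $r=\min\{\vert \mathcal{S}\vert, \vert \mathcal{Y}\vert \} - 1 = 1$, and the bound for the maximal correlation case becomes
\begin{equation*}
    \mathbb{E}_{S}\Bigl[\mathrm{TV}\bigl( P_{Y|S=s_{\max}}, P_{\widehat{Y}|S=s}\bigr)\Bigr] \, \le \, \max\{\epsilon,\sqrt{\epsilon}\}\Bigl( \frac{1}{2} + \frac{1}{4\delta}\Bigr),
\end{equation*}
which coincides with the bound obtained in the ERMI case.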


\subsubsection{Proof of Theorem \ref{Thm: Non-deterministic}}

First, we note that under the assumption in Remark~\ref{Assumption: distribution}, there exists a function $\phi:\mathcal{X}\times\mathcal{Y}\rightarrow \mathbb{R}$ for which  $P_{\mathbf{X},Y,S}$ satisfies the following equation on the ratio $P_{\mathbf{X}|Y,S} / P_{\mathbf{X}|S}$:
\begin{equation*}
    \forall \mathbf{x}\in\mathcal{X},y\in \mathcal{Y},s\in\mathcal{S}:\quad \frac{P\bigl(\mathbf{x}\, \big\vert\, y,s\bigr)}{P\bigl(\mathbf{x}\,\big\vert\, s\bigr)} \, = \, \phi\bigl(\mathbf{x},y\bigr).
\end{equation*}
The above holds, since given the assumption in Remark~\ref{Assumption: distribution} we can decompose $\mathbf{X}=[X_0 = g(S), \widetilde{X}]$ such that $\widetilde{X} \bot S$ and $\widetilde{X} \bot S|Y$, implying that 
\begin{align*}
    \frac{P\bigl(\mathbf{x}\, \big\vert\, y,s\bigr)}{P\bigl(\mathbf{x}\,\big\vert\, s\bigr)} \, &= \, \frac{P\bigl([x_0,\widetilde{\mathbf{x}}]\, \big\vert\, y,s\bigr)}{P\bigl([x_0,\widetilde{\mathbf{x}}]\,\big\vert\, s\bigr)} \\
    &= \, \frac{P\bigl(x_0\, \big\vert\, y,s\bigr)P\bigl(\widetilde{\mathbf{x}}\, \big\vert\, y,s\bigr)}{P\bigl(x_0\, \big\vert\, s\bigr)P\bigl(\widetilde{\mathbf{x}}\, \big\vert\, s\bigr)} \\
    &= \, \frac{P\bigl(x_0\, \big\vert\, s\bigr)P\bigl(\widetilde{\mathbf{x}}\, \big\vert\, y\bigr)}{P\bigl(x_0\, \big\vert\, s\bigr)P\bigl(\widetilde{\mathbf{x}}\bigr)} \\
    &= \, \frac{P\bigl(\widetilde{\mathbf{x}}\, \big\vert\, y\bigr)}{P\bigl(\widetilde{\mathbf{x}}\bigr)}.
\end{align*}
To prove Theorem~\ref{Thm: Non-deterministic}, we can follow the initial steps of Theorem~\ref{Thm: Theorem 1 for DDP}'s proof, which did not use the assumption $Y=h(\mathbf{X},S)$, to show the following holds for every feasible distribution $Q_{\widehat{Y}|\mathbf{X},S}$ satisfying the constraint in \eqref{Eq: Fair Classification 2}
\begin{align*}   \mathbb{E}_{ P_{\mathbf{X},S}}\Bigl[ \ell_{TV}\bigl( Q_{\widehat{Y}|\mathbf{X}=\mathbf{x},S=s} , P_{Y|\mathbf{X}=\mathbf{x},S=s}\bigr)\Bigr] &\ge \mathbb{E}_{P_{\mathbf{S}}}\Bigl[ \mathrm{TV}\Bigl( P_{Y|S=s}, P_{\widehat{Y}} \Bigr)\Bigr] - \frac{\epsilon}{2}.
\end{align*}
Next, we consider the following conditional distribution: $\widetilde{Q}_{\widehat{Y}|\mathbf{X},S}(\hat{y}|\mathbf{x},s)= P_{Y|S=s_{\max}}(\hat{y}|s_{\max})\phi(\mathbf{x},\hat{y})$.
We note that $\widetilde{Q}_{\widehat{Y}|\mathbf{X},S}$ is a valid conditional distribution under joint distribution $P_{\mathbf{X},S}$ because for every $s\in \mathcal{S},\, \hat{y}\in\mathcal{Y}$:
\begin{align*}
    \sum_{\mathbf{x}\in\mathcal{X}} P_{\mathbf{X}|S}(\mathbf{x}|S=s)\widetilde{Q}_{\widehat{Y}|\mathbf{X},S}(\hat{y}|\mathbf{x},s)  \, &=\, 
 \sum_{\mathbf{x}\in\mathcal{X}} P_{\mathbf{X}|S}(\mathbf{x}|S=s) P_{Y|S=s_{\max}}(\hat{y}|s_{\max})\phi(\mathbf{x},\hat{y}) \\
 &=\, P_{Y|S=s_{\max}}(\hat{y}|s_{\max})\sum_{\mathbf{x}\in\mathcal{X}} P_{\mathbf{X}|S}(\mathbf{x}|S=s) \phi(\mathbf{x},\hat{y})  \\
 &=\, P_{Y|S=s_{\max}}(\hat{y}|s_{\max})\sum_{\mathbf{x}\in\mathcal{X}} P_{\mathbf{X}|Y,S}(\mathbf{x}|Y=\hat{y},S=s) \\
 &=\, P_{Y|S=s_{\max}}(\hat{y}|s_{\max}),
\end{align*}
which does not depend on $s$, implying that $\widehat{Y}$ and $S$ are independent under the valid joint distribution $\widetilde{Q}_{\widehat{Y}|\mathbf{X},S}\times P_{\mathbf{X},S}$. Therefore, $\widetilde{Q}_{\widehat{Y}|\mathbf{X},S}$ is a feasible conditional distribution in optimization problem \eqref{Eq: Fair Classification 2}, implying that under the optimal solution $Q^*_{\widehat{Y}|\mathbf{X},S}$ we will have
\begin{align*}    \mathbb{E}_{P_{\mathbf{S}}}\Bigl[ \mathrm{TV}\Bigl( P_{Y|S=s}, P_{\widehat{Y}} \Bigr)\Bigr] - \frac{\epsilon}{2} \, &\le \, \mathbb{E}_{ P_{\mathbf{X},S}}\Bigl[ \ell_{TV}\bigl( \widetilde{Q}_{\widehat{Y}|\mathbf{X}=\mathbf{x},S=s} , P_{Y|\mathbf{X}=\mathbf{x},S=s}\bigr)\Bigr] \\
& = \, \mathrm{TV}\bigl( \widetilde{Q}_{\widehat{Y}|\mathbf{X},S}\times P_{\mathbf{X},S}  , P_{Y,\mathbf{X},S}\bigr) \\
& = \, \mathrm{TV}\bigl( P_{Y|S=s_{\max}}\times \phi(\mathbf{X},Y)P_{\mathbf{X},S}  , P_{Y|S}\times \phi(\mathbf{X},Y)P_{\mathbf{X},S}\bigr)\\
& = \, \mathrm{TV}\bigl( P_{Y|S=s_{\max}}\times P_{S}P_{\mathbf{X}|Y,S} , P_{Y|S}\times P_{S}P_{\mathbf{X}|Y,S}\bigr) \\
 & = \, \mathrm{TV}\bigl( P_{Y|S=s_{\max}}\times P_{S} , P_{Y|S}\times P_{S}\bigr) \\
 & = \, \mathbb{E}_S\Bigl[\mathrm{TV}\bigl( P_{Y|S=s_{\max}} , P_{Y|S}\bigr)\Bigr]
\end{align*}
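The last equality above can be expanded as follows. This is a sketch assuming finite alphabets and the convention that $\mathrm{TV}$ is half the $\ell_1$-distance; the identity itself does not depend on the normalization constant.
\begin{align*}
    \mathrm{TV}\bigl( P_{Y|S=s_{\max}}\times P_{S} , P_{Y|S}\times P_{S}\bigr) \, &= \, \frac{1}{2}\sum_{s\in\mathcal{S}}\sum_{y\in\mathcal{Y}} \Bigl\vert P_{Y|S}(y|s_{\max})P_S(s) - P_{Y|S}(y|s)P_S(s) \Bigr\vert \\
    &= \, \sum_{s\in\mathcal{S}} P_S(s)\, \frac{1}{2}\sum_{y\in\mathcal{Y}} \Bigl\vert P_{Y|S}(y|s_{\max}) - P_{Y|S}(y|s) \Bigr\vert \\
    &= \, \mathbb{E}_S\Bigl[\mathrm{TV}\bigl( P_{Y|S=s_{\max}} , P_{Y|S=s}\bigr)\Bigr].
\end{align*}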
As a result, we have the following inequality for the optimal solution $Q^*_{\hat{Y}|\mathbf{X},S}$ and the constructed $\widetilde{Q}_{\hat{Y}|\mathbf{X},S}$, under which $\widehat{Y}$ is independent of $S$ with the marginal distribution $\widetilde{Q}_{\hat{Y}} = P_{Y|S=s_{\max}}$:
\begin{align*}    \mathbb{E}_{P_{\mathbf{S}}}\Bigl[ \mathrm{TV}\Bigl( P_{Y|S=s}, P_{\widehat{Y}} \Bigr)\Bigr] - \mathbb{E}_S\Bigl[\mathrm{TV}\bigl( P_{Y|S=s_{\max}} , P_{Y|S}\bigr)\Bigr] \, &\le \frac{\epsilon}{2}.
\end{align*}
Therefore, we can follow the proofs of Theorems~\ref{Thm: Theorem 1 for DDP} and~\ref{Theorem 2: }, which show that the above inequality leads to the bounds claimed in the theorems.

\subsubsection{Proof of Theorem~\ref{Theorem: Extended of Theorem 3}}


We define the function $\phi_s (\mathbf{x},y):= \frac{p\bigl(\mathbf{x} | y,s\bigr)}{p\bigl(\mathbf{x} | s\bigr)}$
for the true distribution $P_{\mathbf{X},Y,S}$. Then, in particular,
\begin{equation*}
  \phi_{s_{\max}} (\mathbf{x},y)= \frac{p\bigl(\mathbf{x} | y,s_{\max}\bigr)}{p\bigl(\mathbf{x} | s_{\max}\bigr)}  
\end{equation*}
Note that we can follow the initial steps of the proof of Theorem~\ref{Thm: Theorem 1 for DDP}, which do not use the assumption $Y=h(\mathbf{X},S)$, to show that the following holds for every $P_{\widehat{Y},\mathbf{X},S}=Q_{\widehat{Y}|\mathbf{X},S} \cdot P_{\mathbf{X},S}$ corresponding to a feasible distribution $Q_{\widehat{Y}|\mathbf{X},S}$ satisfying the constraint in \eqref{Eq: Fair Classification 2}:
\begin{align*}   \mathbb{E}_{ P_{\mathbf{X},S}}\Bigl[ \ell_{TV}\bigl( Q_{\widehat{Y}|\mathbf{X}=\mathbf{x},S=s} , P_{Y|\mathbf{X}=\mathbf{x},S=s}\bigr)\Bigr]\: &\ge \: \mathbb{E}_{P_{\mathbf{S}}}\Bigl[ \mathrm{TV}\Bigl( P_{Y|S=s}, P_{\widehat{Y}} \Bigr)\Bigr] - \mathbb{E}_{P_{\mathbf{S}}}\Bigl[ \mathrm{TV}\Bigl( P_{\widehat{Y}|S=s}, P_{\widehat{Y}} \Bigr)\Bigr]\\
&\ge \:\mathbb{E}_{P_{\mathbf{S}}}\Bigl[ \mathrm{TV}\Bigl( P_{Y|S=s}, P_{\widehat{Y}} \Bigr)\Bigr] - \epsilon.
\end{align*}
Next, we consider the following conditional distribution:
\begin{equation*}
    \widetilde{Q}_{\widehat{Y}|\mathbf{X},S}(y|\mathbf{x},s)= P_{Y|\mathbf{X},S}(y|\mathbf{x},s_{\max}) = P_{Y|S=s_{\max}}(y|s_{\max})\phi_{s_{\max}}(\mathbf{x},{y}).
\end{equation*}
Clearly, $\widetilde{Q}_{\widehat{Y}|\mathbf{X},S}$ is a valid conditional distribution. Considering the resulting joint distribution  $\widetilde{Q}_{\widehat{Y},\mathbf{X},S} := P_{\mathbf{X},S}\widetilde{Q}_{\widehat{Y}|\mathbf{X},S}$, we have for every $s\in \mathcal{S},\, \hat{y}\in\mathcal{Y}$:
\begin{align*}
    \widetilde{Q}_{\widehat{Y}|S}(\hat{y}|s) \, &=\,\sum_{\mathbf{x}\in\mathcal{X}} P_{\mathbf{X}|S}(\mathbf{x}|S=s)\widetilde{Q}_{\widehat{Y}|\mathbf{X},S}(\hat{y}|\mathbf{x},s) \\
    \, &=\, 
 \sum_{\mathbf{x}\in\mathcal{X}} P_{\mathbf{X}|S}(\mathbf{x}|S=s) P_{Y|S=s_{\max}}(\hat{y}|s_{\max})\phi_{s_{\max}}(\mathbf{x},\hat{y}) \\
 &=\, P_{Y|S=s_{\max}}(\hat{y}|s_{\max})\sum_{\mathbf{x}\in\mathcal{X}} P_{\mathbf{X}|S}(\mathbf{x}|S=s) \phi_{s_{\max}}(\mathbf{x},\hat{y}).
\end{align*}
According to the triangle inequality for the TV-distance, we have
\begin{align*}
    \mathbb{E}_{s\sim P_S}\Bigl[\mathrm{TV}\bigl(\widetilde{Q}_{\widehat{Y}|S=s}, \widetilde{Q}_{\widehat{Y}}\bigr)\Bigr]\leq \mathbb{E}_{s\sim P_S}\Bigl[\mathrm{TV}\bigl(\widetilde{Q}_{\widehat{Y}|S=s}, P_{Y|S=s_{\max}}\bigr)\Bigr]+\mathrm{TV}\bigl(P_{Y|S=s_{\max}}, \widetilde{Q}_{\widehat{Y}}\bigr).
\end{align*}
Thus, to show that $\widetilde{Q}_{\widehat{Y}|\mathbf{X},S}$ is a feasible conditional distribution in optimization problem \eqref{Eq: Fair Classification 2} with the TV-based measure $\rho_{TV}$, it suffices to show that
\begin{equation*}
    \mathbb{E}_{s\sim P_S}\Bigl[\mathrm{TV}\bigl(\widetilde{Q}_{\widehat{Y}|S=s}, P_{Y|S=s_{\max}}\bigr)\Bigr]\leq \frac{\epsilon}{2} \qquad \text{and} \qquad \mathrm{TV}\bigl(P_{Y|S=s_{\max}}, \widetilde{Q}_{\widehat{Y}}\bigr)\leq \frac{\epsilon}{2}.
\end{equation*}
To show the former, 
 we can write the following inequalities:
\begin{align*}
    \Bigl\vert \widetilde{Q}_{\widehat{Y}|S}(\hat{y}|s) - P_{Y|S}(\hat y|s_{\max}) \Bigr\vert \, &=\, P_{Y|S=s_{\max}}(\hat{y}|s_{\max})\Bigl\vert  1 - \sum_{\mathbf{x}\in\mathcal{X}} P_{\mathbf{X}|S}(\mathbf{x}|S=s) \phi_{s_{\max}}(\mathbf{x},\hat{y})\Bigr\vert \\
    &=\, P_{Y|S=s_{\max}}(\hat{y}|s_{\max})\Bigl\vert \sum_{\mathbf{x}\in\mathcal{X}} P_{\mathbf{X}|S}(\mathbf{x}|S=s) \bigl( \phi_{s_{\max}}(\mathbf{x},\hat{y}) - {\phi}_s(\mathbf{x},\hat{y})\bigr)\Bigr\vert \\
    &\le\, \Bigl\vert \sum_{\mathbf{x}\in\mathcal{X}} P_{Y|S=s_{\max}}(\hat{y}|s_{\max}) P_{\mathbf{X}|S}(\mathbf{x}|S=s) \bigl( {\phi}_U(\mathbf{x},\hat{y}) - {\phi}_L(\mathbf{x},\hat{y})\bigr)\Bigr\vert
\\&=\sum_{\mathbf{x}\in\mathcal{X}} P_{Y|S=s_{\max}}(\hat{y}|s_{\max}) P_{\mathbf{X}|S}(\mathbf{x}|S=s){\Delta}(\mathbf{x},\hat{y}).
\end{align*}
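The second equality above relies on the normalization of $\phi_s$; the following one-line check is included for completeness. By the definition of $\phi_s$, for every $s\in\mathcal{S}$ and $\hat{y}\in\mathcal{Y}$,
\begin{equation*}
    \sum_{\mathbf{x}\in\mathcal{X}} P_{\mathbf{X}|S}(\mathbf{x}|S=s)\, \phi_{s}(\mathbf{x},\hat{y}) \, = \, \sum_{\mathbf{x}\in\mathcal{X}} P_{\mathbf{X}|Y,S}(\mathbf{x}|Y=\hat{y},S=s) \, = \, 1,
\end{equation*}
so the constant $1$ in the first line can be replaced by this sum.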
As a result,
\begin{align*}
    \mathbb{E}_{s\sim P_S}\Bigl[\mathrm{TV}\bigl(\widetilde{Q}_{\widehat{Y}|S=s}, P_{Y|S=s_{\max}}\bigr)\Bigr] \,&\le\, 
\sum_{s}P_S(s)
\sum_{\mathbf{x}\in\mathcal{X},\hat{y}\in\mathcal{Y}} P_{Y|S=s_{\max}}(\hat{y}|s_{\max}) P_{\mathbf{X}|S}(\mathbf{x}|S=s){\Delta}(\mathbf{x},\hat{y})
    \\&=
\sum_{\mathbf{x}\in\mathcal{X},\hat{y}\in\mathcal{Y}} P_{Y|S=s_{\max}}(\hat{y}|s_{\max}) P_{\mathbf{X}}(\mathbf{x}){\Delta}(\mathbf{x},\hat{y})
   \\
    &=\, \mathbb{E}_{ Y\sim P_{Y|S=s_{\max}} , \mathbf{X}\sim P_{\mathbf{X}}}\Bigl[{\Delta}(\mathbf{X},Y) \Bigr] \\
    &\le \frac{\epsilon}{2}, 
\end{align*}
where the last line follows from the theorem's assumption. Next, we have
\begin{align*}
    \mathrm{TV}\bigl(P_{Y|S=s_{\max}}, \widetilde{Q}_{\widehat{Y}}\bigr) \,&=\,\sum_{\hat y}P_{Y|S=s_{\max}}(\hat{y}|s_{\max})\Bigl\vert \sum_{s}\sum_{\mathbf{x}\in\mathcal{X}} P_S(s) P_{\mathbf{X}|S}(\mathbf{x}|S=s) \phi_{s_{\max}}(\mathbf{x},\hat{y})-1\Bigr\vert \\
    &=\,\sum_{\hat y}P_{Y|S=s_{\max}}(\hat{y}|s_{\max})\Bigl\vert \sum_{s}\sum_{\mathbf{x}\in\mathcal{X}} P_S(s) P_{\mathbf{X}|S}(\mathbf{x}|S=s) \bigl(\phi_{s_{\max}}(\mathbf{x},\hat{y})-\phi_{s}(\mathbf{x},\hat{y})\bigr)\Bigr\vert \\
    &\leq \,\sum_{\hat y}P_{Y|S=s_{\max}}(\hat{y}|s_{\max})\Bigl\vert \sum_{s}\sum_{\mathbf{x}\in\mathcal{X}} P_S(s) P_{\mathbf{X}|S}(\mathbf{x}|S=s) \Delta(\mathbf{x},\hat{y})\Bigr\vert \\
    &=\, \mathbb{E}_{ Y\sim P_{Y|S=s_{\max}} , \mathbf{X}\sim P_{\mathbf{X}}}\Bigl[{\Delta}(\mathbf{X},Y) \Bigr] \\
    &\le \frac{\epsilon}{2}.
\end{align*}

Therefore, $\widetilde{Q}_{\widehat{Y}|\mathbf{X},S}$ is a feasible conditional distribution in optimization problem \eqref{Eq: Fair Classification 2} with the TV-based measure $\rho_{TV}$. This fact
implies that under the optimal solution $Q^*_{\widehat{Y}|\mathbf{X},S}$ we will have
\begin{align*}    \mathbb{E}_{P_{\mathbf{S}}}\Bigl[ \mathrm{TV}\Bigl( P_{Y|S=s}, P_{\widehat{Y}} \Bigr)\Bigr] - \epsilon \, &\le \, \mathbb{E}_{ P_{\mathbf{X},S}}\Bigl[ \ell_{TV}\bigl( \widetilde{Q}_{\widehat{Y}|\mathbf{X}=\mathbf{x},S=s} , P_{Y|\mathbf{X}=\mathbf{x},S=s}\bigr)\Bigr] \\
& = \, \mathrm{TV}\bigl( \widetilde{Q}_{\widehat{Y}|\mathbf{X},S}\times P_{\mathbf{X},S}  , P_{Y,\mathbf{X},S}\bigr) \\
& = \, \mathrm{TV}\bigl( P_{Y|S=s_{\max}}\times \phi_{s_{\max}}(\mathbf{X},Y)P_{\mathbf{X},S}  , P_{Y|S}\times \phi_S(\mathbf{X},Y)P_{\mathbf{X},S}\bigr)\\
& \le \, \mathrm{TV}\bigl( P_{Y|S=s_{\max}}\times \phi_{s_{\max}}(\mathbf{X},Y)P_{\mathbf{X},S}  , P_{Y|S=s_{\max}}\times \phi_S(\mathbf{X},Y)P_{\mathbf{X},S}\bigr) \, \\
&\;+ \,\mathrm{TV}\bigl( P_{Y|S=s_{\max}}\times \phi_S(\mathbf{X},Y)P_{\mathbf{X},S} , P_{Y|S}\times \phi_S(\mathbf{X},Y)P_{\mathbf{X},S}\bigr)
\\
&\le \, \mathbb{E}_{P_{\mathbf{X}}P_{Y|S=s_{\max}}}\Bigl[\phi_{U}(\mathbf{X},Y) - \phi_{L}(\mathbf{X},Y)\Bigr]\\
& \; + \, \mathrm{TV}\bigl( P_{Y|S=s_{\max}}\times P_{S}P_{\mathbf{X}|Y,S} , P_{Y|S}\times P_{S}P_{\mathbf{X}|Y,S}\bigr) \\
 & = \, \mathbb{E}_{P_{\mathbf{X}}P_{Y|S=s_{\max}}}\Bigl[\phi_{U}(\mathbf{X},Y) - \phi_{L}(\mathbf{X},Y)\Bigr] + \mathbb{E}_S\Bigl[\mathrm{TV}\bigl( P_{Y|S=s_{\max}} , P_{Y|S}\bigr)\Bigr]
\end{align*}
As a result, we have the following inequality for the optimal solution $Q^*_{\hat{Y}|\mathbf{X},S}$ and the constructed $\widetilde{Q}_{\hat{Y}|\mathbf{X},S}$, which satisfies the $\rho_{TV}$ constraint and whose marginal distribution satisfies $\mathrm{TV}(\widetilde{Q}_{\hat{Y}} , P_{Y|S=s_{\max}}) \le \mathbb{E}_{P_{\mathbf{X}}P_{Y|S=s_{\max}}}\bigl[\Delta(\mathbf{X},Y) \bigr]$:
\begin{align*}    &\mathbb{E}_{P_{\mathbf{S}}}\Bigl[ \mathrm{TV}\Bigl( P_{Y|S=s}, {\widetilde{Q}}_{\widehat{Y}} \Bigr)\Bigr] - \mathbb{E}_{P_{\mathbf{S}}}\Bigl[\mathrm{TV}\bigl( P_{Y|S=s_{\max}} , P_{Y|S}\bigr)\Bigr] \, \le \epsilon + \mathbb{E}_{P_{\mathbf{X}}P_{Y|S=s_{\max}}}\Bigl[\Delta(\mathbf{X},Y) \Bigr] \\
\Longrightarrow\quad & \mathbb{E}_{P_{\mathbf{S}}}\Bigl[ \mathrm{TV}\Bigl( P_{Y|S=s}, P_{Y|S=s_{\max}} \Bigr)\Bigr] - \mathbb{E}_{P_{\mathbf{S}}}\Bigl[\mathrm{TV}\bigl( P_{Y|S=s_{\max}} , P_{Y|S}\bigr)\Bigr] \, \le \epsilon + 2\,\mathbb{E}_{P_{\mathbf{X}}P_{Y|S=s_{\max}}}\Bigl[\Delta(\mathbf{X},Y) \Bigr]
\end{align*}
Therefore, we can repeat the final step of the proof of Theorem~\ref{Thm: Theorem 1 for DDP}, which shows that the above inequality results in the following bound claimed in the theorem:
\begin{align*}
    \mathbb{E}_{s\sim P_S}\Bigl[\mathrm{TV}\Bigl( P_{\widehat{Y}|S=s} , P_{{Y}|S=s_{\max}}\Bigr) \Bigr] \, &\le \: \Bigl(1+\frac{1}{2\delta}\Bigr)\Bigl(\epsilon + 2\,\mathbb{E}_{P_{\mathbf{X}} P_{Y|S=s_{\max}}}\bigl[\Delta(\mathbf{X},Y) \bigr]\Bigr) \\
    &\le\: \Bigl(1+\frac{1}{2\delta}\Bigr)2\epsilon,
\end{align*}
where the last inequality uses the theorem's assumption $\mathbb{E}_{P_{\mathbf{X}} P_{Y|S=s_{\max}}}\bigl[\Delta(\mathbf{X},Y) \bigr] \le \frac{\epsilon}{2}$.


\subsection{Additional Numerical Results}
\subsubsection{Inductive biases for multi-class sensitive attributes}

We perform fair learning experiments on the COMPAS and Adult datasets where, instead of a binary $S$, we consider a 4-ary sensitive attribute formed by merging the binary gender and race variables. The results for two different distributions of this attribute are shown in Figure~\ref{multi-COMPAS} and Figure~\ref{multi_Adult}.

\begin{figure*}[htbp]
\centering  
\includegraphics[width=0.95\columnwidth]{Multi_COMPAS.png}
\caption{Application of the KDE method~\citep{cho2020bfair} on the COMPAS dataset with multiple sensitive subgroups in two different proportions.}
\label{multi-COMPAS}

\centering  
\includegraphics[width=0.95\columnwidth]{Multi_Adult.png}
\caption{Application of the KDE method~\citep{cho2020bfair} on the Adult dataset with multiple sensitive subgroups in two different proportions.}
\label{multi_Adult}
\end{figure*}

\subsubsection{Comparison between SA-DRO and imbalanced learning method}

One particular advantage of the proposed SA-DRO approach is its flexibility in tuning the level of bias reduction: by varying the DRO coefficient over $[0,\infty)$, the learner can explore the spectrum between the original imbalanced distribution and the fully balanced (uniform) distribution on the sensitive attribute $S$. Addressing the imbalanced distribution comes at the price of a lower accuracy, and the trade-off between accuracy and bias reduction can be controlled by varying the coefficient of the DRO regularization term, as shown in Figure~\ref{LDAM}.

\begin{figure*}[htbp]
\centering  
\includegraphics[width=0.95\columnwidth]{LDAM.png}
\caption{Application of fairness-aware LDAM~\citep{cao2019learning}, which aims to balance the sensitive attribute, compared with SA-DRO.}
\label{LDAM}
\end{figure*}

\subsubsection{SA-DRO for distributed image classification}

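Applying SA-DRO methods to the CelebA dataset in a federated learning setting, we found that they achieve an accuracy on Client 1 similar to that of Client 1's local model, while preserving the accuracy for the majority clients and maintaining a similar level of DDP, as shown in Table~\ref{table-celeba}. For example, SA-DRO-KDE attains 69.0\% accuracy on Client 1, compared with 69.7\% for the locally trained KDE model and 65.6\% for FedKDE.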
\begin{table*}[htbp]
\centering
    \caption{Accuracy and DDP on the distributed CelebA dataset.}
    \resizebox{0.5\textwidth}{!}{
        \begin{tabular}{@{}lcccc@{}}
        \toprule
         & \multicolumn{2}{c}{\textbf{Client 1 (Minority)}} & \multicolumn{2}{c}{\textbf{Clients 2--5 (Majority)}}\\
        \cmidrule(l){2-3}
        \cmidrule(l){4-5}
         & Acc($\uparrow$) & DDP($\downarrow$) & Acc($\uparrow$) & DDP($\downarrow$)\\
        \midrule
        FedAvg & 94.8\% & 0.380 & 94.3\% & 0.419 \\
        ERM(Local) & 91.8\% & 0.374 & 90.3\% & 0.396 \\
        \midrule
        FedKDE & 65.6\% & 0.088 & 88.8\% & 0.060 \\
        FedFACL & 67.0\% & 0.068 & 88.5\% & 0.054 \\
        \midrule
        \textbf{SA-DRO-KDE} & \textbf{69.0\%} & 0.063 & 88.1\% & 0.062\\
        \textbf{SA-DRO-FACL} & \textbf{68.5\%} & 0.057 & 87.7\% & 0.069\\
        \midrule
        KDE(Local) & 69.7\% & 0.055 & 87.6\% & 0.043\\
        FACL(Local) & 69.1\% & 0.067 & 87.5\% & 0.050 \\
        \bottomrule
    \end{tabular}
    }
    \label{table-celeba}
\end{table*}

\end{document}



