\section{Active AMF Learning}

In this section, we propose an active AMF 
learner based on Definition \ref{def_AMFlearner} 
and derive its sample complexity. 

Our key idea is to label instances 
that are close to their neighbors 
yet receive substantially different predictions from 
some hypothesis. We characterize such 
pairs of instances by the set 
\begin{align}
\label{eq:contrset}
\begin{split}
\mathcal{C}_{\alpha,\beta}(H) 
= \{& (x,x') \in X \times X; \exists h \in H, \\ 
&\ d(x,x') \leq \alpha,\ |h(x) - h(x')| > \beta\}.
\end{split}
\end{align}
Next, we define a counter AMF coefficient, 
which will be used to derive the sample complexity. 
\begin{definition}
The counter $(\alpha,\beta)$ AMF coefficient 
with respect to a hypothesis class $H$ is 
\begin{equation}
\label{def:coef}
\xi_{\alpha,\beta} = \sup_{r > 0} \frac{\Pr \{ 
(x,x') \in \mathcal{C}_{\alpha,\beta} 
(\mathcal{B}_{\alpha,\beta}(r)) \}}{r}, 
\end{equation}
where $\mathcal{B}_{\alpha,\beta}(r) 
= \{ h \in H; \Delta_{\alpha,\beta} (h) \leq r\}$ 
is the set of hypotheses that are 
$(\alpha,\beta,r)$ AMF.
\end{definition}
Intuitively, the coefficient measures, relative to 
the unfairness level $r$, the largest probability mass 
of instance pairs on which some $(\alpha,\beta,r)$ AMF 
hypothesis in $H$ still makes predictions that differ 
by more than $\beta$. We expect it to be 
smaller when the hypotheses in $H$ are fairer. 
For conciseness, we will omit the subscripts 
in $\xi_{\alpha,\beta}$ whenever they are 
clear from the context.
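
For later use, we spell out an immediate consequence of (\ref{def:coef}). 
Since a supremum dominates each of its terms, whenever $\xi_{\alpha,\beta}$ 
is finite we have, for every $r > 0$, 
\begin{equation*}
\Pr \{ (x,x') \in \mathcal{C}_{\alpha,\beta} 
(\mathcal{B}_{\alpha,\beta}(r)) \} 
\leq \xi_{\alpha,\beta} \cdot r. 
\end{equation*}
In words, the probability mass of pairs on which some 
$(\alpha,\beta,r)$ AMF hypothesis is unfair grows at most linearly in $r$. 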

The proposed active AMF learner is shown 
in Algorithm \ref{alg:optPAMFL2}. In each 
round, it trains a model $h$ on the labeled 
set using the AMF learner, and then labels 
instances that are close to the training 
data but receive different predictions from $h$. 
By construction, every labeled instance 
belongs to a pair in $\mathcal{C}_{\alpha,\beta}(H)$. 
The fairness coefficients $\alpha,\beta$ are 
assumed to be preset by the problem, and labeling 
can stop once a desired AMF degree is achieved. 

The following theorem shows that, under 
proper conditions, Algorithm \ref{alg:optPAMFL2} returns a model satisfying $(\alpha,\beta,\varepsilon)$ AMF using 
$O(\log \frac{1}{\varepsilon})$ label queries with 
high probability. 

\begin{theorem}
\label{thm:labelcomplexity}
Fix any $\alpha, \beta > 0$. 
If the counter $(\alpha,\beta)$ AMF 
coefficient w.r.t. $H$ is bounded, then 
with probability at least $1 - \delta$, 
any $h \in H$ returned by Algorithm 
\ref{alg:optPAMFL2} satisfies 
$\Delta_{\alpha,\beta}(h) \leq \varepsilon$ 
after $O(\log\frac{1}{\varepsilon})$ label queries. 
\end{theorem}
\begin{proof}[Proof Sketch] 
Let $V_q = \{ h \in H; 
\Delta_{\alpha, \beta}(h; S_{q}) = 0\}$ 
be the set of `perfect' AMF models at the 
end of $q$ rounds of labeling.
The goal of our analysis is to show that, 
if we label 
$k = \frac{1}{4\xi^2} \left( 32 c/\beta 
+ \sqrt{\frac{1}{2} \log \frac{1}{\delta'}}\right)$ 
instances in each round, then by the generalization 
bound in Theorem \ref{thm:generalization}, we have 
\begin{equation}
\label{eq:thm2_proof_key}
\Pr\{ \mathcal{C}_{\alpha,\beta}(V_{q+1})\} 
\leq \frac{1}{2} \Pr\{ \mathcal{C}_{\alpha,\beta}(V_{q})\}
\end{equation}
with high probability. 
Hence $Q = \lceil \log_{2} \frac{1}{\varepsilon} \rceil$ 
rounds of labeling, i.e., $Q k \in O(\log \frac{1}{\varepsilon})$ labels in total, 
suffice for $\Pr\{ \mathcal{C}_{\alpha,\beta}(V_{Q})\} \leq \varepsilon$. 
Since $\Delta_{\alpha,\beta}(h) \leq \Pr\{ \mathcal{C}_{\alpha,\beta}(V_{q})\}$ for 
any $h \in V_{q}$ by definition, the theorem 
follows. 
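
To spell out the round count, write $V_0$ for the set of 
hypotheses before any labeling (notation introduced here 
only for illustration) and assume that 
(\ref{eq:thm2_proof_key}) holds in every round. 
Unrolling the recursion over $Q$ rounds gives 
\begin{equation*}
\Pr\{ \mathcal{C}_{\alpha,\beta}(V_{Q})\} 
\leq 2^{-Q} \Pr\{ \mathcal{C}_{\alpha,\beta}(V_{0})\} 
\leq 2^{-Q}, 
\end{equation*}
where the last step uses that a probability is at most one. 
The right-hand side is at most $\varepsilon$ once 
$Q \geq \log_2 \frac{1}{\varepsilon}$; for example, 
$\varepsilon = 0.01$ requires 
$Q = \lceil \log_2 100 \rceil = 7$ rounds. 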

Let $\&$ denote logical `AND' and define the event 
\begin{equation}
I_{\alpha}^{\beta}(x,x';h) 
:= d(x,x') \leq \alpha\ \&\ 
|h(x) - h(x')| > \beta. 
\end{equation}
The key to proving (\ref{eq:thm2_proof_key}) 
is to split the domain of 
$\Delta_{\alpha,\beta}(h) = \Pr\{ I_{\alpha}^{\beta}(x,x';h) \}$ for 
any $h \in V_{q+1}$ into 
$(x,x') \in \mathcal{C}_{\alpha,\beta}(V_q)$
and $(x,x') \notin \mathcal{C}_{\alpha,\beta}(V_q)$. 
The probability on the second subdomain is zero, 
since no hypothesis in $V_q$, and hence none in 
$V_{q+1} \subseteq V_q$, violates the AMF condition on a pair 
outside $\mathcal{C}_{\alpha,\beta}(V_q)$. 
The probability on the first subdomain can be 
bounded using Theorem \ref{thm:generalization} 
conditioned on the fact that all labeled instances 
fall in $\mathcal{C}_{\alpha,\beta}(V_q)$. 
This conditional bound is smaller than $\frac{1}{2 \xi}$ 
by our choice of $k$, so that 
$\Delta_{\alpha,\beta}(h) \leq \frac{\Pr\{ \mathcal{C}_{\alpha,\beta}(V_q) \}}{2 \xi}$ 
for every $h \in V_{q+1}$, i.e., 
$V_{q+1} \subseteq \mathcal{B}_{\alpha,\beta}\left(\frac{\Pr\{ \mathcal{C}_{\alpha,\beta}(V_q) 
\}}{2 \xi}\right)$. Consequently, 
\begin{align*}
\Pr\{ \mathcal{C}_{\alpha,\beta}(V_{q+1})\} 
&\leq \Pr\left\{ \mathcal{C}_{\alpha,\beta} \left( \mathcal{B}_{\alpha,\beta}\left(\frac{\Pr\{ 
\mathcal{C}_{\alpha,\beta}(V_q) \}}{
2 \xi}\right)\right)\right\} \\
&\leq \xi \cdot \frac{\Pr\{ \mathcal{C}_{\alpha,\beta}(V_q) \}}{2 \xi}
= \frac{\Pr\{\mathcal{C}_{\alpha,\beta}(V_q) \}}{2}, 
\end{align*}
where the second inequality follows from the definition of $\xi$ 
in (\ref{def:coef}). 
This proves (\ref{eq:thm2_proof_key}) and 
thus the theorem. 
\end{proof}

The proof of Theorem \ref{thm:labelcomplexity} 
also shows that the key to reducing the number of 
labeled instances in Algorithm \ref{alg:optPAMFL2} lies in Step 3, 
where we label $u$ if $(u, u') \in \mathcal{C}_{\alpha,\beta}(V_q)$: only 
such pairs can rule out further 
hypotheses in $V_{q}$ and shrink $\mathcal{C}_{\alpha,\beta}(V_q)$, 
which in turn guarantees the shrinkage 
of $\Delta_{\alpha,\beta}(h)$.

An implicit assumption 
of the derived sample complexity is that 
the unlabeled set contains at least 
one instance satisfying (\ref{eq:query}) 
in each epoch until convergence. This is similar 
to the analysis of disagreement-based 
active learning \cite{hanneke2014theory}, which
assumes that the committee models disagree on 
at least one unlabeled instance in each epoch. 
In practice, when no valid 
instance is found, we could train another model 
or label one instance at random and 
proceed to the next epoch. 


We should also mention the time complexity for 
Algorithm \ref{alg:optPAMFL2} to find an instance satisfying (\ref{eq:query}). 
In a centralized computing environment, the complexity 
is $O(|U| |L|)$, where $|U|$ is the size of the unlabeled
set and $|L|$ is the size of the labeled set; typically 
$|L| \ll |U|$. This is higher than the complexity of 
uncertainty-based strategies, which is typically $O(|U|)$, 
but comparable to that of query-by-committee, 
which is typically $O(|U| t)$ for $t$ committee models. 
In a distributed computing environment, the complexity can 
be reduced to $O(|U|)$ if the evaluations of an instance 
$u \in U$ against all $u' \in L$ are parallelized. 
Nonetheless, making the selection more efficient 
remains an open challenge. 
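
As a rough illustration with hypothetical sizes (chosen 
only for the arithmetic, not taken from any experiment), 
let $|U| = 10^5$, $|L| = 10^2$ and $t = 10$. Ignoring 
constants, one selection step then costs about 
\begin{equation*}
|U|\,|L| = 10^{7}, \qquad 
|U|\, t = 10^{6}, \qquad 
|U| = 10^{5} 
\end{equation*}
evaluations for the proposed search, query-by-committee, 
and an uncertainty-based strategy, respectively. 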



\begin{figure}[t!]
     \centering
     \includegraphics[width=.35\textwidth]{figure/Xi_Demo_v2.PNG}
     \caption{Visualization of 
     $\Pr_{x}(I_{\alpha}^{\beta}(x, 0; h))$.}
     \label{fig:xi_demo}
\end{figure}



