\subsection{AMF Learning with Performance 
Generalization Guarantee}

In this section, we present a passive AMF learner 
based on Definition \ref{def:metricfair} and prove
its generalization guarantee. 

To facilitate discussion, define the fairness measure 
\begin{equation}
\label{eq:truedelta}
\Delta_{\alpha,\beta}(h) = \Pr 
\{ d(x, x') \leq \alpha, |h(x) - h(x')| > \beta \}. 
\end{equation}
Then $h$ is said to be 
$(\alpha, \beta,\varepsilon)$-AMF if 
$\Delta_{\alpha,\beta}(h) \leq \varepsilon$. 

Let $S$ be a sample of $X \times X$ with 
cardinality $m$. An estimate of the 
probability $\Delta_{\alpha,\beta}(h)$ on sample $S$ is 
\begin{align}
\label{eq:empdelta}
\begin{split}
\Delta_{\alpha,\beta}(h; S)
= \frac{1}{m} \sum_{(x,x') \in S}
\mathbb{I} \{ & d(x, x') \leq \alpha, \\[-1em] 
&\ |h(x) - h(x')| > \beta \},  
\end{split}
\end{align}
where $\mathbb{I}$ is the indicator function. 
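
As a purely illustrative example with hypothetical numbers (not taken from any experiment), 
let $\alpha = 1$, $\beta = 0.1$ and $m = 4$, and suppose the four pairs in $S$ have 
distances $d(x,x')$ and output gaps $|h(x) - h(x')|$ equal to 
$(0.5, 0.05)$, $(0.8, 0.3)$, $(2.0, 0.4)$ and $(0.9, 0.1)$. 
Only the second pair is both close and treated differently, so 
\begin{equation*}
\Delta_{\alpha,\beta}(h; S) = \frac{1}{4} (0 + 1 + 0 + 0) = 0.25. 
\end{equation*}
The third pair has a large output gap but is not counted because $d(x,x') > \alpha$, 
and the fourth is not counted because its gap equals $\beta$ rather than exceeding it. 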

A natural approach to AMF learning is to 
find a model $h$ with small $\Delta_{\alpha,\beta}(h;S)$, 
with the aim that this generalizes to a 
small $\Delta_{\alpha,\beta}(h)$. 
In this paper, we focus on the realizable 
case, where $H$ contains perfect AMF models 
that satisfy $\Delta_{\alpha,\beta}(h) = 0$. 
Based on this, we define the passive 
AMF learner as follows. 
\begin{definition}
\label{def_AMFlearner}
Given a hypothesis class $H$, a loss function $\ell$ 
and a labeled training set $L = \{ (x_1, y_1), \ldots, (x_n, y_n) \}$, where $x_i$ is the $i$-th instance 
and $y_i$ is its label, an AMF learner returns 
a model $h \in H$ by solving 
\begin{equation}
\label{eq:optPAMFL}
\min_{h \in H}\ \frac{1}{n} \sum_{i=1}^n 
\ell(h(x_i), y_i),\quad \text{s.t.}\ 
\Delta_{\alpha,\beta}(h; S) = 0,  
\end{equation}
where $S = \{ (x_i, x_j) \}_{i,j = 1, \ldots, n}$. 
\end{definition}
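
Since $\Delta_{\alpha,\beta}(h;S)$ is an average of indicators, the constraint in 
(\ref{eq:optPAMFL}) can be unpacked into pairwise constraints. 
The following restatement is a direct consequence of (\ref{eq:empdelta}) 
and is included only for illustration: 
\begin{equation*}
\Delta_{\alpha,\beta}(h;S) = 0 
\iff 
|h(x_i) - h(x_j)| \leq \beta 
\ \text{ for all } i,j \text{ with } d(x_i, x_j) \leq \alpha. 
\end{equation*}
In other words, the learner minimizes the empirical loss over those models in $H$ 
whose outputs differ by at most $\beta$ on every training pair within distance $\alpha$. 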

We can show that the above AMF learner has 
a generalization guarantee similar to that 
in \cite{yona2018probably}, based on the 
following lemma. 
Let $\mathcal{R}_m(\cdot)$ denote the Rademacher
complexity of some hypothesis class for sample size $m$. 
\begin{lemma}
\label{lem:tool_generalization}
Fix any $t, \beta > 0$. 
Let $F$ be the class of functions 
$f: X \times X \rightarrow \mathbb{R}$ induced from $H$, 
such that every $f \in F$ has the form $f(x,x') 
= \tau_{\beta}^t (|h(x) - h(x')|)$ for some $h \in H$, where 
$\tau_{\beta}^{t}(z)$ is the piecewise-linear function 
that outputs $1$ if $z > \beta + \frac{1}{t}$, 
$0$ if $z \leq \beta$, and $t(z-\beta)$ otherwise. 
Then $\mathcal{R}_m(F) \leq 8t \cdot \mathcal{R}_{m}(H)$.
\end{lemma}
\begin{proof}[Proof Sketch]
Repeatedly apply the composition property of Rademacher complexity 
for Lipschitz functions (e.g., \cite[Theorem 12]{bartlett2002rademacher}) to $\tau_{\beta}^t$ and the absolute value function. See the supplementary 
material for details. 
\end{proof}
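
To make the role of $\tau_{\beta}^t$ concrete, the following sketch records the two facts 
the argument relies on. The composition rule is assumed here in the form 
$\mathcal{R}_m(\phi \circ G) \leq 2 L_\phi \mathcal{R}_m(G)$ for an $L_\phi$-Lipschitz 
$\phi$ with $\phi(0) = 0$, together with subadditivity over sums of classes; 
the class $G$ below is introduced only for this illustration, and the exact 
bookkeeping in the supplementary material may differ. 
First, $\tau_{\beta}^t$ is $t$-Lipschitz and lies between two indicators, 
\begin{equation*}
\mathbb{I}\{ z > \beta + 1/t \} \leq \tau_{\beta}^t(z) \leq \mathbb{I}\{ z > \beta \}, 
\quad z \geq 0. 
\end{equation*}
Second, writing $G = \{ (x,x') \mapsto h(x) - h(x') : h \in H \}$ and 
$|G| = \{ |g| : g \in G \}$, one way the constant $8t$ can arise is 
\begin{equation*}
\mathcal{R}_m(F) 
\leq 2t \cdot \mathcal{R}_m(|G|) 
\leq 2t \cdot 2 \cdot \mathcal{R}_m(G) 
\leq 2t \cdot 2 \cdot 2 \cdot \mathcal{R}_m(H) 
= 8t \cdot \mathcal{R}_m(H), 
\end{equation*}
where the first step uses $\tau_{\beta}^t(0) = 0$ (since $\beta > 0$), 
the second uses that the absolute value is $1$-Lipschitz, 
and the third bounds the class of differences by the sum of two copies of $H$. 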

Based on the above, we can prove that the proposed AMF learner 
has a generalization guarantee under the assumption that 
instances are sampled i.i.d. The result is as follows. 
\begin{theorem}
\label{thm:generalization}
Fix any $\alpha, \beta, t > 0$. 
Suppose $\mathcal{R}_m(H) \in O(1/\sqrt{m})$. 
Any model $h \in H$ returned by the AMF learner 
satisfies $\Delta_{\alpha,\beta + 1/t}(h) 
\leq \varepsilon$ with probability 
at least $1 - \delta$ if 
$m \geq \frac{1}{\varepsilon^2}
\left( 16 t c + \sqrt{\frac{1}{2}
\log \frac{1}{\delta}}\right)^2$, 
where $m$ is the number of 
$(x,x') \in S$ satisfying 
$d(x,x') \leq \alpha$ 
and $c$ is the constant hidden 
in $O(1/\sqrt{m})$, i.e., $\mathcal{R}_m(H) \leq c/\sqrt{m}$. 
\end{theorem}


\begin{proof}[Proof Sketch] 
The main challenge in our analysis 
is the extra $d(x,x') \leq \alpha$ term, 
which cannot be directly removed using 
the Rademacher complexity property 
as in Lemma \ref{lem:tool_generalization}. 
To tackle this, we introduce 
$V = \{(x,x') \in S : d(x,x') \leq \alpha \}$. 

We first transform the analysis of the 
joint event $|h(x)-h(x')| > \beta$ and $d(x,x') 
\leq \alpha$ into an analysis of the single event 
$|h(x)-h(x')| > \beta$ by narrowing the 
domain to $V$. 
Then, we derive a generalization bound 
for the single event by first 
relaxing its indicator function to the 
piecewise function defined in Lemma 
\ref{lem:tool_generalization}, then 
applying the standard generalization argument with $\mathcal{R}_m(F)$ (e.g., \cite{mohri2018foundations}), 
and finally 
connecting $\mathcal{R}_m(F)$ to $\mathcal{R}_m(H)$ using Lemma \ref{lem:tool_generalization}. 
Lastly, we transform the result for the 
single event back into a result for the joint 
event, which completes the proof. See 
the supplementary material for details. 
\end{proof}
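
The chain of inequalities behind this sketch can be written out as follows. 
This is an illustrative outline under the theorem's assumptions 
(the pairs in $V$ are treated as an i.i.d. sample of size $m$, and 
$\mathcal{R}_m(H) \leq c/\sqrt{m}$); the precise argument is in the supplementary material. 
For the returned $h$ and $f(x,x') = \tau_{\beta}^t(|h(x)-h(x')|)$, 
with probability at least $1 - \delta$, 
\begin{align*}
\Pr\{ |h(x)-h(x')| > \beta + 1/t \mid d(x,x') \leq \alpha \} 
&\leq \mathbb{E}[ f(x,x') \mid d(x,x') \leq \alpha ] \\
&\leq \frac{1}{m} \sum_{(x,x') \in V} f(x,x') 
+ 2 \mathcal{R}_m(F) + \sqrt{\frac{\log (1/\delta)}{2m}} \\
&\leq 0 + \frac{16 t c}{\sqrt{m}} + \sqrt{\frac{\log (1/\delta)}{2m}}. 
\end{align*}
The empirical term vanishes because $f(x,x') \leq \mathbb{I}\{ |h(x)-h(x')| > \beta \}$ 
and the constraint $\Delta_{\alpha,\beta}(h;S) = 0$ rules out such pairs in $V$. 
Since the joint probability $\Delta_{\alpha,\beta+1/t}(h)$ is at most the conditional 
probability on the left, requiring the right-hand side to be at most $\varepsilon$ 
yields a condition on $m$ of order $1/\varepsilon^2$. 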

Theorem \ref{thm:generalization} implies that one can 
achieve $(\alpha,\beta + 1/t,\varepsilon)$-AMF with 
$O(\frac{1}{\varepsilon^2})$ randomly sampled 
labeled instances, which is consistent 
with the sample complexity in \cite{yona2018probably}. 
The constant $c$ depends on the hypothesis class. 
For example, if $H$ is the set of linear models with 
proper constraints, we can set $c$ to the 
maximum norm of the instances \cite{shalev2014understanding}; 
if $H$ is the set of kernel machines with 
proper constraints, we can set $c$ to the 
product of the kernel function bound and the Gram 
matrix trace \cite{mohri2018foundations}.  

In the theorem, variable $t$ is the slope 
of the Lipschitz function introduced to approximate 
the indicator function. Its impact on 
the bound is twofold. 
A smaller $t$ leads to a weaker 
fairness guarantee, in the sense that 
$\Delta_{\alpha,\beta+1/t}(h) \leq \varepsilon$ 
implies $\Delta_{\alpha,\beta+1/t'}(h) \leq \varepsilon$ 
whenever $t' \leq t$. But it also leads to 
higher sample efficiency, in the sense that 
a smaller $t$ implies that a smaller $m$ suffices 
for the generalization guarantee.
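
The effect of $t$ on the fairness guarantee follows from a simple inclusion of events, 
spelled out here for illustration. 
For $0 < t' \leq t$ we have $\beta + 1/t' \geq \beta + 1/t$, so 
\begin{equation*}
\{ |h(x)-h(x')| > \beta + 1/t' \} 
\subseteq 
\{ |h(x)-h(x')| > \beta + 1/t \} 
\quad \Longrightarrow \quad 
\Delta_{\alpha,\beta+1/t'}(h) \leq \Delta_{\alpha,\beta+1/t}(h). 
\end{equation*}
Hence a bound on $\Delta$ obtained with slope $t$ carries over to any smaller 
slope $t'$, but not conversely. 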


\begin{algorithm}[t!]
\caption{Active AMF Learning}
\begin{algorithmic}[1]
\renewcommand{\algorithmicrequire}{\textbf{Input:}}
\renewcommand{\algorithmicensure}{\textbf{Output:}}
\renewcommand{\algorithmicloop}{\textbf{Loop:}}
\REQUIRE 
an initial labeled training set $L$, an unlabeled set $U$, 
a hypothesis class $H$, and the number $k$ of instances to query per round.   
\WHILE{stopping criterion is not met} 
\STATE Learn a model $h \in H$ based on sample $L$ 
using the AMF learner in Definition \ref{def_AMFlearner}. 
\STATE Pick an i.i.d. sample of $k$ instances $u \in U$ satisfying 
\begin{equation}
\label{eq:query}
\exists u' \in L,\ d(u,u') \leq \alpha,\ 
|h(u) - h(u')| > \beta.     
\end{equation}

\STATE Label the selected instances. Then 
add them to sample $L$, and remove them from sample $U$. 
\ENDWHILE
\ENSURE model $h$. 
\end{algorithmic} 
\label{alg:optPAMFL2}
\end{algorithm}


