\section{Experiment}

We experiment on three real-world data sets. 
The Insurance data set~\cite{data_insurance}
contains individual medical costs billed by a health 
insurance company, and the task is to predict the cost 
based on other attributes. The Life data set~\cite{data_life} 
contains the life expectancy in different countries, 
and the task is to predict the expectancy. 
We also use a data set collected from public 
resources. It contains the COVID death rates 
of 3142 counties in the United States, and the task 
is to predict the rate based on other attributes, 
including population density, obesity rate, 
smoking rate, diabetes rate, elderly population 
and vaccination rate. For more data sets used 
to evaluate algorithmic fairness, we refer 
interested readers to \cite{le2022survey}. 

We encode categorical features as dummy
variables, address missing data using mean 
imputation and standardize all features. 
For better numerical stability, we re-scale 
the labels: on the Insurance data set, we 
divide the medical cost, which ranges from 
4k to 40k, by 10k; on the Life data set, we 
divide the life expectancy, which ranges 
from 40 to 90, by 100; on the COVID data set, 
we multiply the death rate, which ranges from 
0 to 0.01, by 100. 
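
The preprocessing above can be sketched as follows. This is a minimal illustration rather than the exact pipeline used in the experiments: the function names and the dictionary \texttt{LABEL\_SCALE} are hypothetical, missing values are assumed to be encoded as NaN, and dummy encoding of categorical features is assumed to have been applied beforehand.

```python
import numpy as np

def preprocess_features(X):
    """Mean-impute missing entries (NaN) and standardize each column."""
    X = np.asarray(X, dtype=float).copy()
    col_mean = np.nanmean(X, axis=0)        # per-feature mean over observed entries
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_mean[cols]          # mean imputation
    std = X.std(axis=0)
    std[std == 0] = 1.0                     # guard against constant columns (assumption)
    return (X - X.mean(axis=0)) / std

# Multiplicative label scaling factors taken from the text:
# divide by 10k (Insurance), divide by 100 (Life), multiply by 100 (COVID).
LABEL_SCALE = {"insurance": 1.0 / 10_000, "life": 1.0 / 100, "covid": 100.0}

def rescale_labels(y, dataset):
    """Re-scale labels so that all three tasks have targets of order one."""
    return np.asarray(y, dtype=float) * LABEL_SCALE[dataset]
```

With these factors, the re-scaled labels range roughly from 0.4 to 4 on Insurance, from 0.4 to 0.9 on Life and from 0 to 1 on COVID.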

We randomly split each data set into 
an initial training set (assumed labeled), 
an unlabeled set (for query) and a testing set. 
The size of the initial training set is chosen 
as follows: for the linear base model, it is 
the number of features on the Insurance and Life 
data sets, and twice that number on the COVID 
data set; for the rff base model, it is half 
of the number of random features. 
The size of the testing set is 25\% of 
the total data size. The remaining data are 
treated as unlabeled. 
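
The splitting rule above can be sketched as follows. The function names are hypothetical, and rounding the testing set size to the nearest integer is an assumption made for this illustration.

```python
import numpy as np

def initial_size(base_model, dataset, n_features, n_random_features=None):
    """Initial training set size following the rule stated in the text."""
    if base_model == "linear":
        return 2 * n_features if dataset == "covid" else n_features
    return n_random_features // 2           # rff base model

def split_indices(n_total, n_init, test_frac=0.25, seed=0):
    """Randomly split indices into initial training, unlabeled pool and testing sets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_total)
    n_test = int(round(test_frac * n_total))  # assumption: round to nearest integer
    test_idx = perm[:n_test]
    init_idx = perm[n_test:n_test + n_init]
    pool_idx = perm[n_test + n_init:]         # everything else is treated as unlabeled
    return init_idx, pool_idx, test_idx
```

Using a different \texttt{seed} for each trial gives an independent random split per trial.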

On each data set, we run an active learner 
for 20 random trials and report the average 
model performance on the testing sets. 
Model bias is measured by 
$\Delta_{\alpha,\beta}(h; S_n)$ defined 
in (\ref{eq:empdelta}), with $(\alpha,\beta)$ 
set to (2, 0.1) on Insurance, (10, 0.2) on 
Life and (1.5, 0.001) on COVID. We also 
experimented with other fairness coefficients 
and observed similar comparative performance. 
Model error is measured by the root-mean-squared error (RMSE). 
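
For concreteness, the standard form of the RMSE of a model $h$ on a testing set is recalled below. The symbols $n_{\mathrm{test}}$, $x_i$ and $y_i$ (the number of testing instances, the $i$-th testing instance and its label) are notation introduced only for this illustration.
\[
\mathrm{RMSE}(h) = \sqrt{\frac{1}{n_{\mathrm{test}}} \sum_{i=1}^{n_{\mathrm{test}}} \bigl( h(x_i) - y_i \bigr)^2 }.
\]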

We evaluate the proposed active labeling 
strategy on the linear base model and the 
rff base model respectively, and compare 
its performance with the following four 
strategies. 

-- \textit{Random}: It randomly selects instances to label. 

-- \textit{Query-by-Committee (QBC)}: 
It labels instances which receive the 
largest prediction variance from a committee 
of models. Following \cite{burbidge2007active}, 
we construct a committee of five models and 
train each one using a bootstrap sample of the 
training data, with sample size equal to the 
training set size divided by the committee size. 

-- \textit{Uncertainty}: It labels instances which are 
most different from the training data in both 
feature space and label space \cite{wu2019active}. 
To our knowledge, this is a state-of-the-art 
active labeling method for regression models. 

-- \textit{Cluster}: It is a clustering-based 
baseline method that relies on the distance 
between instances. It first identifies
the top $m$ uncertain instances in the candidate 
pool using the above method, then runs k-means 
clustering to identify their $k$ centers, and 
finally labels the identified instances. 
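
As an illustration of the QBC baseline described above, the following sketch selects the pool instance with the largest prediction variance across a bootstrap committee. It is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function name \texttt{qbc\_select} is hypothetical, each committee member is assumed to be an ordinary least-squares model without intercept, and one instance is selected per call.

```python
import numpy as np

def qbc_select(X_train, y_train, X_pool, n_members=5, seed=0):
    """Return the index (into X_pool) with the largest committee prediction variance."""
    rng = np.random.default_rng(seed)
    n = X_train.shape[0]
    size = max(1, n // n_members)             # bootstrap size = training size / committee size
    preds = []
    for _ in range(n_members):
        idx = rng.integers(0, n, size=size)   # sample with replacement
        # Assumption: each member is a least-squares fit; lstsq returns the
        # minimum-norm solution when the bootstrap sample is under-determined.
        w, *_ = np.linalg.lstsq(X_train[idx], y_train[idx], rcond=None)
        preds.append(X_pool @ w)
    variance = np.var(np.stack(preds), axis=0)  # variance across members, per pool instance
    return int(np.argmax(variance))
```

In an active learning loop, the returned instance would be labeled, moved from the pool to the training set, and the committee rebuilt before the next query.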

For the metric-fair learner, we pick the 
regularization coefficient $\lambda$ 
that strikes a good balance between fairness 
and accuracy. For the linear base model, 
$\lambda$ is set to 1 on Insurance and Life and 0.1 
on COVID; for the rff base model, $\lambda$ is set to 1 
on Insurance, 5 on Life and 0.5 on COVID. 

For the rff base model, we generate random 
features that approximate the Gaussian kernel 
\cite{rahimi2007random}. The number of random features 
is set to 100 on Insurance, 400 on Life and 200 
on COVID. The gamma coefficient is set to 1e-4 on 
Insurance, 1e-9 on Life and 1e-2 on COVID. In practice, 
we observe that these configurations lead to good and 
stable performance of active metric-fair learning. 
For the clustering-based baseline method, we 
set $m = 10$ and $k = 3$ as they give consistently 
good performance (except that $k$ is set to $10$ for the 
linear model on the Life data set). 
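
The random feature construction can be sketched as follows. This is a minimal sketch rather than the code used in the experiments: the function name \texttt{make\_rff} is hypothetical, and the gamma coefficient is assumed to parameterize the Gaussian kernel as $k(x, x') = \exp(-\gamma \|x - x'\|^2)$, which may differ from the convention actually used.

```python
import numpy as np

def make_rff(n_inputs, n_components, gamma, seed=0):
    """Build a random Fourier feature map z with z(x) . z(x') ~ exp(-gamma * ||x - x'||^2)."""
    rng = np.random.default_rng(seed)
    # Frequencies are drawn from N(0, 2 * gamma * I), phases uniformly from [0, 2 * pi].
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(n_inputs, n_components))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_components)

    def transform(X):
        X = np.asarray(X, dtype=float)
        return np.sqrt(2.0 / n_components) * np.cos(X @ W + b)

    return transform
```

For example, \texttt{make\_rff(d, 100, 1e-4)} would correspond to the Insurance configuration above for inputs with \texttt{d} features; a linear model trained on the transformed features then plays the role of the rff base model.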


\subsection{Results and Discussions}

Results of the experimented strategies on both 
base models across three data sets are 
shown in Figure \ref{fig:expresults}. 

In Figure \ref{fig:expresults} (a-f), 
we see that the proposed active AMF learner 
reduces model bias more efficiently than the other 
learners, which empirically verifies its 
sample efficiency. We notice that 
it achieves almost zero bias in all cases, 
supporting our assumption on the realizable case. 
(Note that this is not achieved at the cost of 
significantly deteriorating accuracy, as explained 
in the next paragraph.)  
There seems to be no consistent pattern in the 
efficiency of the other learners. We notice that 
QBC and Uncertainty are often less efficient 
than Random, implying the importance of 
(efficiently) achieving individual fairness 
by design, as presented in this study. 

In Figure \ref{fig:expresults} (g-l), it is 
not surprising to see that uncertainty-based 
labeling reduces error faster than the other strategies.
Comparatively, the proposed active AMF learner 
manages to achieve a comparable reduction rate,  
suggesting that it has an efficient fairness-accuracy 
trade-off. 

We also perform sensitivity analysis on the 
proposed strategy and present results in 
Figure \ref{fig:sensitivity}. 
Figures \ref{fig:sensitivity} (a-b) 
show the performance versus the 
regularization coefficient $\lambda$. We 
see that both training and testing $\delta$ decrease 
as $\lambda$ increases. This suggests that the
metric-fair learner can effectively reduce 
bias and that the reduction generalizes, which 
supports Theorem \ref{thm:generalization}.  
We also see that model error first decreases and 
then increases, exhibiting an overfitting 
phenomenon.

Figures \ref{fig:sensitivity} (c-d) 
show the performance versus 
different choices of $(\alpha,\beta)$ 
when selecting instances in Step 3 
of Algorithm 1. (All $\delta$'s 
are, however, evaluated based on the same 
$(\alpha,\beta)$ for fair comparison.)
We see that using a smaller $\alpha$ to 
select instances leads to faster 
convergence of $\delta$ but slower 
convergence of RMSE. There seems to be no 
clear pattern in the impact of $\beta$. 
Overall, we see that one can balance the fairness 
and accuracy of the proposed strategy by 
adjusting $\alpha$. 


\begin{figure*}[h!]
     \centering
     \begin{subfigure}{.33\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figure/linear_insurance.png}
         \caption{Bias of linear model on Insurance}
     \end{subfigure}
     \begin{subfigure}{.33\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figure/linear_life.png}
         \caption{Bias of linear model on Life}
     \end{subfigure}
     \begin{subfigure}{.33\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figure/linear_covid.png}
         \caption{Bias of linear model on COVID}
     \end{subfigure}
     \\
     \begin{subfigure}{.33\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figure/rff_insurance.png}
         \caption{Bias of rff model on Insurance}
     \end{subfigure}
     \begin{subfigure}{.33\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figure/rff_life.png}
         \caption{Bias of rff model on Life}
     \end{subfigure}
     \begin{subfigure}{.33\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figure/rff_covid.png}
         \caption{Bias of rff model on COVID}
     \end{subfigure}     
     \\
     \begin{subfigure}{.33\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figure/linear_insurance_err.png}
         \caption{Error of linear model on Insurance}
     \end{subfigure}
     \begin{subfigure}{.33\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figure/linear_life_err.png}
         \caption{Error of linear model on Life}
     \end{subfigure}
     \begin{subfigure}{.33\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figure/linear_covid_err.png}
         \caption{Error of linear model on COVID}
     \end{subfigure}
     \\
     \centering
     \begin{subfigure}{.33\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figure/rff_insurance_err.png}
         \caption{Error of rff model on Insurance}
     \end{subfigure}
     \begin{subfigure}{.33\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figure/rff_life_err.png}
         \caption{Error of rff model on Life}
     \end{subfigure}
     \begin{subfigure}{.33\textwidth}
         \centering
         \includegraphics[width=\textwidth]{figure/rff_covid_err.png}
         \caption{Error of rff model on COVID}
     \end{subfigure}
     \caption{Performance of Different Active 
     Labeling Strategies on Three Data Sets. 
     (a-c) and (d-f) show the bias of the linear 
     and rff base models respectively; 
     (g-i) and (j-l) show the RMSE of the linear 
     and rff base models respectively.}
    \label{fig:expresults}
\end{figure*}




