\newpage
\appendix
\section{Model descriptions}
\label{subsec: appendix_1}

As we describe in section \ref{sec:methodology}, the latent representation learning method proposed in the original paper achieves certified individual fairness by training the deep learning model such that a specific sensitive attribute is ignored in the classification process. This is done by obtaining images of individuals who are identical except for the attribute for which individual fairness needs to be certified. If the model treats these similar individuals identically during training, the classification is based not on the sensitive attribute but only on the other attributes.

\subsubsection{GLOW} 

To obtain images of faces that only differ in a certain sensitive attribute, Peychev et al. \cite{peychev2022latent} use the generative GLOW model \cite{kingmaglow}. The GLOW model is able to alter the input data in the latent space along a chosen attribute vector. Examples of the output of the GLOW model for the sensitive attributes used in the original paper can be seen in figure \ref{fig:glow_original_output}. Examples of the GLOW output that we generated for the additional experiments are shown in figure \ref{fig:glow_input}. The GLOW model used in this reproducibility study is pre-trained and could be deployed directly.

\begin{figure}[H]
    \centering
    \includegraphics[width=0.95\textwidth]{figures/attributespeychev.png}
    \caption{Visualizations of the input generated by the GLOW model for a face in the CelebA dataset. The attributes in these figures are the sensitive attributes and their combinations used by the authors in the original report \cite{peychev2022latent}.}
    \label{fig:glow_original_output}
\end{figure}

\subsubsection{LASSI} By treating all the variations of an image generated by the GLOW model in the same manner, the LASSI model learns a fair representation for a specific task. The model thereby learns to discriminate only based on features other than the sensitive attribute; the GLOW model attempts to alter these other features as little as possible. This adversarial training is performed by uniformly sampling points along the sensitive attribute vector, up to a maximum perturbation level, and training the model on the resulting images to ensure fair treatment. \newline
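
The sampling step can be sketched as follows, assuming the notation $z$ for the GLOW latent code of a face, $a$ for the attribute vector of the sensitive attribute and $\epsilon$ for the maximum perturbation level. These symbols, and the symmetric interval around zero, are our assumptions for illustration; the exact formulation is given in \cite{peychev2022latent}.
\[
\tilde{z} = z + t\,a, \qquad t \sim \mathcal{U}(-\epsilon, \epsilon).
\]
Each sampled $\tilde{z}$ decodes to a variation of the same face, and training encourages the model to treat $z$ and $\tilde{z}$ alike.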

% By treating all the images generated by the GLOW model the same as the original individual, the LASSI model learns a fair classification of a specific task. This ensures that the model learns to discriminate only based on all the other attributes, because the GLOW model attempts to alter those attributes as little as possible based on the perturbation level. This adversarial training is performed by uniformly selecting points on the sensitive attributes vector with a maximum perturbation level, and train the model based on these images, to ensure fair treatment. \newline

To balance fairness, accuracy, and the ability to transfer to unknown downstream tasks, a suitable weighting of the different losses has to be found. The adversarial loss and the loss of a reconstruction network from the representation to the latent space are added to the classification loss of the auxiliary classifier, which results from classifying the original face. The overall training loss is therefore a weighted combination of the three losses, with the hyperparameters $\lambda_1$, $\lambda_2$ and $\lambda_3$ serving as the weights.

\subsubsection{Naive}

To compare the performance of LASSI to a baseline, the original authors also trained a fairness-unaware baseline model, denoted as the naive model. For this naive model, the representation is learned with the loss of adversarial training and the loss of the reconstruction network turned off ($\lambda_1 = \lambda_2 = 0$), such that it only learns from the classification loss of the original, unaltered faces.
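
Written out as a sketch, and assuming the symbols $\mathcal{L}_{\mathrm{adv}}$, $\mathcal{L}_{\mathrm{recon}}$ and $\mathcal{L}_{\mathrm{cls}}$ for the adversarial, reconstruction and classification losses, and that $\lambda_3$ weights the classification loss (both the notation and this assignment are our assumptions; the exact formulation is given in \cite{peychev2022latent}), the LASSI training loss reads
\[
\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{adv}} + \lambda_2 \mathcal{L}_{\mathrm{recon}} + \lambda_3 \mathcal{L}_{\mathrm{cls}} .
\]
Setting $\lambda_1 = \lambda_2 = 0$ leaves only the classification term, which corresponds to the naive model.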

\section{Datasets}
\label{subsec: appendix_2}

From the original study we selected two datasets to use in this reproducibility study. In table \ref{tab:dataset_table} we present a short overview of both datasets.

% Dataset tables

\begin{table}[H]
\centering
\begin{tabularx}{\linewidth}{l l l X}
\toprule
\textbf{Dataset} & \textbf{Size} & \textbf{Features} &  \textbf{Description} \\
\midrule
CelebA \cite{liu2015deep} & 202,599 & 40 & \textit{A large-scale face dataset consisting of celebrity images annotated on 40 attributes. The images cover large pose variations and background clutter.} \\
\midrule
FairFace \cite{karkkainen2021fairface} & 97,698 & 3 & \textit{A large-scale face image dataset, annotated on three attributes and balanced on the race attribute.} \\
\bottomrule
\end{tabularx}
\caption{\label{tab:dataset_table} Overview of the datasets used in this reproducibility study.}
\end{table}

\subsubsection{CelebA} 

The CelebA dataset \cite{liu2015deep} is a large-scale dataset containing over 200 thousand images of real-world celebrities. Each image is stored as a JPG file and is annotated with 40 features such as Pale\_Skin, Big\_Lips, Smiling and Wearing\_Hat. The full list can be found on the dedicated website.\footnote{CelebA dataset: \url{https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html}} Each attribute is annotated with a score of 1 when the attribute is present in the image, and -1 otherwise. The attractiveness score is determined by human input.


% The celebA dataset contains of 202,599 images that are saved as jpg files. Each image has 40 features : 5\_o\_Clock\_Shadow , Arched\_Eyebrows, Attractive , Bags\_Under\_Eyes, Bald, Bangs, Big\_Lips, Big\_Nose, Black\_Hair, Blond\_Hair, Blurry, Brown\_Hair, Bushy\_Eyebrows, Chubby, Double\_Chin, Eyeglasses, Goatee, Gray\_Hair, Heavy\_Makeup, High\_Cheekbones, Male, Mouth\_Slightly\_Open, Mustache, Narrow\_Eyes, No\_Beard, Oval\_Face, Pale\_Skin, Pointy\_Nose, Receding\_Hairline, Rosy\_Cheeks, Sideburns , Smiling, Straight\_Hair, Wavy\_Hair, Wearing\_Earrings, Wearing\_Hat, Wearing\_Lipstick, Wearing\_Necklace ,Wearing\_Necktie and Young. And if the feature is present in the image, then it will have a 1 score and a -1 score otherwise. The attractiveness score is determined by human input.\newline

\subsubsection{FairFace}

The FairFace dataset \cite{karkkainen2021fairface} was created to mitigate the race bias problem in most public face image datasets. It contains close to 100 thousand images, balanced on the race attribute.\footnote{FairFace dataset: \url{https://github.com/joojs/fairface}} The images are stored as JPG files and annotated with four features: age, gender, race and service\_test, of which only the first three are used. The possible values of these annotated features are summarized in table \ref{tab:fairface_features}.

\begin{table}[H]
\centering
\resizebox{\columnwidth}{!}{\begin{tabular}{ll}
\toprule
\textbf{Feature} & \textbf{Values} \\
\midrule
Age & [0-2], [3-9], [10-19], [20-29], [30-39], [40-49], [50-59], [60-69] or [70+] \\
Gender & Male, Female \\
Race & White, Black, Indian, East Asian, Southeast Asian, Middle Eastern or Latino \\
\bottomrule
\end{tabular}}
\caption{\label{tab:fairface_features} Summary of the possible feature values in the FairFace dataset.}
\end{table}

\section{Outlier study} \label{sec:appendix_outliers}

From the additional experiments presented in Section \ref{sec:reproduced_results} we find a significant drop in the fairness score achieved by the LASSI model for the sensitive attributes \texttt{Chubby} (5.6\%) and \texttt{Bald} (10.9\%) on the task \texttt{Attractive}. To examine these outliers we look at the possible correlation between these sensitive attributes and the task; in addition, we visualize the input generated by the GLOW model for these attributes. \newline

\subsubsection{Correlation} 

In table \ref{tab:percentages} we present, for each sensitive attribute, the percentage of faces annotated with that attribute that are also annotated positively for a task. For example, 55.5\% of faces annotated as chubby are smiling, but only 3.3\% of faces annotated as chubby are tagged as attractive in the CelebA dataset. \newline

At first sight, a notable number is that only 8.2\% of faces with \textit{Narrow\_Eyes} are wearing a hat. However, only 4.8\% of all people in the dataset are tagged as \textit{Wearing\_Hat}, indicating that this `outlier' follows the general trend of few people wearing a hat. \newline

To account for this, in the `Ratio' columns of table \ref{tab:percentages} we present the ratio between this percentage and the percentage of people in the whole dataset annotated with the task. For example, 49.4\% of people with \textit{Big\_Lips} are annotated as smiling, which follows the trend of the whole dataset, in which 48.2\% of people are smiling. The ratio is then calculated as 49.4 divided by 48.2, yielding 1.02. A ratio close to 1 therefore indicates a trend similar to that of the full dataset.
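
As a worked form of this calculation, write $p_{\mathrm{attr}}$ for the percentage of faces with the sensitive attribute that are annotated with the task, and $p_{\mathrm{all}}$ for the percentage over the whole dataset (both symbols are introduced here for illustration only). The tabulated values appear consistent with dividing the larger percentage by the smaller one, so that the ratio is never below 1:
\[
r = \frac{\max(p_{\mathrm{attr}},\, p_{\mathrm{all}})}{\min(p_{\mathrm{attr}},\, p_{\mathrm{all}})} .
\]
For \texttt{Big\_Lips} and \texttt{Smiling} this gives $49.4 / 48.2 \approx 1.02$; for \texttt{Chubby} and \texttt{Attractive} the larger value is the dataset-wide one, giving $51.3 / 3.3 \approx 15.55$.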

\begin{table}[h]
\small
\centering
\begin{tabular}{lcccccccc}
\toprule
Task: & \multicolumn{2}{c}{\texttt{Smiling}} & \multicolumn{2}{c}{\texttt{Wearing\_Hat}} & \multicolumn{2}{c}{\texttt{Attractive}} & \multicolumn{2}{c}{\texttt{Necklace}} \\
Sensitive attrib. & \% & Ratio & \% & Ratio & \% & Ratio & \% & Ratio \\
\cmidrule(rl){1-1} \cmidrule(rl){2-3} \cmidrule(rl){4-5} \cmidrule(rl){6-7} \cmidrule(rl){8-9}
\texttt{Chubby} & 55.5 & \colorbox{mygreen}{1.15} & 19.9 & \colorbox{mygreen}{4.15} & 3.3 & \colorbox{myred}{15.55} & 11.6 & \colorbox{mygreen}{1.06} \\
\texttt{Bald} & 51.3 & \colorbox{mygreen}{1.06} & 1.0 & \colorbox{mygreen}{4.80} & 3.1 & \colorbox{myred}{16.55} & 2.7 & \colorbox{mygreen}{4.56} \\
\texttt{Big\_Lips} & 49.4 & \colorbox{mygreen}{1.02} & 8.7 & \colorbox{mygreen}{1.81} & 56.8 & \colorbox{mygreen}{1.11} & 42.0 & \colorbox{mygreen}{3.15} \\
\texttt{Narrow\_Eyes} & 59.1 & \colorbox{mygreen}{1.23} & 8.2 & \colorbox{mygreen}{1.71} & 41.0 & \colorbox{mygreen}{1.25} & 30.2 & \colorbox{mygreen}{2.46} \\
\cmidrule(rl){1-1} \cmidrule(rl){2-3} \cmidrule(rl){4-5} \cmidrule(rl){6-7} \cmidrule(rl){8-9}
\textbf{Total:} & \multicolumn{2}{c}{48.2\%} & \multicolumn{2}{c}{4.8\%} & \multicolumn{2}{c}{51.3\%} & \multicolumn{2}{c}{12.3\%} \\
\bottomrule
\end{tabular}
\caption{\label{tab:percentages} The calculated percentages and ratios for the attributes and corresponding tasks presented in table \ref{tab:new_results}. Highlighted in green are the lower ratios, which follow trends similar to the full dataset; highlighted in red are the higher ratios, which follow a different trend.}
\end{table}

Two entries in this table stand out: \textit{Chubby} and \textit{Bald} on the attractiveness task, which are also the two attribute-task combinations on which the LASSI model could not achieve high individual fairness (see table \ref{tab:new_results}). The ratios here are 15.55 and 16.55 respectively, which means that the share of faces tagged as attractive in the full dataset is more than 15 times the share among chubby faces and among bald faces. \newline

These results indicate that a face being chubby or bald has too much influence on being tagged as attractive in this dataset, preventing LASSI from achieving high certified individual fairness while maintaining high accuracy.

\subsubsection{Causation}

A possible explanation is that when the LASSI model is trained on these tasks and attribute perturbations, the latent representations of faces that differ in the specific attribute are likely to diverge to a significant extent. This divergence serves prediction accuracy: given the strong correlation described above, faces that differ in the attribute tend to differ in the task label as well. \newline
    
This leads to low fairness as the perturbation results in vastly divergent data representations. In contrast, attributes that are ethically neutral, such as smiling, do not pose a concern in this regard. From an ethical perspective however, examples such as perturbations in chubbiness should not affect predictions of attractiveness. \newline

We conclude that our experiments with LASSI produced poor individual fairness under certain settings, due to the highly biased relation between some attributes and tasks. \newline

\subsubsection{Corrupted input}

Another possible explanation for the low individual fairness achieved by LASSI under certain settings is corrupted input generated by the GLOW model. To investigate this, we select a random sample of faces, visualize the input generated by GLOW and calculate the certified individual fairness scores achieved by the LASSI model. The code for this is available in the project GitHub repository.\footnote{Reproducibility study GitHub page: \url{https://mametchiii.github.io/lassi-reproducibility/}} \newline

Similar to the visualizations we present in section \ref{subsec:vis}, we find that LASSI also achieves 0\% fairness scores for this random sample of faces varied on the \texttt{Bald} sensitive attribute. Figure \ref{fig:bald_vis} shows that the faces are altered not only in their baldness but also in various other facial features, which likely impacts the classification task at hand.

\begin{figure}[H]
\centering
    \includegraphics[scale=0.85]{figures/baldfairness.png}
    \caption{Random sample of faces generated by the GLOW model, varied on the sensitive attribute baldness. The individual fairness achieved by LASSI is 0\% for all faces in this random sample, likely caused by the highly altered input data.}
    \label{fig:bald_vis}
\end{figure}

We conclude that the two settings in which LASSI achieves a low certified individual fairness do not compromise the robustness of LASSI, but are likely caused by highly correlated data and corrupted input data generated by the GLOW model.

% The lack of variance in these features may have also led the GLOW model to generate non-sensible face inputs for LASSI, contributing to the lack of fairness in certain cases (as documented in figure \ref{fig:bald_vis}).\\
% \\
%\texttt{STILL THINKING ABOUT THIS PART} The third and fourth highest correlated combinations of sensitive attribute and task are respectively \textit{Bald} and \textit{Wearing\_Hat}, and \textit{Bald} and \textit{Wearing\_Necklace}. With ratios of \textit{1:4.8} and \textit{1:4.6}, the obtained fairness is respectively 99.8\% and 98.7\%. From these results it seems that the performance of LASSI drops fast when it exceeds a discrepancy in the ratio somewhere between \textit{1:4.8} and \textit{1:15.5}. Further research has to be conducted to find the estimated ratio for which LASSI fails to achieve fairness. 

% As shown in \ref{sec:reproduced_results} a significant drop in LASSI's fairness score is observed for the sensitive attributes \textit{Chubby} (5.6\%) and \textit{Bald} (10.9\%) on the task \textit{Attractive}. As shown in table \ref{tab:percentages} these are scenarios where the sensitive attributes are highly correlated with the predicted task.  51.3\% of the people are classified as attractive while only 3.3\% and 3.1\% of respectively the chubby and bald faces are classified as attractive. This means that the ratio between all faces and the chubby and bald faces when labeled as attractive is respectively 1:16.5 and 1:15.5. From these results it seems that being bald or being chubby has too much influence on being attractive for LASSI to achieve high fairness while maintaining high accuracy. In figure \ref{fig:chubby_attractive} the percentage of \textit{Chubby} people are shown when containing a certain attribute.\\

% \begin{figure}[H]
%     \begin{center}
%     \hspace*{0cm}
%     \includegraphics[scale = 0.6]{figures/BarChart_attractiveness.png}
%     \caption{The percentage of faces being classified as attractive from the group of faces containing a certain attribute}
%     \label{fig:chubby_attractive}
%     \end{center}
% \end{figure}


% A possible explanation for this effect is that if the LASSI model is trained on these tasks and attribute perturbations, the latent representations of faces that differ in the attribute are likely to diverge to a significant extent to ensure a higher prediction accuracy, because the faces that differ in the attribute must also differ in the class of the task, given the above-mentioned strong correlation. This leads to low fairness as the perturbation results in vastly divergent data representations. In contrast, attributes that are ethically neutral, such as smiling, do not pose a concern in this regard. However, examples such as perturbations in chubbiness should not affect predictions of attractiveness from an ethical perspective. Our experiment with LASSI probably produced poor fairness due to the highly biased relation between chubbiness and attractiveness in the training dataset (only 3\% of chubby individuals were tagged as attractive). The lack of variance in these features may have also led the GLOW model to generate non-sensible face inputs for LASSI, contributing to the lack of fairness in certain cases (as documented in figure \ref{fig:bald_vis}).

% \begin{figure}[H]
% \centering
%     \includegraphics[scale=0.7]{figures/baldfairness.png}
%     \caption{Visualisations of the variations that are generated by the GLOW model with the bald attribute.}
%     \label{fig:bald_vis}
% \end{figure}

% Nevertheless, as these issues are rooted in the defects of the CelebA dataset, further evidence is required if one wants to negate the claims made by the authors of \cite{peychev2022latent}. Therefore, we suggest that a more balanced dataset with more variations in the relation of the sensitive attributes and the tasks can be collected and implemented in future research on LASSI.

\section{Visualizations of additional experiments}\label{sec:extra_vis}

\begin{figure}[H]
\centering
\begin{subfigure}{\textwidth}
  \centering
  \vspace{2mm}
  \includegraphics[width=.95\linewidth]{figures/biglips.png}
  \caption{\texttt{Big\_Lips}}
  \label{fig:big_lips_apdx}
\end{subfigure}
\begin{subfigure}{\textwidth}
  \centering
  \vspace{2mm}
  \includegraphics[width=.95\linewidth]{figures/narrow_eyes.png}
  \caption{\texttt{Narrow\_Eyes}}
  \label{fig:narrow_eyes_apdx}
\end{subfigure}
\begin{subfigure}{\textwidth}
  \centering
  \vspace{2mm}
  \includegraphics[width=.95\linewidth]{figures/chubby.png}
  \caption{\texttt{Chubby}}
  \label{fig:chubby_apdx}
\end{subfigure}
\begin{subfigure}{\textwidth}
  \centering
  \vspace{2mm}
  \includegraphics[width=.95\linewidth]{figures/bald.png}
  \caption{\texttt{Bald}}
  \label{fig:bald_apdx}
\end{subfigure}
\begin{subfigure}{\textwidth}
  \centering
  \vspace{2mm}
  \includegraphics[width=.95\linewidth]{figures/indian_fairface.png}
  \caption{\texttt{Race=Indian}}
  \label{fig:indian_apdx}
\end{subfigure}
\caption{Visualizations of the input generated by the GLOW model for a face from the CelebA dataset and a face from the FairFace dataset. The attributes in these figures are the sensitive attributes we use in our additional experiments.}
\label{fig:larger_visualisations}
\end{figure}

As described in section \ref{sec:own_contrib}, we performed additional experiments with new tasks and sensitive attributes. Since the output of the GLOW model can be unpredictable under some settings, we visualize its output for the sensitive attributes used, to check that this output is not corrupted. This is important for the analysis of the LASSI model. \newline

% Only if the GLOW model generates images that are in line with the sensitive attributes we are able to draw conclusions from the performance of LASSI.

In figure \ref{fig:larger_visualisations} we observe two visualizations, for \texttt{Chubby} and \texttt{Bald}, in which more than the corresponding sensitive attribute is altered. These visualizations look similar, and a change in multiple different attributes can be observed. A possible explanation for this result is that the GLOW model lacks data for these attributes and therefore creates an attribute vector that does not correspond to the desired attribute. The visualizations of \texttt{Race=Indian}, \texttt{Big\_Lips} and \texttt{Narrow\_Eyes} correspond with our expectations.

% which are the output of the sensitive attributes \textit{Chubby} and \textit{Bald}. Both of these visualizations look similar, and a shift in multiple different attributes can be observed. It seems that the GLOW model alters these images on three different attributes at the same time, namely: \textit{Male}, \textit{Bald} and \textit{Chubby}. A possible explanation for this result is that the GLOW model lacks data of these attributes, and therefore creates an attribute vector which does not correspond to the desired attribute vector. The visualization of \textit{Race=Indian}, \textit{Big\_Lips} and \textit{Narrow\_Eyes} correspond with our expectations.