\documentclass{article}

% if you need to pass options to natbib, use, e.g.:
%     \PassOptionsToPackage{numbers, compress}{natbib}
% before loading neurips_2019

% ready for submission
% \usepackage{neurips_2019}

% to compile a preprint version, e.g., for submission to arXiv, add the
% [preprint] option:
%     \usepackage[preprint]{neurips_2019}

% to compile a camera-ready version, add the [final] option, e.g.:
\usepackage[nonatbib]{neurips_2019}

%%\usepackage{natbib}
% to avoid loading the natbib package, add option nonatbib:
%     \usepackage[nonatbib]{neurips_2019}

\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{multirow}

\usepackage[dvipsnames]{xcolor}
\usepackage[normalem]{ulem}

\usepackage{graphicx}
\usepackage{subfigure}
\usepackage{caption}

\graphicspath{ {./figures/} }

\newif{\ifhidecomments}

\newcommand{\hd}{Hard Debias}
\newcommand{\doublehd}{Double-Hard Debias}

\newcommand{\glove}{GloVe}
\newcommand{\hdglove}{Hard-GloVe}
\newcommand{\stronghdglove}{Strong Hard-GloVe}
\newcommand{\gnglove}{GN-GloVe}
\newcommand{\gnaglove}{GN-\glove($w_a$)}
\newcommand{\gpglove}{GP-GloVe}
\newcommand{\gpgnglove}{GP-GN-GloVe}
\newcommand{\doublehdglove}{Double-Hard GloVe}


\title{[RE] Double-Hard Debias: Tailoring Word Embeddings for Gender Bias Mitigation}

% The \author macro works with any number of authors. There are two commands
% used to separate the names and addresses of multiple authors: \And and \AND.
%
% Using \And between authors leaves it to LaTeX to determine where to break the
% lines. Using \AND forces a line break at that point. So, if LaTeX puts 3 of 4
% authors names on the first line, and the last on the second line, try using
% \AND instead of \And before the third author name.


\begin{document}



\maketitle

\section*{\centering Reproducibility Summary}

\subsection*{Scope of Reproducibility}

The authors claim that the frequency of words in the training corpus contributes to gender bias in the embeddings. Removing this frequency component from the embeddings, in addition to neutralizing the gender component, yields gender-debiased embeddings that set new benchmarks on metrics quantifying gender bias.


\subsection*{Methodology}

We use the authors' code and check it against the algorithm described in the paper for consistency. The double-hard debias algorithm is a post-training algorithm. After applying it, we evaluate the resulting embeddings on the datasets the authors use for benchmarking. We run these experiments on the free tier of Google Colab. We add comments and rename variables to improve the readability of the code in our release.\footnote{\url{https://anonymous.4open.science/r/74f2e710-e657-474d-a40b-e89af2790c57/}}


\subsection*{Results}

The authors use two sets of evaluations to demonstrate the efficacy of their algorithm. First, they use the neighborhood metric, WEAT, and a co-reference resolution task to quantify the gender bias in the embeddings. We were not able to reproduce the co-reference resolution task because the code is difficult to read. Moreover, the neighborhood metric test is not reproducible from the information the authors provide in their paper and code; we attempt to reproduce it under our own assumptions but obtain drastically different results. Second, they test the quality of the word embeddings on existing benchmark tasks: word analogy and concept categorization. This part is reproducible to within 0.5\% of the reported values.

\subsection*{What was easy}

Although the readability of the authors' code is low, which we improve in our release, the code is provided in the form of notebooks that run on the latest versions of all libraries. We run these notebooks on the free tier of Google Colab, which makes reproduction economically feasible. Re-running the code and regenerating the results is therefore straightforward.

\subsection*{What was difficult}

It was difficult to map the algorithm in the paper to the code because of poor coding standards. The neighborhood metric is difficult to implement because the authors do not provide the random state, so the results vary between runs. The constants should be collected in a separate list to make it easier to run the various experiments. Moreover, we were not able to reproduce the co-reference resolution test for measuring bias in the embeddings: the code provided by the authors for this experiment is difficult to understand and execute.

\subsection*{Communication with original authors}

We did not have any communication with the original authors.
\newpage
\section{Introduction}

Despite their widespread use in natural language processing (NLP) tasks, word embeddings have been criticized for inheriting unintended gender bias from their training corpora. As highlighted in \cite{bolukbasi2016man}, in word2vec embeddings trained on the Google News dataset \cite{mikolov2013distributed}, ``programmer'' is more closely associated with ``man'' and ``homemaker'' is more closely associated with ``woman''. Such gender bias has also been shown to propagate to downstream tasks. Despite plenty of work in this field, with methods ranging from corpus-level modifications to post-training modifications of the embeddings, it remains an unsolved problem. In the paper we reproduce, the authors combine two techniques to reduce gender bias in embeddings. First, they argue that the frequency of words in the corpus adds to the bias, and thus use the work of \cite{mu2017all} to remove the frequency component from the trained embeddings. Second, they use the hard debias algorithm of \cite{bolukbasi2016man} to remove the gender direction from the trained embeddings of the most biased words. Combining these two techniques, they benchmark their algorithm by showing a reduction in bias with limited loss of information in the resulting word embeddings.

\section{Scope of reproducibility}
\label{sec:claims}

The authors claim that the frequency of words in the training corpus contributes to gender bias in the embeddings. Removing this frequency component from the embeddings, in addition to neutralizing the gender component, yields gender-debiased embeddings that set new benchmarks.

\begin{itemize}
    \item Claim 1: The double-hard debias algorithm reduces gender bias significantly. This is verified with the three bias evaluations described in Section~\ref{sec:ds_exp}. We present the outcome of our experiments on these in Table~\ref{tab:WEAT} and Table~\ref{tab:bias_eval}.
    \item Claim 2: This post-processing algorithm for gender debiasing does not hamper the inherent use case of word embeddings. This is verified with the standard embedding quality measurements described in Section~\ref{sec:ds_exp}. We present the results of our experiments on these in Table~\ref{tab:analogy_categorization}.
\end{itemize}

Each subsection in Section~\ref{sec:results} addresses one of the above claims and discusses how far, and how easily, the authors' experiments supporting that claim can be reproduced.



\section{Methodology}
We use the authors' code to aid our understanding of the experiments and to reproduce the claims presented by the authors. We re-ran these experiments on Google Colab. To fully understand the algorithm, we relied on both the paper and the code.

\subsection{Model descriptions}

The authors introduce the double-hard debias algorithm in this paper. It is a post-training algorithm: it is applied to already-trained embeddings to reduce their gender bias, and hence requires no parameters to train. First, the frequency information is removed from the embeddings. This is done by computing the first \emph{k} principal components of the trained embeddings, where the value of \emph{k} is determined empirically, and removing the projections of the embeddings onto these \emph{k} components. Second, the gender direction is determined by averaging the differences between the embeddings of 10 gender word pairs, and the projection of each embedding onto this gender direction is removed. This completes the double-hard debias procedure.
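
The two steps described above can be written compactly. The following is only a sketch under our reading of that description, not the authors' exact formulation: the symbols $w$, $\mu$, $u_j$, $g$, $m_i$ and $f_i$ are our own notation, and the mean-centering step is an assumption on our part. Let $w \in \mathbb{R}^{d}$ be a word embedding, $\mu$ the mean of all embeddings, and $u_1, \ldots, u_k$ the first $k$ unit-norm principal components of the mean-centered embeddings. Removing the frequency component then reads
\[
  w' = (w - \mu) - \sum_{j=1}^{k} \left( u_j^{\top} (w - \mu) \right) u_j .
\]
Writing $m_i$ and $f_i$ for the embeddings of the male and female word of the $i$-th gender pair (the description does not state whether these are taken before or after frequency removal), the gender direction and the final debiased vector are
\[
  g = \frac{1}{10} \sum_{i=1}^{10} \left( m_i - f_i \right),
  \qquad
  \hat{w} = w' - \frac{g^{\top} w'}{g^{\top} g} \, g .
\]
By construction $g^{\top} \hat{w} = 0$, so the debiased vector carries no component along the gender direction.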

\begin{table*}[t]
\centering
\small
\begin{tabular}{ccccccc}
\toprule
\multirow{2}{*}{\textbf{Embeddings}} & \multicolumn{2}{c}{\bf Career \& Family } & \multicolumn{2}{c}{\bf Math \& Arts} & \multicolumn{2}{c}{\bf Science \& Arts}\\
 & \bf $d$ & \bf $p$ &  \bf $d$ & \bf $p$ & \bf $d$ & \bf $p$\\
\midrule
\glove & $1.81$ & $0.0$& $0.55$ & $0.14$ & $0.88$& $0.04$\\
\midrule
\gnglove & $1.82$ & $0.0$& $1.21$ & $6 \times 10^{-3}$ & $1.02$& $0.02$\\
\gnaglove & $1.76$ & $0.0$& $1.43$ & $1 \times 10^{-3}$ & $1.02$& $0.02$\\
\midrule
\gpglove & $1.81$ & $0.0$& $0.87$ & $0.04$ & $0.91$& $0.03$\\
\gpgnglove & $1.80$ & $0.0$& $1.42$ & $1 \times 10^{-3}$ & $1.04$& $0.01$\\
\midrule
\hdglove & $1.55$ & $2 \times 10^{-4}$& $0.07$ & $0.44$ & $ 0.16$& $0.62$\\
\stronghdglove & $1.55$ & $2 \times 10^{-4}$& $0.07$ & $0.44$ & $ 0.16$& $0.62$\\
\midrule
\doublehdglove & $1.53$ & $2 \times 10^{-4}$& $0.09$ & $0.57$ & $ 0.15$& $0.61$\\
\bottomrule
\end{tabular}
\caption{WEAT test of embeddings before and after debiasing. The bias is insignificant when the p-value satisfies $p > 0.05$. A lower effect size ($d$) indicates less gender bias. Significant gender bias related to Career \& Family and Science \& Arts words is effectively reduced by \doublehdglove. Note that for Math \& Arts words, gender bias is insignificant in the original \glove.}
\label{tab:WEAT}
\end{table*}


\subsection{Datasets and Experimental Setup}
\label{sec:ds_exp}
The authors perform two sets of experiments to highlight the efficacy of their approach. In the first set, they demonstrate the reduction in gender bias through three methods: co-reference resolution using the dataset of \cite{zhao2018gender} and the OntoNotes 5.0 dataset, WEAT, and the neighbourhood metric.
    \begin{itemize}
        \item \textbf{Co-reference Resolution}: Co-reference resolution aims at identifying noun phrases that refer to the same entity. \cite{zhao2018gender} identified gender bias in modern co-reference systems, e.g., ``doctor'' is prone to be linked to ``he'', and also created the new WinoBias dataset to quantify the bias in word embeddings.
        \item \textbf{WEAT}: The Word Embedding Association Test is a permutation test used to measure bias. The authors take male names and female names as the attribute sets and compute the differential association between these gender attribute sets and two sets of target words, as used in \cite{caliskan2017semantics}.
        \item \textbf{Neighbourhood Metric}: Introduced by \cite{gonen2019lipstick}, this metric measures bias by clustering. The authors take the top \emph{k} most biased words according to their cosine similarity with the gender direction in the original GloVe \cite{pennington2014glove} embedding space. They then run k-means to group them into two clusters and compute the alignment accuracy with respect to gender; results are presented in Table~\ref{tab:bias_eval}. The lower the accuracy, the less gender bias in the embeddings.
    \end{itemize}
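For reference, the WEAT effect size $d$ reported in Table~\ref{tab:WEAT} follows the standard definition of \cite{caliskan2017semantics}; the notation below ($X$, $Y$, $A$, $B$, $s$) is ours. For two target word sets $X$ and $Y$ and two attribute sets $A$ and $B$,
\[
  s(w, A, B) = \frac{1}{|A|} \sum_{a \in A} \cos(w, a) - \frac{1}{|B|} \sum_{b \in B} \cos(w, b),
\]
\[
  d = \frac{\frac{1}{|X|} \sum_{x \in X} s(x, A, B) - \frac{1}{|Y|} \sum_{y \in Y} s(y, A, B)}{\mathop{\mathrm{std}}_{w \in X \cup Y} \, s(w, A, B)} .
\]
The p-value is the fraction of equal-size re-partitions of $X \cup Y$ whose test statistic exceeds the observed one. For the neighbourhood metric, we assume the usual convention for alignment accuracy with two clusters: with $N$ words, gender labels $y_i \in \{0, 1\}$ and k-means assignments $c_i \in \{0, 1\}$,
\[
  a = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[c_i = y_i],
  \qquad
  \mathrm{accuracy} = \max(a, \, 1 - a),
\]
since the cluster indices are arbitrary. Under this assumption, $50\%$ is chance level and $100\%$ means that the two genders are perfectly separable.
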
In the second set, the authors show that the embeddings retain their information after the algorithm is applied. They use two tasks for this: the word analogy task and the concept categorization task.

    \begin{itemize}
        \item \textbf{Word Analogy}: Given three words A, B and C, the analogy task is to find the word D such that ``A is to B as C is to D''. In the experiments, D is the word that maximizes the cosine similarity between D and C - A + B. The authors evaluate all non-debiased and debiased embeddings on the MSR \cite{mikolov2013linguistic} word analogy task, which contains 8000 syntactic questions, and on the Google word analogy \cite{mikolov2013efficient} dataset, which contains 19,544 questions (Total), including 8,869 semantic (Sem) and 10,675 syntactic (Syn) questions.
        \item \textbf{Concept Categorization}: The goal of concept categorization is to cluster a set of words into different categorical subsets. For example, ``sandwich'' and ``hotdog'' are both food, and ``dog'' and ``cat'' are animals. The clustering performance is evaluated in terms of purity \cite{schutze2008introduction}, the fraction of the total number of words that are correctly classified. Experiments are conducted on four benchmark datasets: the Almuhareb-Poesio (AP) dataset \cite{almuhareb2006attributes}, the ESSLLI 2008 dataset \cite{baroni2008bridging}, the Battig 1969 set \cite{battig1969category} and the BLESS dataset \cite{baroni2011we}.
    \end{itemize}
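
Both quality measures above can be stated as short formulas. The notation is ours: $v_X$ denotes the embedding of word $X$ and $V$ the vocabulary. The analogy prediction is
\[
  D^{*} = \arg\max_{D \in V} \; \cos\left( v_D, \; v_C - v_A + v_B \right),
\]
where implementations commonly exclude A, B and C from the candidates, a detail that the description above does not state and that we only assume. For purity, with $N$ words, predicted clusters $c_1, \ldots, c_K$ and gold classes $t_1, \ldots, t_J$,
\[
  \mathrm{purity} = \frac{1}{N} \sum_{k=1}^{K} \max_{j} \left| c_k \cap t_j \right| .
\]
As a small illustration with the example words above, suppose the clustering returns $\{$sandwich, hotdog, dog$\}$ and $\{$cat$\}$. The first cluster overlaps best with the food class (2 words) and the second with the animal class (1 word), so the purity is $(2 + 1)/4 = 0.75$.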
    
All of the above are standard datasets and evaluation methods that do not require any post-processing and can be used directly to test any word embedding. The code we used to replicate the above experiments is available online.\footnote{\url{https://anonymous.4open.science/r/74f2e710-e657-474d-a40b-e89af2790c57/}}

\subsection{Computational requirements}

We used the free tier of Google Colab to run our experiments. Apart from the data download time, all experiments run within 30 minutes on the free Google GPU setup. To experiment with the different variants of \glove{} embeddings, we use the link\footnote{\url{http://www.cs.virginia.edu/~tw8cb/word_embeddings/}} provided by the authors.

\section{Results}
\label{sec:results}

Apart from two of the tests for claim 1, which concern the reduction in gender bias achieved by the method, we were able to reproduce all results published by the authors, and were thus able to fully verify claim 2.



\begin{table*}[t]
\centering
\small
\begin{tabular}{ccccccc}
\toprule
\multirow{2}{*}{\bf Embeddings} & \multicolumn{2}{c}{\bf Top 100}      & \multicolumn{2}{c}{\bf Top 500}      & \multicolumn{2}{c}{\bf Top 1000}     \\
                            & \textbf{Ours} & \textbf{Authors} & \textbf{Ours} & \textbf{Authors} & \textbf{Ours} & \textbf{Authors} \\
\midrule
\glove                       & $100.0$         & $100.0$            & $100.0$         & $100.0$            & $100.0$         & $100.0$            \\
\midrule
\gnglove                    & $100.0$         & $100.0$            & $100.0$         & $100.0$            & $99.8$          & $99.9$             \\
\gnaglove                & $100.0$         & $100.0$            & $99.5$          & $99.7$             & $89.4$          & $88.5$             \\
\midrule
\gpglove                    & $100.0$         & $100.0$            & $100.0$         & $100.0$            & $100.0$         & $100.0$            \\
\gpgnglove                 & $100.0$         & $100.0$            & $100.0$         & $100.0$            & $100.0$         & $99.4$             \\
\midrule
(Strong) \hdglove            & $76.5$          & $59.0$             & $80.2$          & $62.1$             & $80.2$          & $68.1$             \\
\midrule
\doublehdglove           & $66.5$          & $51.5$             & $74.1$          & $55.5$             & $70.4$          & $59.5$            \\
\bottomrule
\end{tabular}
\caption{Clustering accuracy (\%) of the top 100/500/1000 male and female words. Lower accuracy means fewer gender cues can be captured. \doublehdglove{} consistently achieves the lowest accuracy.}
\label{tab:bias_eval}
\end{table*}


\subsection{Results reproducing original paper}

\subsubsection{Result 1}

This section examines claim 1 of the authors, the reduction of gender bias as measured on three metrics. We successfully ran the WEAT test and obtained the results in Table~\ref{tab:WEAT}, which are comparable to the ones published by the authors. We were not able to reproduce the other two metrics. The neighbourhood metric calculation is largely not reproducible for two reasons:
\begin{enumerate}
    \item The authors do not state whether they normalized the word vectors before performing this experiment.
    \item The authors do not provide the random state with which they initialised the k-means clustering, which leads to different results.
\end{enumerate}



\begin{figure}[hp]
    \centering
    \includegraphics[width=8cm]{figures/ours/glove_discover.pdf}
    \caption{Clustering accuracy after projecting out D-th dominating direction and applying Hard Debias. Lower accuracy indicates less bias.}
    \label{fig:glove_discover}
\end{figure}


We try to replicate this experiment under our own set of assumptions but are not able to reproduce the authors' claims. Our assumptions are as follows:
\begin{enumerate}
    \item We experiment with both normalized and unnormalized vectors, and report in Table~\ref{tab:bias_eval} the best result, which was obtained with unnormalized vectors.
    \item We experiment with various random states and report the one with the best outcome.
    \item We remove the frequency feature along the second principal component, as this is the one the authors report in Figure~\ref{fig:glove_discover} to have the best performance. There is also a mismatch between this figure and the results in Table~\ref{tab:bias_eval}: the best score in the figure, computed on the top 1000 male and female words, is close to 0.7, whereas the authors' best result in the table is close to 0.59. This mismatch is explained neither in the paper nor in the code.
\end{enumerate}

In Figure~\ref{fig:t-sne_comp}, we compare the t-SNE \cite{van2008visualizing} visualizations published by the authors with the ones we obtained. We are unable to reproduce the authors' visualizations, owing to the challenges and differences in assumptions described above.

\begin{figure*}
\centering
\begin{subfigure}{}
  \centering
  \includegraphics[width=0.45\linewidth]{figures/authors/a_GloVe_500_4.pdf}
  \includegraphics[width=0.45\linewidth]{figures/ours/GloVe_500_4.pdf}
  \caption{\glove}
  \label{fig:glove}
\end{subfigure}

\begin{subfigure}{}
  \centering
  \includegraphics[width=0.45\linewidth]{figures/authors/a_Hard-GloVe_500_4.pdf}
  \includegraphics[width=0.45\linewidth]{figures/ours/Hard-GloVe_500_4.pdf}
  \caption{Hard-\glove}
  \label{fig:hd_glove}
\end{subfigure}

\begin{subfigure}{}
  \centering
  \includegraphics[width=0.45\linewidth]{figures/authors/a_Double-Hard-GloVe_500_4.pdf}
  \includegraphics[width=0.45\linewidth]{figures/ours/Double-Hard-GloVe_500_4.pdf}
  \caption{\doublehdglove}
  \label{fig:double_hd_glove}
\end{subfigure}
\caption{t-SNE visualization of the top $500$ most male and female embeddings. The authors' published visualizations are on the left and the ones we obtained in our experiments are on the right. In the \doublehdglove{} figures, the authors show the two clusters mixing, indicating less gender bias, which does not match our reproduction of the same experiment.}
\label{fig:t-sne_comp}
\end{figure*}

The second result that we were not able to reproduce is the co-reference resolution task. Owing to the poor readability of the authors' code, we were unable to execute this experiment.

\subsubsection{Result 2}

This section verifies claim 2 of the authors, that the resulting word embeddings retain the semantic and associative information that makes distributed word embeddings useful for natural language processing tasks. The authors use the word analogy and concept categorization tasks described in Section~\ref{sec:ds_exp}. We were able to reproduce the results published by the authors to within 0.5\% accuracy and present the outcomes in Table~\ref{tab:analogy_categorization}.
\begin{table*}[t]
\centering
\small
\begin{tabular}{c|cccc|cccc}
\toprule
\multirow{2}{*}{\bf Embeddings}& \multicolumn{4}{c|}{\bf Analogy} &\multicolumn{4}{c}{\bf Concept Categorization}
\\ 
 & Sem & Syn & Total & MSR & AP & ESSLLI & Battig & BLESS\\
\midrule
\glove & $80.5$ & $62.8$ & $70.8$ & $ 54.2$& 
$56.1$& $ 72.7$& $50.0$& $81.0$\\
\midrule
\gnglove & $77.6$ & $61.6$ & $68.9$ & $51.8$& 
$56.9$& $75.0$& $47.6$& $85.0$\\
\gnaglove & $77.7$ & $61.6$ & $68.9$ & $51.9$& 
$56.9$& $72.7$& $50.2$& $82.5$\\
\midrule
\gpglove & $80.6$ & $61.7$ & $70.3$ & $51.3$& 
$56.1$& $72.7$& $ 49.0$& $78.5$\\
\gpgnglove & $77.6$ & $61.7$ & $68.9$ & $51.8$& 
$61.1$& $70.4$& $ 50.9$& $77.5$\\
\midrule
\hdglove  & $80.3$ & $62.7$ & $70.7$ & $54.3$& 
$62.3$& $79.5$& $48.2$& $84.5$\\
\stronghdglove  & $78.9$ & $62.4$ & $69.8$ & $53.9$& 
$62.3$& $79.5$& $50.9$& $84.5$\\
\midrule
\doublehdglove & $ 80.9$ & $61.6$ & $70.4$ & $53.8$& 
$59.6$& $72.7$& $46.7$& $79.5$\\
\bottomrule
\end{tabular}
\caption{Results of word embeddings on word analogy and concept categorization benchmark datasets. Performance ($\times 100$) is measured in accuracy and purity, respectively. On both tasks, there is no significant degradation of performance due to applying the proposed method.}
\label{tab:analogy_categorization}
\end{table*}

\subsection{Results beyond the paper}

Here, we present a qualitative analysis of the gender bias in the word embeddings. We compare heavily biased words by their association with the gender word pair `he' and `she'. In Table~\ref{tab:qualitative_bias}, we present the difference in cosine similarity of a few biased words with respect to `he' and `she'. With this, we try to show that the authors' algorithm has indeed contributed to reducing gender bias.
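
The quantity described above can be made explicit. With $v_w$ denoting the embedding of word $w$ (our notation), and with the sign chosen so that negative values indicate a bias towards `she', as in the caption of Table~\ref{tab:qualitative_bias}, it reads
\[
  \Delta(w) = \cos\left( v_w, v_{\mathrm{he}} \right) - \cos\left( v_w, v_{\mathrm{she}} \right) .
\]
As a worked reading of one row, for \emph{nurse} the value moves from $-0.121$ to $0.033$, so its magnitude shrinks by $0.121 - 0.033 = 0.088$, roughly $73\%$ of the original magnitude.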

\begin{table*}[]
\centering
\small
\begin{tabular}{l|l|l}
\toprule
\textbf{Word} & \textbf{Before} & \textbf{After} \\
\midrule
doctor        & $0.013$           & $0.01$           \\
programmer    & $0.036$           & $-0.007$         \\
homemaker     & $-0.112$          & $0.033$          \\
nurse         & $-0.121$          & $0.033$          \\
worker        & $-0.007$          & $0.023$          \\
president     & $0.083$           & $0.034$          \\
politician    & $0.066$           & $0.029$         \\
\bottomrule
\end{tabular}
\caption{Qualitative analysis of some highly biased words before and after applying double-hard debiasing. A negative value means that the word is biased towards `she' and a positive value means that it is biased towards `he'.}
\label{tab:qualitative_bias}
\end{table*}


\section{Discussion}
The authors present a viable post-training method to reduce gender bias in non-contextual word embeddings. They use three benchmarks to demonstrate the reduction in gender bias. However, we were able to reproduce only one of them, and we obtained different results on the neighborhood metric.

We were able to closely reproduce the experiments that validate claim 2 of the paper, which show that the double-hard debias algorithm does not hamper the useful properties of word embeddings.

\subsection{What was easy}
The authors' code for claim 2 and for the double-hard debias algorithm was easy to run, as it was shared in the form of Jupyter notebooks. The pseudo-code for the algorithm was easy to understand, which made it easier to follow the given code. The authors structured the claims in the paper very well, which made it easier to match the experiments to these claims.

\subsection{What was difficult}
The authors' code for claim 1 and other sub-parts of the paper lacked structure and was therefore difficult to follow. For the co-reference resolution task, a sub-part of claim 1, we spent a lot of time trying to run the reference code but were still unable to execute the experiment, owing to the poor organization and readability of the code.

\newpage



\bibliographystyle{ieeetr}
\bibliography{rc2020}
\end{document}
