[Reproducibility Report] Double-Hard Debias: Tailoring Word Embeddings for Gender Bias Mitigation

Jan 31, 2021 (edited Apr 01, 2021) | ML Reproducibility Challenge 2020 Blind Submission | Readers: Everyone
  • Keywords: Word Embeddings, De-bias, Gender Bias
  • Abstract: Scope of Reproducibility: Our goal was to reproduce the original paper's central claim that projecting away the word-frequency direction(s) in word embeddings improves the debiasing performance of the well-known Hard Debias algorithm. The objectives were first to verify that such word-frequency direction(s) exist and then to verify that removing them decreases bias without significantly affecting the embeddings' utility. Methodology: We used the authors' supplied code and raw data as a starting point, though we had to modify and extend the code to replicate the authors' full data pipeline. Specifically, we needed to add code to export intermediate data files between Jupyter notebooks. In terms of computational resources, we were able to execute all of the code on our local laptops after installing the specified software dependencies. Results: We were able to reproduce the authors' findings that suggest the existence of a word-frequency direction. Specifically, we confirmed that with GloVe embeddings, the second principal component corresponds to the word-frequency direction. However, we were only partially able to show that projecting away the word-frequency direction improves de-biasing. Our de-biased embeddings contained more bias than the embeddings reported in the original paper: a gender classifier trained on our top-100 most biased words reported $66.5\%$ accuracy, while the authors reported $51.5\%$. That said, we verified that the de-biased embeddings preserve semantics, and on this point our results matched the authors' exactly. What was easy: The authors provided much of the required code, so reconstructing the data pipeline was not very challenging. What was difficult: The main difficulty was identifying precisely where our results began to diverge from the authors'. In the end we were able to reproduce the analogy metrics but not the classification accuracy values. Because there was limited logging in the Jupyter notebook, it was difficult to determine why our embeddings were more biased. Communication with original authors: We were in communication with Tianlu, the primary author, on several occasions. Tianlu provided timely responses that confirmed our approach matched the description laid out in the paper. In the end, we were not able to collectively determine the cause of the divergence in results. However, our findings do reproduce the qualitative finding that a word-frequency direction exists.
  • Paper Url: https://openreview.net/forum?id=rcUCn8uqj35&referrer=%5BML%20Reproducibility%20Challenge%202020%5D(%2Fgroup%3Fid%3DML_Reproducibility_Challenge%2F2020)
  • Supplementary Material: zip
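
The central operation described in the abstract, projecting a single principal component (for GloVe, the second one) out of the embeddings, can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `remove_principal_component`, the random toy matrix, and the choice to return mean-centered vectors are all assumptions made for this example, and the subsequent Hard Debias step is not shown.

```python
import numpy as np


def remove_principal_component(embeddings, index):
    """Project one principal direction out of every row of `embeddings`.

    Hypothetical helper for illustration only. Returns the mean-centered,
    projected vectors together with the removed unit direction.
    """
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are unit-length principal directions, ordered by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[index]
    # Subtract each vector's component along the chosen direction.
    projected = centered - np.outer(centered @ direction, direction)
    return projected, direction


# Toy stand-in for real word vectors (assumption: 200 words, 10 dimensions).
rng = np.random.default_rng(0)
scales = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
toy = rng.normal(size=(200, 10)) * scales

# index=1 selects the second principal component (0-based indexing).
cleaned, direction = remove_principal_component(toy, index=1)
```

After the call, every row of `cleaned` is orthogonal to `direction`, which is the property the projection step relies on. With real embeddings, the index of the frequency-related component would have to be determined empirically, as the report describes for GloVe.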