Keywords: Gender Bias, Word Embeddings
Abstract:
Scope of Reproducibility
The authors claim that the frequency of words in the training corpus contributes to gender bias in the embeddings. Removing this frequency component from the embeddings and neutralizing the gender component yields gender-debiased embeddings that set new benchmarks on metrics quantifying gender bias.
Methodology
We use the authors' code and check it against the algorithm described in the paper for consistency. Double-hard debias is a post-training algorithm. After applying it, we evaluate the results on the datasets the authors use to benchmark it. We run these experiments on the free version of Google Colab. In our release, we add comments and rename variables to improve the readability of the code: https://anonymous.4open.science/r/74f2e710-e657-474d-a40b-e89af2790c57/
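The two removal steps named in the scope above (a frequency-related component, then the gender component) can be sketched as plain vector projections. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy data, and the choice of `component_index` are hypothetical, the frequency component is assumed to be one principal direction of the centered embeddings, and the gender component is simplified to a single direction.

```python
import numpy as np

def remove_component(vectors, direction):
    """Subtract from each row its projection onto the unit-normalised direction."""
    direction = direction / np.linalg.norm(direction)
    return vectors - np.outer(vectors @ direction, direction)

def double_hard_debias_sketch(embeddings, gender_direction, component_index):
    """Remove one principal component, then the gender direction (illustrative only)."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions of the centered embedding matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Assumption: the frequency-related component is the chosen principal direction.
    without_frequency = remove_component(centered, vt[component_index])
    # Assumption: the gender component is a single direction, not a subspace.
    return remove_component(without_frequency, gender_direction)

# Toy data standing in for real embeddings (hypothetical values).
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(50, 8))
toy_gender_direction = rng.normal(size=8)
debiased = double_hard_debias_sketch(toy_embeddings, toy_gender_direction, component_index=1)
```

After the final projection, every row of `debiased` is orthogonal to the gender direction. How the component index is selected is not shown here and would have to follow the paper.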
Results
The authors use two sets of evaluations to demonstrate the efficacy of their algorithm. First, they use the neighborhood metric, WEAT, and a co-reference resolution task to quantify gender bias in the embeddings. We were not able to reproduce the co-reference resolution task because the code for it is hard to read. Moreover, we report that the neighborhood metric test is not reproducible from the information the authors provide in their paper and code: we tried to reproduce it by filling the gaps with our own assumptions, but obtained drastically different results. Second, they test the quality of their word embeddings on existing benchmark tasks, namely word analogy and concept categorization. This part is reproducible to within 0.5% of the reported values.
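For readers unfamiliar with WEAT, the effect size it reports can be sketched as follows. This is a minimal version of the standard definition (difference of mean associations of two target sets, divided by the standard deviation of associations over both sets), not the code used in the original work. The toy vectors are hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, attrs_a, attrs_b):
    """Mean similarity of w to attribute set A minus mean similarity to set B."""
    return np.mean([cosine(w, a) for a in attrs_a]) - np.mean([cosine(w, b) for b in attrs_b])

def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference of mean associations, normalised by the spread over both target sets."""
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    # Assumption: sample standard deviation (ddof=1) over the union of both sets.
    return (np.mean(s_x) - np.mean(s_y)) / np.std(s_x + s_y, ddof=1)

# Hypothetical 2-d vectors: X leans towards A, Y leans towards B.
A = [np.array([1.0, 0.0]), np.array([0.9, 0.1])]
B = [np.array([0.0, 1.0]), np.array([0.1, 0.9])]
X = [np.array([1.0, 0.2]), np.array([0.8, 0.1])]
Y = [np.array([0.2, 1.0]), np.array([0.1, 0.8])]
effect = weat_effect_size(X, Y, A, B)
```

A positive value means the X targets are more associated with A than the Y targets are, and an effect size near zero indicates little measured association.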
What was easy
The authors' code has low readability, which we improve in our implementation. Other than that, the code is provided in the form of notebooks that run on the latest versions of all libraries. We run these notebooks on the free version of Google Colab, which makes reproduction economically feasible. The code and results are therefore essentially easy to re-implement.
What was difficult
It was difficult to map the algorithm provided in the paper to the code implementation because of poor coding standards. The neighborhood metric is difficult to implement because the authors do not provide a random state, which causes the results to vary between runs. The constants should be collected in a separate list to make it easier to run different experiments. Moreover, we were not able to reproduce the co-reference resolution test for measuring bias in the embeddings: the code provided by the authors for this experiment is difficult to understand and execute.
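The effect of a missing random state can be illustrated with a small clustering sketch. It assumes the neighborhood metric clusters gendered words into two groups and scores how well the clusters align with the gender labels. The plain 2-means routine below is a stand-in for a library k-means, and all names, the seed value, and the toy data are hypothetical.

```python
import numpy as np

def two_means_labels(vectors, seed, n_iter=20):
    """Plain 2-means with seeded initialisation (stand-in for a library k-means)."""
    rng = np.random.default_rng(seed)
    # The seed fixes which two points serve as initial centers.
    centers = vectors[rng.choice(len(vectors), size=2, replace=False)]
    for _ in range(n_iter):
        distances = np.linalg.norm(vectors[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = vectors[labels == k].mean(axis=0)
    return labels

def alignment_accuracy(labels, gender):
    """Agreement between clusters and gender labels, up to swapping cluster ids."""
    accuracy = float(np.mean(labels == gender))
    return max(accuracy, 1.0 - accuracy)

# Toy data: two groups of 20 vectors with known (hypothetical) gender labels.
data_rng = np.random.default_rng(0)
toy_vectors = np.vstack([data_rng.normal(2.0, 1.0, size=(20, 8)),
                         data_rng.normal(-2.0, 1.0, size=(20, 8))])
toy_gender = np.array([0] * 20 + [1] * 20)

RANDOM_STATE = 42  # hypothetical constant, kept in one place so runs are repeatable
labels_first = two_means_labels(toy_vectors, seed=RANDOM_STATE)
labels_second = two_means_labels(toy_vectors, seed=RANDOM_STATE)
score = alignment_accuracy(labels_first, toy_gender)
```

With the seed fixed, repeated runs produce identical cluster assignments, so the reported score no longer depends on the initialisation drawn in a particular run.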
Communication with original authors
We did not have any communication with the original authors.
Paper Url: https://openreview.net/forum?id=rcUCn8uqj35&referrer=%5BML%20Reproducibility%20Challenge%202020%5D(%2Fgroup%3Fid%3DML_Reproducibility_Challenge%2F2020)
Supplementary Material: zip