[Re] Assessing the Reliability of Word Embedding Gender Bias Measures

Anonymous

05 Feb 2022 (modified: 05 May 2023) | ML Reproducibility Challenge 2021 Fall Blind Submission | Readers: Everyone
Keywords: Reproducibility, NLP, Reliability, Consistency, Bias measures, Gender bias
TL;DR: An attempt to reproduce and replicate research on the consistency and reliability of gender bias measures in word embedding models.
Abstract:

Scope of Reproducibility
This work attempts to reproduce the results of the paper 'Assessing the Reliability of Word Embedding Gender Bias Measures' by Du et al. (2021). In that paper, the authors test to what extent gender bias measures are consistent and reliable in popular word embedding models. The main claims of the original paper with regard to word embedding gender bias measures are:
1. High test-retest reliability
2. High internal consistency
3. Low inter-rater consistency
It is important to verify the results claimed by studies in order to preserve the integrity of scientific research. The scientific community has therefore encouraged reproducing papers to make researchers more aware of the reproducibility of their future work. We support this movement by reproducing the work of Du et al. (2021).

Methodology
We used the authors' code to attempt to reproduce the results. Furthermore, we investigated whether the evaluation framework proposed by the authors is also applicable to other forms of bias; to this end, we altered the code to assess the reliability of measuring sexual orientation bias in word embeddings. The experiments were run on a machine with an Intel i7-8700 CPU and a machine with an Intel Xeon Bronze 3104 CPU. The total (sequential) running time was roughly 150 hours.

Results
The reproduced results largely agree with the claims made by the original paper. However, we found that the variance of the test-retest reliability scores depends on the batch of random seeds used. We therefore suggest that more random seeds are needed to support the original paper's first claim of high test-retest reliability.

What was easy
The paper was very well documented: it was clear what the authors wanted to test, how they tested it, and what conclusions could be drawn from it.

What was difficult
This clear documentation was not entirely reflected in the code. The code contained many bugs, and the README file did not contain all the information needed to successfully reproduce the experiments.

Anonymous Github URL: https://anonymous.4open.science/r/MLRC-2021-CD10/
Paper Url: https://aclanthology.org/2021.emnlp-main.785.pdf
Paper Venue: EMNLP 2021
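
To make the test-retest reliability point from the Results above concrete, the following is a minimal, self-contained sketch (not the authors' code) of how per-word gender bias scores can be correlated across embedding models retrained with different random seeds. The toy embedding "training" step, the word list, and the relative-cosine bias measure are hypothetical stand-ins for the models and measures used in the paper.

```python
# Illustrative sketch: test-retest reliability of a simple gender bias
# measure across random seeds. All names here are hypothetical stand-ins.
import numpy as np
from scipy.stats import pearsonr

WORDS = ["nurse", "engineer", "teacher", "doctor", "artist", "lawyer"]

def train_toy_embeddings(seed, dim=50):
    """Stand-in for retraining an embedding model with a given random seed."""
    rng = np.random.default_rng(seed)
    vocab = WORDS + ["he", "she"]
    return {w: rng.normal(size=dim) for w in vocab}

def gender_bias(emb, word):
    """Toy bias score: cosine(word, 'he') - cosine(word, 'she')."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cos(emb[word], emb["he"]) - cos(emb[word], emb["she"])

def test_retest(seed_a, seed_b):
    """Correlate per-word bias scores across two retrainings (two seeds)."""
    emb_a, emb_b = train_toy_embeddings(seed_a), train_toy_embeddings(seed_b)
    scores_a = [gender_bias(emb_a, w) for w in WORDS]
    scores_b = [gender_bias(emb_b, w) for w in WORDS]
    r, _ = pearsonr(scores_a, scores_b)
    return r

# Repeating this over several batches of seed pairs is what exposes how the
# reliability estimate (and its variance) depends on the seeds drawn.
if __name__ == "__main__":
    print([round(test_retest(2 * i, 2 * i + 1), 3) for i in range(5)])
```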