[RE] Bad Seeds: Evaluating Lexical Methods for Bias Measurement

Anonymous

05 Feb 2022 (modified: 05 May 2023) · ML Reproducibility Challenge 2021 Fall Blind Submission · Readers: Everyone
Keywords: Seeds, NLP, Bias, Embedding, PCA, WEAT
TL;DR: We attempt to reproduce results claiming that the seed sets used by almost all bias quantification algorithms are themselves biased; we are able to verify only one of the three claims made in the original work.
Abstract: In this work we verify the results of "Bad Seeds: Evaluating Lexical Methods for Bias Measurement". We replicate the experiments conducted and verify the main claims made in the original paper: (1) bias measurements depend on the seeds and models used; (2) shuffled seed pairs can result in a significantly different bias subspace compared to ordered seed pairs; (3) set similarity is negatively correlated with the explained variance of the first PCA component of the seed-pairings subspace.

Methodology: We used skip-gram with negative sampling to train word2vec models with the same hyperparameters and data as the original paper. We implemented code for the experiments using the resulting word embeddings and the seed sets provided by the authors.

Results: Overall, only one claim was reproduced: that bias measurement depends on the choice of seed set. We were not able to adequately reproduce the claims that shuffled pairs of seed sets generally result in a less clearly defined bias subspace, or that, for pairs of seed sets, set similarity is negatively correlated with the explained variance of the first principal component.

What was easy: The paper is easy to follow, and the data is publicly available. The authors replied promptly, providing details about the parameters and preprocessing steps, and were open to discussion regarding the seeds on the project's GitHub repository.

What was difficult: In certain cases, the gathered seed sets JSON file contained errors. Specifically: 'daughters' was misspelled as 'daughers' (since corrected), and 'ma', 'am' was listed as two words instead of the single word "ma'am". The seed words for Figure 4 as stated in the appendix are not the same as the ones in the figure itself. Table 2 from the original paper was also difficult to reproduce: preprocessing according to the authors' description gave close but not equal results for the NYT dataset and significantly different results for the two other datasets.
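The PCA-based bias subspace at the heart of claims (2) and (3) can be sketched as follows. This is a minimal illustration, not the authors' code: toy random vectors stand in for the trained word2vec embeddings, the pair-centering construction follows the common Bolukbasi-style recipe, and the shuffling step only illustrates how breaking the pairing can change the first component's explained variance.

```python
import numpy as np
from sklearn.decomposition import PCA

def first_pc_explained_variance(set_a, set_b):
    """Explained variance ratio of the first principal component of a
    bias subspace built from paired seeds: center each pair about its
    mean, stack the centered vectors, and run PCA."""
    centered = []
    for a, b in zip(set_a, set_b):
        mu = (a + b) / 2.0
        centered.extend([a - mu, b - mu])
    X = np.array(centered)
    return PCA(n_components=min(X.shape)).fit(X).explained_variance_ratio_[0]

# Toy embeddings stand in for trained word2vec vectors: each seed pair
# shares a base vector and differs along a single "bias" direction.
rng = np.random.default_rng(0)
dim, n_pairs = 50, 10
direction = rng.normal(size=dim)
bases = rng.normal(size=(n_pairs, dim))
noise = lambda: 0.1 * rng.normal(size=dim)
set_a = [b + direction + noise() for b in bases]
set_b = [b - direction + noise() for b in bases]

ordered_ev = first_pc_explained_variance(set_a, set_b)
# Shuffling which word of set_b each word of set_a is paired with breaks
# the pairing, and the recovered subspace changes (cf. claim 2).
perm = rng.permutation(n_pairs)
shuffled_ev = first_pc_explained_variance(set_a, [set_b[i] for i in perm])
print(f"ordered: {ordered_ev:.2f}, shuffled: {shuffled_ev:.2f}")
```

With correctly ordered pairs, the pair-centered vectors all lie near the shared direction, so the first component explains most of the variance; with mismatched pairs, the base-vector differences leak into the differences and dilute it.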
The reproduced numbers are presented in the table. Because of time constraints, the 20 bootstrapped runs were not conducted. We contacted the original authors on multiple occasions to ask clarification questions regarding the implementation of the experiments.
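Claim (1), that measured bias depends on the choice of seed set, can be illustrated with a minimal WEAT-style association score. This is a hedged sketch under toy assumptions: the vectors are synthetic stand-ins for trained embeddings, and the two seed-set choices for the "same" concept are simulated as pointing in different directions.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, seeds_a, seeds_b):
    """WEAT-style association: mean cosine similarity of a target word
    to seed set A minus its mean similarity to seed set B."""
    return (np.mean([cosine(w, a) for a in seeds_a])
            - np.mean([cosine(w, b) for b in seeds_b]))

# Toy vectors stand in for trained embeddings. Two alternative seed-set
# choices for the same concept point in different directions, so the
# same target word receives different bias scores depending on seeds.
rng = np.random.default_rng(1)
dim = 50
d1, d2 = rng.normal(size=dim), rng.normal(size=dim)
seeds_a1 = [d1 + 0.1 * rng.normal(size=dim) for _ in range(5)]
seeds_b1 = [-d1 + 0.1 * rng.normal(size=dim) for _ in range(5)]
seeds_a2 = [d2 + 0.1 * rng.normal(size=dim) for _ in range(5)]
seeds_b2 = [-d2 + 0.1 * rng.normal(size=dim) for _ in range(5)]
target = d1  # a word aligned with the first seed direction

score_1 = association(target, seeds_a1, seeds_b1)
score_2 = association(target, seeds_a2, seeds_b2)
print(f"seed set 1: {score_1:.2f}, seed set 2: {score_2:.2f}")
```

The target word looks strongly biased under the first seed set and nearly neutral under the second, which is the seed-dependence the reproduced claim describes.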
Paper Url: https://aclanthology.org/2021.acl-long.148.pdf
Paper Venue: ACL 2021
