Keywords: bias, nlp, ai, embeddings, reproducibility, seeds
TL;DR: We reproduce the paper "Bad Seeds: Evaluating Lexical Methods for Bias Measurement" (Antoniak and Mimno, 2021)
Abstract:
Scope of Reproducibility: Combating bias in NLP requires bias measurement. Bias measurement is almost always achieved using lexicons of seed terms, i.e. sets of words specifying stereotypes or dimensions of interest. This reproducibility study focuses on Antoniak and Mimno (2021)'s main claim that the rationale behind the construction of these lexicons needs thorough checking before usage, as the seeds used for bias measurement can themselves exhibit biases. The study aims to evaluate the reproducibility of the quantitative and qualitative results presented in the paper and the conclusions drawn from them.
Methodology: We re-implement all of the approaches outlined in the original paper. We train a skip-gram word2vec model with negative sampling to obtain embeddings for four corpora. This requires no computing resources beyond a standard consumer personal computer. Additional code details can be found in our linked repository.
Results: We reproduce most of the results supporting the original authors' general claim: seed sets often suffer from biases that affect their performance as a baseline for bias metrics. Overall, our results mirror the original paper's. They differ slightly on select occasions, but not in ways that undermine the paper's general intent of showing the fragility of seed sets.
What was difficult: The main difficulties stemmed from the lack of publicly available code and documentation to clarify missing information in the paper. As a result, many algorithms that ultimately turned out to be quite simple required lengthy clarification with the authors or trial and error. The research was also quite data-intensive, which made some implementations non-trivial due to memory management.
What was easy: Once understood, the methods proposed by the authors were relatively easy to implement, and the mathematics involved is straightforward. The authors' emails were readily available, and their responses came quickly and were always helpful.
Communication with original authors: We maintained a lengthy email correspondence throughout the replication with one author, Maria Antoniak, whom we contacted to clarify several aspects of the paper's methodology. Specifically, this concerned the data-processing approach, missing hyperparameters, and the aggregation of metrics across different bootstrapped models. None of the original code was disclosed.
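The bias metrics the study evaluates operate over seed lexicons: a word's bias is typically scored by comparing its embedding's similarity to two opposing seed sets. Below is a minimal NumPy sketch of one common seed-based measure, the difference in mean cosine similarity to each set. This is illustrative only: it is not necessarily the exact metric from the paper, and the toy embeddings are invented for the example.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def seed_bias(word_vec, seeds_a, seeds_b, emb):
    """Difference in mean cosine similarity to two seed sets.
    Positive => closer to set A; negative => closer to set B."""
    sim_a = np.mean([cosine(word_vec, emb[s]) for s in seeds_a])
    sim_b = np.mean([cosine(word_vec, emb[s]) for s in seeds_b])
    return sim_a - sim_b

# Toy 2-d embeddings (invented; real ones would come from the
# trained word2vec model).
emb = {
    "he": np.array([1.0, 0.1]), "him": np.array([0.9, 0.2]),
    "she": np.array([0.1, 1.0]), "her": np.array([0.2, 0.9]),
    "nurse": np.array([0.3, 0.95]),
}
score = seed_bias(emb["nurse"], ["she", "her"], ["he", "him"], emb)
```

With these toy vectors the score is positive, i.e. "nurse" sits closer to the first seed set. The paper's central point is that the choice of seed words themselves (here "she"/"her" vs. "he"/"him") drives such scores, so biased or poorly justified seed sets distort the measurement.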
Paper Url: https://aclanthology.org/2021.acl-long.148/
Paper Venue: ACL 2021
Supplementary Material: zip