Keywords: bias auditing, adversarial NLP, counterfactual NLP
Abstract: Current measurements of stereotype/group bias in language models do not take into account the prediction variability stemming from the lack of robustness in these models. Starting from a recently proposed bias auditing benchmark for natural language inference (NLI) systems, we demonstrate that slight lexical variations with unchanged semantics can lead to different predictions and, consequently, to different bias scores. We generate adversarial samples by employing masked language models to suggest lexical variations for the original hypotheses included in the benchmark. Using these samples, we audit several state-of-the-art language models fine-tuned for NLI tasks and demonstrate that their lack of robustness influences bias measurements. To account for this issue, we propose a new metric for measuring bias that takes into account the disparate prediction outcomes for counterfactual samples, where only the targeted subpopulation differs. To achieve this, we build a counterfactual-based dataset and compare the new measure of bias with previous proposals. We publicly release these two datasets to inspire research on the robustness-bias interplay and better metrics for bias auditing.
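The abstract describes generating adversarial samples by masking tokens in an NLI hypothesis and letting a masked language model suggest lexical replacements, then checking whether the NLI prediction changes. Below is a minimal sketch of that idea, assuming the HuggingFace transformers pipelines; the model checkpoints, the masked position, and the example sentences are illustrative assumptions, not the authors' released code or benchmark.

```python
# Hedged sketch: generate lexical variants of an NLI hypothesis with a masked
# language model, then check whether an NLI model's prediction flips.
from transformers import pipeline

# Fill-mask pipeline suggests substitutes for a masked token (checkpoint is an assumption).
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
# Any NLI-fine-tuned model could be audited; this checkpoint is also an assumption.
nli = pipeline("text-classification", model="roberta-large-mnli")

premise = "The doctor greeted the patient."
hypothesis = "The doctor was friendly to the patient."


def lexical_variants(sentence: str, position: int, top_k: int = 5):
    """Mask the whitespace token at `position` and return MLM-suggested full sentences."""
    tokens = sentence.split()
    tokens[position] = fill_mask.tokenizer.mask_token
    masked = " ".join(tokens)
    return [suggestion["sequence"] for suggestion in fill_mask(masked, top_k=top_k)]


def nli_label(premise: str, hypothesis: str) -> str:
    """Predict the NLI label (entailment / neutral / contradiction) for a premise-hypothesis pair."""
    result = nli({"text": premise, "text_pair": hypothesis})
    result = result[0] if isinstance(result, list) else result
    return result["label"]


original_label = nli_label(premise, hypothesis)
# Position 3 masks the adjective "friendly"; a real audit would sweep positions/templates.
for variant in lexical_variants(hypothesis, position=3):
    if nli_label(premise, variant) != original_label:
        print(f"Prediction flips for variant: {variant!r}")
```

Such prediction flips on semantically equivalent variants are what the paper argues can distort bias scores, motivating a metric that aggregates disagreement across counterfactual pairs rather than relying on a single prediction per sample.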
Paper Type: long
Research Area: Ethics, Bias, and Fairness