Your Fairness May Vary: Pretrained Language Model Fairness in Toxic Text Classification

29 Sept 2021 (modified: 13 Feb 2023) · ICLR 2022 Conference Withdrawn Submission
Keywords: group fairness, language models, toxic text classification
Abstract: Warning: This paper contains samples of offensive text. The popularity of pretrained language models in natural language processing systems calls for a careful evaluation of such models in downstream tasks, which carry a higher potential for societal impact. The evaluation of such systems usually focuses on accuracy measures. Our findings in this paper call for fairness measures to be considered as well. Through the analysis of more than a dozen pretrained language models of varying sizes on two toxic text classification tasks, we demonstrate that focusing on accuracy measures alone can lead to models with wide variation in fairness characteristics. Specifically, we observe that fairness can vary even more than accuracy with increasing training data size and different random initializations. At the same time, we find that little of the fairness variation is explained by model size or compression, despite claims in the literature. To improve model fairness without retraining, we show that two post-processing methods developed for structured, tabular data can be successfully applied to a range of pretrained language models.
One-sentence Summary: Characterization study of the performance and group fairness of language models in text toxicity classification
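Note: the abstract does not name the two post-processing methods. The sketch below illustrates one standard idea in that family, group-dependent decision thresholds applied to a classifier's scores (in the spirit of equalized-odds post-processing for tabular data), not the authors' exact procedure. All function names and the synthetic data are illustrative assumptions.

# Minimal sketch: per-group threshold post-processing to narrow a
# group-fairness gap (here, matching true-positive rates across groups)
# without retraining the underlying toxicity classifier.
import numpy as np

def group_tpr(y_true, y_pred, mask):
    """True-positive rate restricted to examples where mask is True."""
    pos = (y_true == 1) & mask
    return y_pred[pos].mean() if pos.any() else 0.0

def fit_group_thresholds(scores, y_true, groups, grid=np.linspace(0.05, 0.95, 19)):
    """Pick a decision threshold per group so each group's TPR is as close
    as possible to the overall TPR at the default 0.5 threshold."""
    target = group_tpr(y_true, scores >= 0.5, np.ones_like(y_true, dtype=bool))
    thresholds = {}
    for g in np.unique(groups):
        mask = groups == g
        gaps = [abs(group_tpr(y_true, scores >= t, mask) - target) for t in grid]
        thresholds[g] = grid[int(np.argmin(gaps))]
    return thresholds

def predict_with_thresholds(scores, groups, thresholds):
    """Apply the group-specific thresholds to raw classifier scores."""
    return np.array([s >= thresholds[g] for s, g in zip(scores, groups)], dtype=int)

# Toy usage with synthetic toxicity scores and a binary group attribute;
# the classifier is simulated to be systematically miscalibrated for group 1.
rng = np.random.default_rng(0)
n = 2000
groups = rng.integers(0, 2, n)
y_true = rng.integers(0, 2, n)
scores = np.clip(y_true * 0.6 + rng.normal(0.2, 0.2, n) - 0.15 * groups, 0, 1)

th = fit_group_thresholds(scores, y_true, groups)
y_post = predict_with_thresholds(scores, groups, th)
for g in (0, 1):
    print(f"group {g}: TPR before={group_tpr(y_true, scores >= 0.5, groups == g):.3f}, "
          f"after={group_tpr(y_true, y_post, groups == g):.3f}")

Because it only adjusts decision thresholds on held-out scores, this kind of post-processing applies uniformly to any pretrained model's outputs, which is consistent with the abstract's claim that tabular-data methods transfer to a range of language models.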