Improving Mitigation of Language Model Stereotypes via Reinforcement Learning

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission · Readers: Everyone
TL;DR: We propose REFINE-LM, a novel architecture, based on Reinforcement Learning, designed to mitigate unintended bias in pre-trained masked language models.
Abstract: The widespread adoption of applications powered by large language models such as BERT and GPT has raised concerns within the community about the unintended bias that such models can inherit from their training data. For example, past work reports evidence of LLMs that propagate gender stereotypes as well as geographical and racial biases. Previous approaches have relied on data pre-processing or on directly debiasing embeddings, with substantial disadvantages: increased resource requirements, heavy annotation effort, and limited applicability to a sufficiently wide range of bias types. In this paper, we propose REFINE-LM, a post-hoc bias filter trained with reinforcement learning that is agnostic to both model architecture and bias type. Experiments across a range of models, including DistilBERT, BERT, and RoBERTa, show that the proposed method (i) substantially reduces stereotypical bias while preserving language model performance; (ii) applies to a wide range of bias types, generalizing across contexts such as gender, ethnicity, religion, and nationality-based biases; and (iii) reduces the required training resources.
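To make the post-hoc idea concrete, the sketch below shows one plausible way a small trainable filter could re-rank a frozen masked LM's predictions and be updated with a REINFORCE-style policy gradient. It is an illustrative assumption, not the paper's actual method: the class name FilterHead, the surrogate fairness reward, the prompt pair, and the choice of k = 20 candidates are all hypothetical.

```python
# Illustrative sketch only: a tiny post-hoc "filter" layer on top of a frozen
# masked LM, trained with a REINFORCE-style policy gradient so that counterfactual
# prompts receive similar filtered predictions. All names and the reward are
# hypothetical and not taken from the REFINE-LM paper.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
mlm.eval()                                   # the underlying LM stays frozen
for p in mlm.parameters():
    p.requires_grad_(False)

k = 20                                       # re-rank only the top-k candidates


class FilterHead(torch.nn.Module):
    """Small trainable layer that re-weights the frozen LM's top-k token logits."""

    def __init__(self, k):
        super().__init__()
        self.proj = torch.nn.Linear(k, k)

    def forward(self, topk_logits):
        return F.log_softmax(self.proj(topk_logits), dim=-1)


def topk_mask_logits(prompt):
    """Top-k logits of the frozen LM at the [MASK] position."""
    enc = tok(prompt, return_tensors="pt")
    pos = (enc["input_ids"] == tok.mask_token_id).nonzero()[0, 1]
    logits = mlm(**enc).logits[0, pos]
    return logits.topk(k).values


head = FilterHead(k)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

# Counterfactual prompt pair (hypothetical template, gender bias as the example).
pair = ("she works as a " + tok.mask_token + ".",
        "he works as a " + tok.mask_token + ".")

for step in range(200):
    log_probs = []
    for prompt in pair:
        logp = head(topk_mask_logits(prompt))          # filtered log-distribution
        choice = torch.distributions.Categorical(logits=logp).sample()
        log_probs.append(logp[choice])
    # Rough surrogate reward: sampled completions should be about equally likely
    # under both counterfactual prompts after filtering.
    reward = -torch.abs(log_probs[0].exp() - log_probs[1].exp()).detach()
    loss = -(reward * (log_probs[0] + log_probs[1]))   # REINFORCE update
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because only the filter head is updated while the pre-trained model remains untouched, a scheme like this would stay agnostic to the underlying architecture and keep training costs low, which is the property the abstract emphasizes.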
Paper Type: long
Research Area: Ethics, Bias, and Fairness
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English