BiasEdit: Debiasing Stereotyped Language Models via Model Editing

Published: 22 Sept 2025, Last Modified: 22 Sept 2025
Venue: WiML @ NeurIPS 2025
License: CC BY 4.0
Keywords: debias, large language model, social bias
Abstract: Existing debiasing strategies, such as retraining a model on counterfactual data, representation projection, and prompting, often fail to eliminate bias efficiently or to directly alter the model's biased internal representations. To address these issues, we propose **BiasEdit**, an efficient debiasing technique based on model editing. BiasEdit employs a *debiasing loss* $\mathcal{L}_d = \text{KL}(P_{\theta_{\tilde{\mathcal{W}}}}(x_{\text{stereo}}) \| P_{\theta_{\tilde{\mathcal{W}}}}(x_{\text{anti}})) + \text{KL}(P_{\theta_{\tilde{\mathcal{W}}}}(x_{\text{anti}}) \| P_{\theta_{\tilde{\mathcal{W}}}}(x_{\text{stereo}}))$ to guide editor networks in performing local edits on a subset of a language model's parameters for debiasing, while preserving language modeling ability during editing through a *retention loss* $\mathcal{L}_r = \text{KL}(P_{\theta_{\mathcal{W}}}(x_{\text{mless}}) \| P_{\theta_{\tilde{\mathcal{W}}}}(x_{\text{mless}}))$. Experiments on [StereoSet](https://aclanthology.org/2021.acl-long.416/) and [CrowS-Pairs](https://aclanthology.org/2020.emnlp-main.154/) demonstrate the effectiveness, efficiency, and robustness of BiasEdit in eliminating bias compared to debiasing baselines, with little to no impact on the language model's general capabilities.
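A minimal sketch of the two losses over toy discrete distributions (the function names and the plain-Python setting are illustrative assumptions; in BiasEdit itself, these KL terms are computed over the language model's output distributions for stereotyped, anti-stereotyped, and meaningless sentence variants, with $\theta_{\mathcal{W}}$ the pre-edit and $\theta_{\tilde{\mathcal{W}}}$ the edited parameters):

```python
import math

def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def debias_loss(p_stereo, p_anti):
    """Symmetric KL (the debiasing loss L_d): pushes the edited model to
    assign matching probabilities to the stereotyped and anti-stereotyped
    variants, so neither is preferred."""
    return kl(p_stereo, p_anti) + kl(p_anti, p_stereo)

def retention_loss(p_orig, p_edited):
    """Retention loss L_r: KL between the pre-edit and post-edit
    distributions on the meaningless variant, so the edit does not
    degrade general language modeling."""
    return kl(p_orig, p_edited)
```

For instance, `debias_loss([0.9, 0.1], [0.1, 0.9])` is large while `debias_loss([0.5, 0.5], [0.5, 0.5])` is zero, reflecting that the loss is minimized when the model is indifferent between the two variants.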
Submission Number: 155