Understanding and Mitigating Gender Bias in LLMs via Interpretable Model Editing

20 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: large language models, gender bias, mechanistic interpretability, model editing
TL;DR: We propose a neuron-level interpretable model editing method that reduces gender bias in LLMs without hurting their existing abilities.
Abstract: Large language models (LLMs) have achieved great success across a wide range of tasks. While LLMs learn powerful capabilities from large datasets, they also inherit the gender bias present in that data. Existing studies typically reduce bias through data cleaning and model retraining or fine-tuning. Although these methods have shown some success, the cost of curating data and retraining or fine-tuning an LLM grows significantly with model size. Moreover, a limited understanding of the mechanisms behind gender bias prevents researchers from tailoring solutions to address it effectively. In this paper, we use mechanistic interpretability methods to construct neuron circuits for gender bias cases and to locate the important neurons that store gender bias. We then propose the Interpretable Model Editing (Interpret-ME) method, which reduces gender bias without requiring large curated datasets or fine-tuning. Compared to fine-tuning methods, our approach achieves competitive results in reducing gender bias in experiments across 8 LLMs, while leaving performance on other tasks unaffected. Overall, our analysis helps explain the mechanism of gender bias in LLMs, and our method points to a practical way of reducing it.
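To make the neuron-level idea concrete, below is a minimal sketch of one plausible instantiation, not the paper's actual Interpret-ME code. It assumes GPT-2 (via Hugging Face transformers) as a stand-in model, scores each FFN neuron by projecting its value vector onto the "he" minus "she" unembedding direction (a logit-lens-style attribution), and then dampens the top-scoring neurons instead of retraining. The hyperparameters k and alpha are hypothetical.

```python
# Illustrative sketch only: locate FFN neurons aligned with a gendered
# pronoun direction and scale them down (no retraining or fine-tuning).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")

E = model.lm_head.weight                 # unembedding matrix, (vocab, d_model)
he = tok.encode(" he")[0]                # " he" is a single GPT-2 token
she = tok.encode(" she")[0]
gender_dir = E[he] - E[she]              # direction separating the two pronouns

scores = []
for layer, block in enumerate(model.transformer.h):
    # In GPT-2's Conv1D, c_proj.weight has shape (d_ff, d_model); row i is
    # the value vector that FFN neuron i writes into the residual stream.
    value_vecs = block.mlp.c_proj.weight
    proj = value_vecs @ gender_dir       # (d_ff,) bias score per neuron
    for idx in proj.abs().topk(5).indices.tolist():
        scores.append((proj[idx].abs().item(), layer, idx))

# Edit step: dampen the k most gender-aligned neurons.
k, alpha = 20, 0.1                       # hypothetical hyperparameters
for _, layer, idx in sorted(scores, reverse=True)[:k]:
    with torch.no_grad():
        model.transformer.h[layer].mlp.c_proj.weight[idx] *= alpha
```

In practice one would validate such an edit on both a bias benchmark and general-capability tasks, since the abstract's central claim is that the located neurons can be edited without degrading other abilities.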
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2115