“Flex Tape Can’t Fix That”: Bias and Misinformation in Edited Language Models

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission
TL;DR: A new benchmark dataset and the first systematic investigation into demographic biases and qualitative harms of model weight editing.
Abstract: Information is generated and edited at a rate that outpaces the time and compute required to retrain large language models. Model editing has therefore emerged as a cheaper and faster strategy for updating knowledge stored in language models. However, model editing can have unintended consequences, both on information that is supposed to remain unchanged and on the general behavior of the model. This work introduces Seesaw-CF, a novel benchmark dataset for measuring bias-related harms of model editing. Using Seesaw-CF, we conduct the first in-depth investigation of the pitfalls of the Constrained Fine-Tuning, MEND, and MEMIT model editing methods, focusing on biases with respect to demographic groups such as race and gender, as well as qualitative flaws in long-form texts generated by edited language models. Our preliminary findings indicate that editing model weights makes GPT-J less confident in its knowledge about entities from Asian and African countries, and that factual edits can amplify sexism and xenophobia.
Paper Type: long
Research Area: Ethics, Bias, and Fairness
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English