Mitigating Biases for Instruction-following Language Models via Bias Neurons Elimination

Anonymous

16 Dec 2023 | ACL ARR 2023 December Blind Submission | Readers: Everyone

TL;DR: Mitigating Biases for Instruction-following Language Models via Bias Neurons Elimination

Abstract: Instruction-following language models often show undesirable biases. These biases are exacerbated in real-world usage of language models, where a wide range of instructions is used through zero-shot prompting. To address this problem, we first define a "bias neuron" as a neuron that significantly affects biased outputs, and we empirically demonstrate that such neurons exist. Furthermore, we propose a novel and practical bias mitigation method, CRISPR, to eliminate the bias neurons of language models in instruction-following settings. CRISPR automatically determines biased outputs and categorizes the neurons that affect those outputs as bias neurons using attribution, an explainability method. Experimental results demonstrate the effectiveness of our method in mitigating biases under zero-shot instruction-following settings without losing the model's task performance or existing knowledge. The results also indicate that our method generalizes, as it remains robust across various instructions and datasets. Surprisingly, our method can mitigate bias in language models by eliminating only a few neurons (e.g., three neurons).
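
The abstract does not spell out the attribution formula or the elimination procedure, so the following is only a minimal sketch of the general idea (score neurons by their contribution to a biased output, then mask the top few), not the authors' CRISPR implementation. It assumes a toy one-hidden-layer network and an activation-times-gradient attribution; the weights, inputs, function names, and the choice of k are all hypothetical.

```python
# Toy sketch of attribution-based neuron elimination.
# Assumptions (not from the paper): a one-hidden-layer ReLU network,
# activation-times-gradient attribution, and hand-picked toy weights.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def forward(w_in, w_out, x, mask=None):
    """Return (hidden activations, output logits), with an optional neuron mask."""
    hidden = relu(matvec(w_in, x))
    if mask is not None:
        hidden = [h * m for h, m in zip(hidden, mask)]
    return hidden, matvec(w_out, hidden)

def neuron_attributions(w_in, w_out, inputs, biased_label):
    """Sum activation-times-gradient scores for the biased logit over all inputs."""
    scores = [0.0] * len(w_in)
    for x in inputs:
        hidden, _ = forward(w_in, w_out, x)
        for i, h in enumerate(hidden):
            # For a linear output layer, d(logit_b)/d(hidden_i) is w_out[b][i].
            scores[i] += h * w_out[biased_label][i]
    return scores

def elimination_mask(scores, k):
    """Build a 0/1 mask that zeroes the k neurons with the highest scores."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [0.0 if i in top else 1.0 for i in range(len(scores))]

# Hypothetical toy model: 2 inputs, 3 hidden neurons, 2 output labels.
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_OUT = [[0.5, 0.2, 0.0], [0.1, 0.1, 2.0]]
INPUTS = [[1.0, 0.5], [0.2, 1.0]]
BIASED_LABEL = 1  # index of the output treated as the biased one

scores = neuron_attributions(W_IN, W_OUT, INPUTS, BIASED_LABEL)
mask = elimination_mask(scores, k=1)
_, before = forward(W_IN, W_OUT, INPUTS[0])
_, after = forward(W_IN, W_OUT, INPUTS[0], mask)
print("attribution scores:", scores)
print("mask:", mask)
print("logits before:", before, "after:", after)
```

In this toy setup, the single masked neuron is the one that feeds almost exclusively into the biased logit, so that logit drops while the other logit is unchanged. A real language model would need attribution computed over its actual layers and a check that task performance is preserved, which the abstract reports but this sketch does not test.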

Paper Type: long

Research Area: Ethics, Bias, and Fairness

Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings (efficiency)

Languages Studied: English
