Attention Speaks Volumes: Localizing and Mitigating Bias in Language Models

26 Sept 2024 (modified: 13 Dec 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Bias, Attention, LLMs
TL;DR: This paper introduces ATLAS, a method for identifying and mitigating bias in large language models (LLMs) by localizing biased attention layers and applying targeted interventions.
Abstract: We explore the internal mechanisms of how bias emerges in large language models (LLMs) when provided with ambiguous comparative prompts: inputs that compare or enforce choosing between two or more entities without providing clear context for preference. Most approaches to bias mitigation focus on either post-hoc analysis or data augmentation. However, these are transient solutions that do not address the root cause: the model itself. Numerous prior works show the influence of the attention module in steering generations. We believe that analyzing attention is also crucial for understanding bias, as it provides insight into how the LLM distributes its focus across different entities and how this contributes to biased decisions. To this end, we first introduce a metric to quantify the LLM's preference for one entity over another. We then propose ATLAS (Attention-based Targeted Layer Analysis and Scaling), a technique that localizes bias to specific layers of the LLM by analyzing attention scores and then reduces bias by scaling attention in these biased layers. To evaluate our method, we conduct experiments across three datasets (BBQ, CrowS-Pairs, and WinoGender) using GPT-2 XL (1.5B), GPT-J (6B), LLaMA-2 (7B), and LLaMA-3 (8B). Our experiments demonstrate that bias is concentrated in the later layers, typically the last third. We also show that ATLAS effectively mitigates bias through targeted interventions without compromising downstream performance, incurring an average increase of only 0.34\% in perplexity when the intervention is applied. Across all datasets, ATLAS yields an average improvement of 0.28 points in the bias score.
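
To make the two-step procedure concrete, below is a minimal, illustrative Python sketch of how attention-based layer localization and scaling could be prototyped with GPT-2 and the HuggingFace Transformers library. This is not the authors' implementation: the prompt, entity pair, disparity threshold, and 0.8 scaling factor are assumptions for illustration, the bias signal here is a simple attention-disparity proxy rather than the paper's metric, and the hook rescales the attention module's output as a crude stand-in for scaling attention scores.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", output_attentions=True)
model.eval()

# Hypothetical ambiguous comparative prompt and entity pair (illustrative only).
prompt = "The doctor and the nurse argued, and the more competent one was the"
entity_a, entity_b = " doctor", " nurse"

def entity_attention(prompt, entity):
    # Per-layer attention (averaged over heads) from the final prompt token
    # to the tokens of the given entity.
    ids = tok(prompt, return_tensors="pt").input_ids
    ent_ids = tok(entity).input_ids
    starts = [i for i in range(ids.shape[1] - len(ent_ids) + 1)
              if ids[0, i:i + len(ent_ids)].tolist() == ent_ids]
    span = list(range(starts[0], starts[0] + len(ent_ids)))
    with torch.no_grad():
        attns = model(ids).attentions  # one (1, heads, seq, seq) tensor per layer
    return torch.tensor([a[0, :, -1, span].mean() for a in attns])

# Step 1 -- localize: flag layers whose attention is most unevenly split
# between the two entities (this threshold is an assumption, not the paper's rule).
a_scores = entity_attention(prompt, entity_a)
b_scores = entity_attention(prompt, entity_b)
disparity = (a_scores - b_scores).abs()
biased_layers = (disparity > disparity.mean() + disparity.std()).nonzero().flatten().tolist()
print("layers flagged as biased:", biased_layers)

# Step 2 -- intervene: rescale the attention modules of the flagged layers.
SCALE = 0.8  # assumed factor; the actual method would tune this per layer

def make_hook(scale):
    def hook(module, inputs, output):
        if isinstance(output, tuple):          # GPT-2 attention returns a tuple
            return (output[0] * scale,) + output[1:]
        return output * scale
    return hook

handles = [model.transformer.h[l].attn.register_forward_hook(make_hook(SCALE))
           for l in biased_layers]
with torch.no_grad():
    out = model.generate(tok(prompt, return_tensors="pt").input_ids,
                         max_new_tokens=5, do_sample=False,
                         pad_token_id=tok.eos_token_id)
print(tok.decode(out[0]))
for h in handles:
    h.remove()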
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7766