Causal Abstraction Finds Universal Representation of Race in Large Language Models

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: mechanistic interpretability, causal abstraction, bias and fairness
Abstract:

While there is growing interest in the potential bias of large language models (LLMs), especially in high-stakes decision making, it remains an open question how LLMs mechanistically encode such bias. We use causal abstraction (Geiger et al., 2023) to study how models use race information in two high-stakes decision settings: college admissions and hiring. We find that Alpaca 7B, Mistral 7B, and Gemma 2B check for an applicant's race and apply different preferential or discriminatory decision boundaries. The race subspace found by distributed alignment search generalizes across tasks, with average interchange intervention accuracies ranging from 78.09% to 88.64% across the three models. We also propose a novel RaceQA task, in which the model is asked to guess an applicant's race from the name in their profile, to further probe the mechanism of the bias. We show that patching in a different race representation changes the model's perception of the applicant's race 99.80% of the time for Alpaca and 98.20% of the time for Mistral. Overall, our work provides evidence for a universal mechanism of racial bias in LLMs' decision-making.
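The abstract's interchange intervention accuracies and patching results rest on one core operation: swapping an intermediate representation from a "source" run into a "base" run and checking whether the model's decision follows the source. The snippet below is a minimal, hedged sketch of that operation using a tiny PyTorch model as a stand-in for the LLM; all class and variable names (ToyModel, base, source, the layer chosen as the intervention site) are illustrative assumptions, not the paper's actual code, and distributed alignment search would additionally restrict the patch to a learned subspace of the activation rather than replacing it wholesale.

```python
# Sketch of an interchange intervention (activation patching).
# A toy MLP stands in for Alpaca/Mistral/Gemma; names are illustrative only.
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyModel(nn.Module):
    """Stand-in for an LLM: we patch the intermediate activation of layer1."""
    def __init__(self, d=8):
        super().__init__()
        self.layer1 = nn.Linear(d, d)
        self.layer2 = nn.Linear(d, 2)  # 2-way decision head (e.g. admit / reject)

    def forward(self, x):
        h = torch.relu(self.layer1(x))
        return self.layer2(h)

model = ToyModel()

# Base and source inputs, e.g. two applicant profiles that differ only in the
# race-bearing tokens (here just random vectors for illustration).
base = torch.randn(1, 8)
source = torch.randn(1, 8)

captured = {}

def capture_hook(module, inputs, output):
    # Record the source run's activation at the intervention site.
    captured["h"] = output.detach().clone()

def patch_hook(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output,
    # so the base run proceeds with the source activation patched in.
    return captured["h"]

# 1) Source run: capture the activation at the chosen site (layer1 output).
handle = model.layer1.register_forward_hook(capture_hook)
_ = model(source)
handle.remove()

# 2) Base run with the source activation swapped in at the same site.
handle = model.layer1.register_forward_hook(patch_hook)
patched_logits = model(base)
handle.remove()

# 3) Interchange intervention accuracy counts how often the patched base run
#    now makes the decision expected under the source's race value.
print("patched prediction:", patched_logits.argmax(-1).item())
print("source prediction: ", model(source).argmax(-1).item())
```

In the paper's setting, the intervention site would be a residual-stream position in the LLM and the patch would target the race subspace found by distributed alignment search; the toy example only illustrates the base/source swap that underlies those measurements.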

Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8875