Sensing and Steering Stereotypes: Extracting and Applying Gender Representation Vectors in LLMs

ACL ARR 2025 February Submission 7393 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Large language models (LLMs) are known to perpetuate stereotypes and exhibit biases. Previous work has proposed various strategies to mitigate the potential harms posed by these biases. Yet most work studies biases in LLMs as a black-box problem, with less attention given to understanding how these biases arise from the model's internal mechanisms. In this work, we use techniques from representation engineering to study how the concept of "gender" is represented within LLMs. We introduce a new method that extracts concept representations via probability weighting without labeled data and efficiently selects a steering vector for manipulating model outputs related to the concept. Additionally, we present a projection-based method that allows more precise steering of model predictions. We demonstrate its application in identifying gender representations for mitigating gender bias in LLMs. We show that our method produces steering vectors that better reflect the concept learned by the model than the prevailing approach, difference-in-means. Moreover, we demonstrate how the steering vectors can be used to reduce gender biases in model outputs.
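To make the abstract's terminology concrete, here is a minimal, illustrative sketch of the general ideas it names: the difference-in-means baseline, a label-free probability-weighted extraction of a concept vector, and projection-based steering. The abstract does not specify the paper's actual procedures, so the function names, the weighting scheme, and the steering rule below are assumptions for illustration only, operating on synthetic activations rather than real LLM hidden states.

```python
# Illustrative sketch only: the exact extraction and steering procedures are not given
# in the abstract, so the weighting scheme and steering rule below are assumptions
# meant to convey the general ideas, not the authors' method.
import torch


def diff_in_means_vector(h_pos: torch.Tensor, h_neg: torch.Tensor) -> torch.Tensor:
    """Prevailing baseline: difference of class-conditional means of hidden states.
    h_pos, h_neg: (n, d) hidden states for two labeled groups."""
    return h_pos.mean(dim=0) - h_neg.mean(dim=0)


def prob_weighted_vector(h: torch.Tensor, p_concept: torch.Tensor) -> torch.Tensor:
    """Label-free alternative (hypothetical): weight each hidden state by the model's
    own probability of a concept-related continuation.
    h: (n, d) hidden states; p_concept: (n,) probabilities in [0, 1]."""
    w = p_concept - p_concept.mean()            # center so the vector contrasts high vs. low
    v = (w.unsqueeze(1) * h).sum(dim=0)
    return v / v.norm()


def project_steer(h: torch.Tensor, v: torch.Tensor, alpha: float = 0.0) -> torch.Tensor:
    """Projection-based steering: remove the component of h along v, then optionally
    re-inject a controlled amount alpha along the concept direction."""
    v_hat = v / v.norm()
    coeff = h @ v_hat                           # (n,) projections onto the concept direction
    return h - coeff.unsqueeze(1) * v_hat + alpha * v_hat


if __name__ == "__main__":
    # Tiny synthetic demo with random activations standing in for real hidden states.
    torch.manual_seed(0)
    h = torch.randn(8, 16)                      # 8 prompts, 16-dim hidden states
    p = torch.rand(8)                           # model-assigned concept probabilities
    v = prob_weighted_vector(h, p)
    h_steered = project_steer(h, v, alpha=0.0)
    # After projection with alpha=0, the steered states carry ~no component along v.
    print((h_steered @ (v / v.norm())).abs().max())
```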
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: bias evaluation, bias mitigation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 7393