Abstract: Large language models (LLMs) are known to perpetuate stereotypes and exhibit biases. Various strategies have been proposed to mitigate such biases, but most prior work treats bias as a black-box problem, without considering how concepts are represented inside the model. We adapt techniques from representation engineering to study how the concept of "gender" is represented within LLMs. We introduce a new method that extracts concept representations via probability weighting, without requiring labeled data, and efficiently selects a steering vector for measuring and manipulating the model's internal representation of the concept. We further develop a projection-based method that enables precise steering of model predictions and demonstrate its effectiveness in mitigating gender bias in LLMs.
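The projection-based steering described in the abstract can be read as removing a concept direction from a hidden state: given a steering vector v, subtract the component of the hidden state h that lies along v. Below is a minimal sketch of that idea, assuming a precomputed "gender" direction; `project_out` and the toy vectors are hypothetical illustrations, not the authors' code.

```python
import numpy as np

def project_out(hidden: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of `hidden` along `direction`.

    Hypothetical helper illustrating projection-based steering:
    h' = h - (h . v_hat) v_hat, where v_hat is the unit steering vector.
    """
    v_hat = direction / np.linalg.norm(direction)
    return hidden - (hidden @ v_hat) * v_hat

# Toy usage: a 4-d hidden state and an assumed "gender" steering vector.
h = np.array([0.5, -1.2, 0.3, 0.8])
v = np.array([1.0, 0.0, 1.0, 0.0])

h_debiased = project_out(h, v)
# The steered state has (numerically) zero component along the direction.
print(h_debiased @ (v / np.linalg.norm(v)))  # ~0.0
```

In practice such a projection would be applied to intermediate activations during the forward pass; scaling the subtracted component instead of removing it entirely would allow graded, rather than all-or-nothing, manipulation of the concept.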
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: bias evaluation, bias mitigation, probing
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 5738