Keywords: steering, bias, interpretability, language models
TL;DR: By projecting LM activations onto vectors that encode properties such as expertise, we find causal evidence of latent biases that would not have been detectable in model outputs.
Abstract: Language models (LMs) capture meaningful structure, but they also often learn spurious correlations. These include demographic biases, where a model associates demographic groups with properties to which they have no causal connection. Post-training methods have reduced bias in models' outputs, but they do not necessarily address the internal mechanisms that give rise to bias, which could lead to unpredictable failure modes on future inputs. To investigate whether LMs encode internal biases, we derive steering vectors associated with various positive and negative properties. We verify that these vectors have predictable impacts on model behavior. Then, in a question answering task, we project the activations of hidden layers onto these vectors; this analysis shows that properties such as expertise or reliability are counterfactually dependent on demographic information. However, behavioral proxies of these variables show no relationship with demographic information. Finally, we demonstrate that these vectors have little impact in new task settings, such as a hiring task. This underscores the need to validate the findings of interpretability methods in out-of-distribution settings: the same bias phenomenon may be encoded in different subspaces, depending on the task setting.
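To make the projection readout concrete, here is a minimal sketch assuming a Hugging Face causal LM and a steering vector derived beforehand; the model name, layer index, prompt, and placeholder vector are illustrative assumptions, not the authors' actual setup.

    # Sketch: project a hidden-layer activation onto a property ("expertise") vector.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # hypothetical choice; any causal LM exposing hidden states works
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    # Assume a steering vector for the property was derived earlier (e.g. from
    # contrastive prompts); a random unit vector stands in for it here.
    layer = 8
    steer = torch.randn(model.config.hidden_size)
    steer = steer / steer.norm()

    prompt = "Answer the following question as a physician: ..."
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    # Project the final-token activation at the chosen layer onto the vector;
    # comparing this scalar across demographic variants of the prompt probes
    # whether the latent property depends counterfactually on demographics.
    h = out.hidden_states[layer][0, -1]
    score = torch.dot(h, steer).item()
    print(f"Layer {layer} projection onto property vector: {score:.3f}")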
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22699