Abstract: This work investigates the concentration of demographic signals in high-dimensional embeddings, focusing on a "bias subspace" that encodes sensitive attributes such as gender. Experiments on textual job biographies reveal that a single vector, derived by subtracting subgroup means, can correlate with gender above 0.95, indicating that only a few coordinates often capture dominant group distinctions. A further analysis using covariance differences isolates additional, though weaker, bias directions. To explain why neutralizing the principal bias dimension barely impairs classification performance, this paper introduces a Bounded Degradation Theorem. The result shows that unless a downstream classifier aligns heavily with the removed axis, any resulting logit shifts remain bounded, thus preserving accuracy. Empirical observations confirm that group-level outcomes shift, yet overall accuracy remains nearly unchanged. Theoretical and experimental insights highlight both the geometric underpinnings of bias in language-model embeddings and practical strategies for mitigating undesired effects, while leaving most classification power intact.
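The following is a minimal sketch, not the paper's code, of the two steps the abstract describes: deriving a bias direction from subgroup mean differences and neutralizing it by projection before training a downstream classifier. All data, dimensions, and the synthetic group signal are assumptions made for illustration only.

```python
# Sketch of mean-difference bias direction and projection-based neutralization.
# Embeddings, labels, and the planted group signal are synthetic placeholders,
# not the paper's job-biography data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins: n "biography embeddings" of dimension d,
# a binary sensitive attribute g, and a task label y.
n, d = 1000, 128
g = rng.integers(0, 2, size=n)                      # sensitive attribute (e.g., gender)
X = rng.normal(size=(n, d))
X[:, 0] += 2.0 * g                                  # plant a group signal in one coordinate
y = (X[:, 1] + 0.1 * rng.normal(size=n) > 0).astype(int)  # task label, largely independent of g

# 1) Bias direction as the difference of subgroup means, normalized to unit length.
v = X[g == 1].mean(axis=0) - X[g == 0].mean(axis=0)
v /= np.linalg.norm(v)

# Correlation between projections onto v and the group attribute.
proj = X @ v
print(f"correlation with group attribute: {np.corrcoef(proj, g)[0, 1]:.3f}")

# 2) Neutralize the principal bias direction by removing the component along v.
X_neutral = X - np.outer(X @ v, v)

# 3) Downstream classifier accuracy before and after neutralization.
clf = LogisticRegression(max_iter=1000)
acc_before = clf.fit(X[:800], y[:800]).score(X[800:], y[800:])
acc_after = clf.fit(X_neutral[:800], y[:800]).score(X_neutral[800:], y[800:])
print(f"accuracy before: {acc_before:.3f}, after neutralization: {acc_after:.3f}")
```

As a simple illustration consistent with the theorem's statement (though not its formal content), for a linear classifier with weights w, projecting out a unit direction v changes the logit on input x by -(v·x)(w·v), which stays small whenever w is nearly orthogonal to the removed axis.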
External IDs: dblp:conf/flairs/AminNJ25