Bias Spillover in Language Models: A Review of Political Alignment, Regional Fragility, and Multi-Axis Risks
Abstract: Efforts to mitigate social bias in large language models (LLMs) often target dimensions such as gender or political ideology in isolation. Yet interventions along one axis frequently propagate to others, a phenomenon we term "bias spillover". This paper reviews over 80 studies, synthesizing empirical and theoretical evidence of cross-axis interference in model behavior. We define bias spillover as the unintended alteration of behavior on one social axis when mitigating another, driven by representational entanglement, competing fine-tuning objectives, and structural fairness trade-offs. These effects align with well-known optimization pathologies such as Goodhart's Law, reward hacking, task interference, and impossibility results in algorithmic fairness, highlighting spillover as a fundamental, not incidental, challenge. We document observed spillover cases (for instance, political fine-tuning shifting emotional tone and moral framing, or gender balancing distorting age distributions) and identify blind spots in current audits, including poor coverage of multi-axis and non-Western contexts. We conclude by introducing a typology of auditing frameworks and recommending mitigation strategies that explicitly account for entangled social representations, moving beyond isolated fairness metrics toward spillover-aware evaluation of LLMs.
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=GWFWh1arNg
Changes Since Last Submission: Revision based on reviewer feedback.
Assigned Action Editor: ~Binhang_Yuan1
Submission Number: 5210