No Free Lunch in Language Model Bias Mitigation? Targeted Bias Reduction Can Exacerbate Unmitigated LLM Biases

Shireen Chand, Faith Baca, Emilio Ferrara

Published: 13 Jan 2026 · Last Modified: 21 Jan 2026 · License: CC BY-SA 4.0
Abstract: Large Language Models (LLMs) inherit societal biases from their training data, potentially leading to harmful outputs. While various techniques aim to mitigate these biases, their effects are typically evaluated only along the targeted dimension, leaving cross-dimensional consequences unexplored. This work provides the first systematic quantification of cross-category spillover effects in LLM bias mitigation. We evaluate four bias mitigation techniques (Logit Steering, Activation Patching, BiasEdit, Prompt Debiasing) across ten models from seven families, measuring their impact on race-, religion-, profession-, and gender-related biases using the StereoSet benchmark. Across 160 experiments yielding 640 evaluations, we find that targeted interventions cause collateral degradation of model coherence and of performance on debiasing objectives in 31.5% of untargeted-dimension evaluations. These findings provide empirical evidence that debiasing improvements along one dimension can come at the cost of degradation in others. We introduce a multi-dimensional auditing framework and show that single-target evaluations mask potentially severe spillover effects, underscoring the need for robust, multi-dimensional evaluation tools when examining and developing bias mitigation strategies, so that interventions do not inadvertently shift or worsen bias along untargeted axes.
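To make the multi-dimensional auditing idea concrete, the sketch below scores a single targeted intervention on all four StereoSet dimensions rather than only the targeted one, and flags spillover wherever an untargeted dimension's bias worsens. The `intervene` and `score` callables, the dimension names, and the spillover tolerance are hypothetical placeholders, not the paper's actual implementation; it is a minimal sketch, assuming StereoSet's stereotype score where 50 is ideal and distance from 50 measures bias.

```python
# Minimal sketch of a multi-dimensional bias audit. `intervene` and `score`
# are hypothetical placeholders, not the paper's code.

DIMENSIONS = ["race", "religion", "profession", "gender"]

def audit_intervention(model, intervene, score, target: str, tol: float = 0.0):
    """Apply an intervention targeting one dimension, then score every dimension.

    intervene(model, target) -> debiased model (e.g., logit steering, BiasEdit)
    score(model, dim)        -> StereoSet stereotype score for `dim`
                                (assumed convention: 50 is ideal, so
                                |score - 50| measures residual bias)
    Returns per-dimension changes in |score - 50|, flagging spillover where an
    untargeted dimension's bias worsens by more than `tol`.
    """
    before = {d: abs(score(model, d) - 50.0) for d in DIMENSIONS}
    debiased = intervene(model, target)
    after = {d: abs(score(debiased, d) - 50.0) for d in DIMENSIONS}

    report = {}
    for d in DIMENSIONS:
        delta = after[d] - before[d]  # positive delta = bias got worse
        report[d] = {"delta": delta, "spillover": d != target and delta > tol}
    return report
```

Under these assumptions, looping such an audit over every (technique, model, target dimension) triple reproduces the study's design: 4 techniques × 10 models × 4 targets gives the 160 experiments, each scored on all 4 dimensions for the 640 evaluations reported above.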