BEYOND SINGLE-AXIS FAIRNESS: LEARNING TO DETECT INTERSECTIONAL BIASES

ICLR 2026 Conference Submission 18857 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: intersectional bias, retriever, actor-critic, bias detection, bias mitigation
Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes domains, yet they often inherit intersectional biases: prejudices that emerge not from a single axis such as race or gender, but from their intersections (e.g., “Black women are too aggressive for leadership”). Existing bias detection and mitigation methods predominantly address single-axis biases and fail to generalize to their complex interactions. In this paper, we present the first unified framework for detecting and mitigating intersectional bias. We construct two paragraph-level intersectional bias datasets, \texttt{Indic-Intersect} and \texttt{Western-Intersect}, aligned with Indian and Western sociocultural contexts, respectively. For detection, we introduce \textbf{\textit{BiasRetriever}}, a contrastively trained retriever that learns a bias-aware embedding space by pulling biased text close to canonical stereotypes and pushing it away from unbiased or unrelated examples. BiasRetriever achieves up to a $10\%$ higher Jaccard score than LLM-based classifiers on unseen intersectional categories and maintains robust cross-domain generalization.
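The abstract does not specify BiasRetriever's exact training objective, but the described behavior (pulling biased text toward canonical stereotypes and pushing it away from unbiased or unrelated examples) matches a standard contrastive setup. Below is a minimal sketch of one plausible InfoNCE-style loss; the function name, argument shapes, and temperature value are all illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_bias_loss(biased_emb, stereotype_emb, negative_emb, temperature=0.07):
    """Hypothetical InfoNCE-style objective for a bias-aware embedding space.

    biased_emb:     (B, d)    embeddings of biased paragraphs (anchors)
    stereotype_emb: (B, d)    embeddings of matching canonical stereotypes (positives)
    negative_emb:   (B, K, d) embeddings of unbiased or unrelated texts (negatives)
    """
    # Cosine similarity via L2-normalized embeddings.
    biased_emb = F.normalize(biased_emb, dim=-1)
    stereotype_emb = F.normalize(stereotype_emb, dim=-1)
    negative_emb = F.normalize(negative_emb, dim=-1)

    # Anchor-positive similarity: (B,)
    pos_sim = (biased_emb * stereotype_emb).sum(dim=-1)

    # Anchor-negative similarities: (B, K)
    neg_sim = torch.einsum("bd,bkd->bk", biased_emb, negative_emb)

    # Softmax over [positive | negatives]; the correct class is index 0,
    # so minimizing cross-entropy pulls anchors toward stereotypes and
    # pushes them away from the negatives.
    logits = torch.cat([pos_sim.unsqueeze(1), neg_sim], dim=1) / temperature
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
```

Under this reading, retrieval at inference time reduces to nearest-neighbor search in the learned space: a paragraph is flagged by the stereotypes it embeds closest to, which is consistent with evaluating detection via Jaccard overlap between predicted and gold bias categories.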
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 18857