Wait, Am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG
Keywords: fairness, stereotype, deductive stereotype
TL;DR: We characterize LLMs' deductive stereotyping and propose Fair-GCG to mitigate it.
Abstract: Warning: This paper contains several toxic and offensive statements.
While reasoning generally improves fairness in recent large language models (LLMs), failures persist. In this work, we identify a dominant failure mode, deductive stereotyping, in which models apply population-level statistical regularities to individual cases, producing logically coherent yet socially biased inferences. We provide a statistical interpretation of this phenomenon. To steer models toward fairness-aware reasoning, we propose a reasoning-time injection framework. We further introduce Fair-GCG to systematically discover effective injection phrases. Injection phrases discovered by Fair-GCG improve performance across multiple fairness benchmarks, generalize from smaller to larger LLMs, and transfer to real-world fairness-sensitive tasks.
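The abstract describes Fair-GCG as a systematic search for effective injection phrases. The toy sketch below illustrates the general shape of a GCG-style greedy coordinate search: it is a simplification under stated assumptions, not the paper's method. The vocabulary, the `fairness_score` objective, and the loop structure are all hypothetical stand-ins; the actual Fair-GCG objective would score candidates against an LLM rather than a hand-written function.

```python
# Toy sketch of a GCG-style greedy coordinate search for an injection
# phrase. Everything here is illustrative: the vocabulary, the scoring
# function, and the loop are assumptions, not the paper's Fair-GCG
# objective (which would query a model, not a hand-written heuristic).

VOCAB = ["fairly", "individual", "evidence", "assume", "consider",
         "stereotype", "avoid", "each", "case", "only"]

def fairness_score(phrase):
    """Stand-in objective: counts tokens we (arbitrarily) deem
    fairness-promoting. A real objective would score model outputs."""
    good = {"fairly", "individual", "evidence", "avoid", "each", "case"}
    return sum(tok in good for tok in phrase)

def greedy_coordinate_search(init_phrase, steps=20):
    phrase = list(init_phrase)
    for _ in range(steps):
        improved = False
        # Sweep over positions; at each, try every candidate token
        # and keep the best-scoring swap.
        for i in range(len(phrase)):
            best_tok, best = phrase[i], fairness_score(phrase)
            for tok in VOCAB:
                cand = phrase[:i] + [tok] + phrase[i + 1:]
                if fairness_score(cand) > best:
                    best_tok, best = tok, fairness_score(cand)
            if best_tok != phrase[i]:
                phrase[i] = best_tok
                improved = True
        if not improved:  # converged: no swap improves the score
            break
    return phrase

init = ["assume", "stereotype", "only", "consider"]
result = greedy_coordinate_search(init)
```

The coordinate structure (one token position updated at a time, scored against a fixed objective) is the key idea the sketch preserves; a gradient-guided variant would restrict the candidate tokens per position using token-embedding gradients instead of enumerating the full vocabulary.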
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 1