The Selective Safety Trap: How LLM Scaling and Alignment Fail to Generalize Across Minority Demographics
Keywords: Interpretability, Hate Speech, LLM
TL;DR: We show that scaling up models makes their safety against generating hate speech less equitable across minority groups.
Abstract: We challenge the prevailing assumption that the safety alignment of large language models (LLMs) generalizes as a semantic capability across protected groups. Through a controlled adversarial stress test with 44,000 prompts balanced across 16 demographic groups and two languages, we demonstrate that current models rely on selective memorization rather than universal principles. We identify a rigid "two-tiered" safety hierarchy: models robustly protect groups with high visibility in US discourse (e.g., LGBTQIA+) while systematically neglecting other marginalized communities (e.g., people with disabilities), with defense rates varying by up to 33\% for identical attack vectors. Crucially, we report an Inverse Scaling of Equity: contrary to standard scaling laws, increasing model parameters exacerbates these disparities, linearly increasing the variance in safety performance between groups. These findings suggest that current alignment techniques incentivize overfitting to dominant safety priors, with scaling acting as a bias magnifier rather than a path to robustness.
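To make the disparity metrics concrete, the following is a minimal sketch (not the authors' code) of how per-group defense rates, the worst-case gap between groups, and the cross-group variance could be computed from adversarial prompt outcomes; the record layout and group labels here are hypothetical.

```python
from statistics import pvariance

# Each record: (demographic group targeted, whether the model defended / refused the attack)
results = [
    ("lgbtqia", True), ("lgbtqia", True), ("lgbtqia", False),
    ("disability", True), ("disability", False), ("disability", False),
]

def defense_rates(records):
    """Fraction of adversarial prompts the model defended against, per group."""
    totals, defended = {}, {}
    for group, ok in records:
        totals[group] = totals.get(group, 0) + 1
        defended[group] = defended.get(group, 0) + int(ok)
    return {g: defended[g] / totals[g] for g in totals}

rates = defense_rates(results)
gap = max(rates.values()) - min(rates.values())   # worst-case disparity (the paper reports up to 33%)
spread = pvariance(rates.values())                # cross-group variance, tracked against model scale
print(rates, gap, spread)
```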
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 79