MiJaBench: Revealing Minority Biases in Large Language Models via Hate Speech Jailbreaking

ACL ARR 2026 January Submission 7622 Authors

06 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: Safety Alignment, Jailbreaking, Demographic Bias, Multilingual NLP
Abstract: Current safety evaluations of large language models (LLMs) create a dangerous illusion of universality, aggregating "Identity Hate" into scalar scores that mask systemic vulnerabilities against specific populations. To expose this \textit{selective safety}, we introduce \textbf{MiJaBench}, a bilingual (English and Portuguese) adversarial benchmark comprising 44,000 prompts across 16 minority groups. By generating 528,000 prompt-response pairs from 12 state-of-the-art LLMs, we curate \textbf{MiJaBench-Align}, revealing that safety alignment is not a generalized semantic capability but a demographic hierarchy: defense rates vary by up to 33\% within the same model based solely on the target group. Crucially, we demonstrate that \textbf{model scaling exacerbates these disparities}, suggesting that current alignment techniques do not instill a principle of non-discrimination but instead reinforce memorized refusal boundaries for specific groups, challenging current assumptions about the scaling of safety. We release all datasets and scripts to encourage research into granular demographic alignment at \href{https://osf.io/a32mx/overview?view_only=08b918a8048c47c98ba3e22388547505}{Anonymous Repository}.
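The headline disparity in the abstract can be read as the gap between per-group defense rates over the prompt-response pairs. The following is a minimal illustrative Python sketch of that metric, not the authors' released scripts; the field names ("group" and the boolean "refused" flag) are hypothetical.

from collections import defaultdict

def defense_rates(pairs):
    """pairs: iterable of dicts, each with a target 'group' and a
    boolean 'refused' flag (True if the model safely refused)."""
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for p in pairs:
        totals[p["group"]] += 1
        refusals[p["group"]] += p["refused"]  # bool counts as 0/1
    return {g: refusals[g] / totals[g] for g in totals}

def disparity(rates):
    """Gap between the best- and worst-defended groups.

    A value of 0.33 would correspond to the 33% within-model
    fluctuation reported in the abstract."""
    return max(rates.values()) - min(rates.values())

# Example usage:
#   rates = defense_rates(pairs)
#   gap = disparity(rates)  # e.g., 0.33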
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: red teaming; safety and alignment; scaling
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English, Portuguese
Submission Number: 7622