Abstract: Social biases reflected in language are inherently shaped by cultural norms, which vary significantly across regions, leading to diverse manifestations of stereotypes. However, social bias evaluation for large language models (LLMs) in non-English contexts often relies on translations of English benchmarks that fail to reflect Japanese cultural norms. In this work, we introduce JUBAKU (Japanese cUlture adversarial BiAs benchmarK Under handcrafted creation), an adversarially constructed benchmark tailored to Japanese cultural contexts and covering ten distinct cultural categories. Unlike existing benchmarks, JUBAKU features dialogue scenarios hand-crafted by Japanese annotators and designed to trigger and expose latent social biases in Japanese LLMs. We evaluated nine Japanese LLMs on JUBAKU and on three existing benchmarks adapted from English. All models clearly exhibited biases on JUBAKU, performing below the random baseline of 50% with an average accuracy of 23% (ranging from 13% to 33%), despite achieving higher accuracy on the other benchmarks. Human annotators achieved 91% accuracy in identifying unbiased responses, confirming JUBAKU’s reliability and its adversarial nature toward LLMs. These results highlight the value of our adversarial data design for uncovering latent social biases in LLMs that are not captured by existing benchmarks.
Paper Type: Short
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingual benchmarks, multilingual evaluation, model bias/fairness evaluation, stereotype, social bias
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: Japanese
Submission Number: 7866