Abstract: Social biases reflected in language are inherently shaped by cultural norms, which vary significantly across regions, leading to diverse manifestations of stereotypes. However, social bias evaluation for large language models (LLMs) in non-English contexts often relies on translations of English benchmarks that fail to reflect Japanese cultural norms. In this work, we introduce JUBAKU (Japanese cUlture adversarial BiAs benchmarK Under handcrafted creation), an adversarially constructed benchmark tailored to Japanese cultural contexts and covering ten distinct cultural categories. Unlike existing benchmarks, JUBAKU features dialogue scenarios hand-crafted by Japanese annotators and designed to trigger and expose latent social biases in Japanese LLMs. We evaluated nine Japanese LLMs on JUBAKU and on three existing benchmarks adapted from English. All models clearly exhibited biases on JUBAKU, performing below the random baseline of 50% with an average accuracy of 23% (ranging from 13% to 33%), despite achieving higher accuracy on the other benchmarks. Human annotators achieved 91% accuracy in identifying unbiased responses, confirming JUBAKU’s reliability and its adversarial nature toward LLMs. These results highlight the value of our adversarial data design for uncovering latent social biases in LLMs that are not captured by existing benchmarks.
Paper Type: Short
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingual benchmarks, multilingual evaluation, model bias/fairness evaluation, stereotype, social bias
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: Japanese
Submission Number: 7866