Lost in Localization: Building RabakBench with Human-in-the-Loop Validation to Expose Multilingual Safety Gaps
Keywords: AI Safety, Benchmark, Multilingual, Multicultural, Low-Resource
TL;DR: RabakBench is a new localized multilingual safety benchmark (Singlish, Chinese, Malay, Tamil) for Singapore, presented as a case study of a scalable benchmark-creation methodology for low-resource languages.
Abstract: Large language models (LLMs) and their safety classifiers often perform poorly on low-resource languages due to limited training data and evaluation benchmarks. This paper introduces RabakBench, a new multilingual safety benchmark localized to Singapore's unique linguistic context, covering Singlish, Chinese, Malay, and Tamil. RabakBench is constructed through a scalable three-stage pipeline: (i) Generate - adversarial example generation by augmenting real Singlish web content with LLM-driven red teaming; (ii) Label - semi-automated multi-label safety annotation using majority-voted LLM labelers aligned with human judgments; and (iii) Translate - high-fidelity translation preserving linguistic nuance and toxicity across languages. The final dataset comprises over 5,000 safety-labeled examples across four languages and six fine-grained safety categories with severity levels. Critically, while leveraging LLMs for scalability, the pipeline incorporates rigorous human oversight at every stage, with Cohen's kappa scores of 0.68-0.72 demonstrating strong human-model agreement. Evaluations of 11 popular open-source and closed-source guardrail classifiers reveal significant performance degradation. RabakBench not only enables robust safety evaluation in Southeast Asian multilingual settings but also offers a reproducible framework for building localized safety datasets in low-resource environments. The benchmark dataset, including human-verified translations, and the evaluation code are publicly available.
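To make the Label stage concrete, the sketch below illustrates the general idea of majority-voting multiple LLM labelers and then checking agreement against human annotations with Cohen's kappa, as the abstract describes. This is a minimal illustration, not the authors' code: the three-labeler setup, the binary safe/unsafe labels, and the toy data are all assumptions for demonstration.

```python
# Minimal sketch of the Label-stage idea (hypothetical data, not the
# paper's implementation): aggregate votes from several LLM labelers by
# majority vote, then measure agreement with human labels.
from collections import Counter
from sklearn.metrics import cohen_kappa_score

def majority_vote(votes):
    """Return the most common label among one example's annotator votes."""
    return Counter(votes).most_common(1)[0][0]

# Each row: votes from three hypothetical LLM labelers for one example
# (1 = unsafe, 0 = safe).
llm_votes = [
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]
human_labels = [1, 0, 1, 0]  # hypothetical human ground-truth labels

majority_labels = [majority_vote(v) for v in llm_votes]

# Cohen's kappa quantifies agreement beyond chance; the paper reports
# 0.68-0.72 between its majority-voted LLM labels and human judgments.
kappa = cohen_kappa_score(human_labels, majority_labels)
print(f"Majority labels: {majority_labels}, Cohen's kappa: {kappa:.2f}")
```

In practice, the paper applies this labeling per safety category with severity levels; a multi-label setup would repeat this agreement check per category rather than on a single binary label as shown here.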
Submission Number: 6