BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages

Junho Myung; Nayeon Lee; Yi Zhou; Jiho Jin; Rifki Afina Putri; Dimosthenis Antypas; Hsuvas Borkakoty; Eunsu Kim; Carla Perez-Almendros; Abinew Ali Ayele; Victor Gutierrez Basulto; Yazmin Ibanez-Garcia; Hwaran Lee; Shamsuddeen Hassan Muhammad; Kiwoong Park; Anar Sabuhi Rzayev; Nina White; Seid Muhie Yimam; Mohammad Taher Pilehvar; Nedjma Ousidhoum; Jose Camacho-Collados; Alice Oh

Published: 26 Sept 2024, Last Modified: 16 Jan 2025 · NeurIPS 2024 Track Datasets and Benchmarks Poster · CC BY-SA 4.0

Keywords: cross-culture, multilingual, benchmark, cultural nlp

TL;DR: BLEnD is a benchmark with 52.6k question-answer pairs across 16 countries and 13 languages to evaluate LLMs' everyday cultural knowledge, highlighting performance variations based on online cultural presence and language resource levels.

Abstract: Large language models (LLMs) often lack culture-specific everyday knowledge, especially across diverse regions and non-English languages. Existing benchmarks for evaluating LLMs' cultural sensitivities are usually limited to a single language or online sources like Wikipedia, which may not reflect the daily habits, customs, and lifestyles of different regions. That is, information about the food people eat for their birthday celebrations, spices they typically use, musical instruments youngsters play, or the sports they practice in school is not always explicitly written online. To address this issue, we introduce BLEnD, a hand-crafted benchmark designed to evaluate LLMs' everyday knowledge across diverse cultures and languages. The benchmark comprises 52.6k question-answer pairs from 16 countries/regions, in 13 different languages, including low-resource ones such as Amharic, Assamese, Azerbaijani, Hausa, and Sundanese. We evaluate LLMs in two formats: short-answer questions and multiple-choice questions. We show that LLMs perform better in cultures that are more present online, with a maximum difference of 57.34% for GPT-4, the best-performing model, in the short-answer format. Furthermore, we find that LLMs perform better in their local languages for mid-to-high-resource languages. Interestingly, for languages deemed to be low-resource, LLMs provide better answers in English. We make our dataset publicly available at: https://github.com/nlee0212/BLEnD.

Supplementary Material: pdf

Flagged For Ethics Review: true

Submission Number: 2398
