Keywords: evaluation, LLMs, multilingual, language models
TL;DR: We present MultiLoKo, a new benchmark for evaluating multilinguality in LLMs covering 31 languages.
Abstract: We present MultiLoKo, a new benchmark for evaluating multilinguality in LLMs covering 31 languages. MultiLoKo consists of three partitions: a \texttt{main} partition consisting of 500 questions per language, separately sourced to be locally relevant to the specific language, and two \texttt{translated} partitions, containing human-authored translations between the 30 non-English languages and English, in both directions. For comparison, we also release corresponding machine-authored translations. The data is equally distributed over two splits: a \texttt{development} split and a blind, out-of-distribution \texttt{test} split. MultiLoKo can be used to study a variety of questions regarding the multilinguality of LLMs, as well as meta-questions about multilingual benchmark creation. We compute MultiLoKo scores for 11 base and chat models marketed as multilingual and study their average performance, their performance parity across languages, how much their ability to answer questions depends on the question language, and which languages are most difficult. \emph{None of the models we studied performs well on MultiLoKo}, as indicated by low average scores as well as large differences between the best- and worst-scoring languages. Furthermore, we find \emph{a substantial effect of the question language, indicating suboptimal knowledge transfer between languages}. Lastly, we find that using locally sourced vs English-translated data can result in differences \emph{of more than 20 points for the best-performing models, drastically changing the estimated difficulty of some languages}. When machine translations are used instead of human translations, we find a weaker effect on the ordering of language difficulty, a larger difference in model rankings, and a substantial drop in estimated performance for all models.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/facebook/multiloko
Code URL: https://github.com/facebookresearch/multiloko
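For readers who want to inspect the data, below is a minimal loading sketch in Python. The configuration name ("main"), split name ("dev"), and the "language" field used here are assumptions inferred from the partition description in the abstract, not confirmed identifiers; the dataset card at the URL above documents the actual layout.

```python
# Minimal sketch for exploring MultiLoKo via the Hugging Face `datasets` library.
# NOTE: the subset name ("main"), split name ("dev"), and the "language" field
# are assumptions based on the abstract; consult the dataset card at
# https://huggingface.co/datasets/facebook/multiloko for the actual names.
from collections import Counter

from datasets import load_dataset

dataset = load_dataset("facebook/multiloko", "main", split="dev")

# Peek at one example and count how many questions each language contributes
# (the abstract states 500 locally sourced questions per language).
print(dataset[0])
print(Counter(example["language"] for example in dataset))
```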
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 587