BEIR-PL: Zero Shot Information Retrieval Benchmark for the Polish Language

Konrad Wojtasik, Kacper Wolowiec, Vadim Shishkin, Arkadiusz Janz, Maciej Piasecki

Published: 2024, Last Modified: 26 May 2026, LREC/COLING 2024, CC BY-SA 4.0

Abstract: The BEIR dataset is a large, heterogeneous benchmark for Information Retrieval (IR) that has garnered considerable attention within the research community. However, BEIR and analogous datasets are predominantly restricted to the English language. Our objective is to establish extensive large-scale resources for IR in the Polish language, thereby advancing research in this NLP area. In this work, inspired by the mMARCO and Mr. TyDi datasets, we translated all accessible open IR datasets into Polish and introduced BEIR-PL, a new benchmark comprising 13 datasets that facilitates further development, training, and evaluation of modern Polish language models for IR tasks. We evaluated and compared numerous IR models on the newly introduced BEIR-PL benchmark. Furthermore, we publish pre-trained open IR models for the Polish language, marking a pioneering development in this field. BEIR-PL is included in the MTEB benchmark and is also available, together with the trained models, at https://huggingface.co/clarin-knext.