PL-MTEB: Polish Massive Text Embedding Benchmark

ACL ARR 2026 January Submission10882 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, Readers: Everyone, License: CC BY 4.0

Keywords: text embedding, Polish language, text classification, clustering, pair classification, information retrieval, semantic textual similarity

Abstract: In this paper, we introduce the Polish Massive Text Embedding Benchmark (PL-MTEB), a comprehensive benchmark for text embeddings in Polish. PL-MTEB consists of 30 diverse NLP tasks from 5 task types: classification, clustering, pair classification, information retrieval, and semantic textual similarity. As part of our work, we verified the quality of the datasets available for the Polish language and prepared two new datasets, which were used to create four clustering tasks. We evaluated 30 publicly available text embedding models, including Polish and multilingual models. The prepared datasets, the evaluation source code, and the obtained results are publicly available at [anonymized_link].

Paper Type: Long

Research Area: Resources and Evaluation

Research Area Keywords: text embedding

Contribution Types: Model analysis & interpretability, Data resources

Languages Studied: Polish

Submission Number: 10882
