Keywords: text embedding, Polish language, text classification, clustering, pair classification, information retrieval, semantic textual similarity
Abstract: In this paper, we introduce the Polish Massive Text Embedding Benchmark (PL-MTEB), a comprehensive benchmark for text embeddings in Polish. PL-MTEB consists of 30 diverse NLP tasks from 5 task types: classification, clustering, pair classification, information retrieval, and semantic textual similarity. As part of our work, we verified the quality of the datasets available for the Polish language and prepared two new datasets, which were used to create four clustering tasks. We evaluated 30 publicly available text embedding models, including Polish and multilingual models. The prepared datasets, the source code for evaluation, and the obtained results are publicly available at [anonymized\_link].
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: text embedding
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: Polish
Submission Number: 10882