The PLLuM Instruction Corpus

Piotr Pezik, Filip Zarnecki, Konrad Kaczynski, Anna Cichosz, Zuzanna Deckert, Monika Garnys, Izabela Grabarczyk, Wojciech Janowski, Sylwia Karasinska, Aleksandra Kujawiak, Piotr Misztela, Maria Szymanska, Karolina Walkusz, Igor Siek, Maciej Chrabaszcz, Anna Kolos, Agnieszka Karlinska, Karolina Seweryn, Aleksandra Krasnodebska, Paula Betscher et al. (33 additional authors not shown)

Published: 2025, Last Modified: 07 May 2026, CoRR 2025, CC BY-SA 4.0

Abstract: This paper describes the instruction dataset used to fine-tune a set of transformer-based large language models (LLMs) developed in the PLLuM (Polish Large Language Model) project. We present a functional typology of the organic, converted, and synthetic instructions used in PLLuM and share some observations about the implications of using human-authored versus synthetic instruction datasets in the linguistic adaptation of base LLMs. Additionally, we release the first representative subset of the PLLuM instruction corpus (PLLuMIC), which we believe to be useful in guiding and planning the development of similar datasets for other LLMs.

External IDs: dblp:journals/corr/abs-2511-17161