Czert - Czech BERT-like Model for Language Representation

Published: 01 Jan 2021 (Last Modified: 17 Jul 2025), RANLP 2021, CC BY-SA 4.0
Abstract: This paper describes the training process of the first Czech monolingual language representation models based on the BERT and ALBERT architectures. We pre-train our models on more than 340K sentences, which is 50 times more than the multilingual models that include Czech data. We outperform the multilingual models on 9 out of 11 datasets. In addition, we establish new state-of-the-art results on nine datasets. Finally, we discuss the properties of monolingual and multilingual models based on our results. We publish all the pre-trained and fine-tuned models freely for the research community.