Introducing huBERT

Dávid Márk Nemeskey

Published: 28 Jan 2021, Last Modified: 22 Feb 2024 | Magyar Számítógépes Nyelvészeti Konferencia (MSZNY 2021) | Everyone | CC BY-SA 4.0

Abstract: This paper introduces the huBERT family of models. The flagship is the eponymous BERT Base model, trained on the new Hungarian Webcorpus 2.0, a 9-billion-token corpus of Web text collected from the Common Crawl. This model outperforms multilingual BERT in masked language modeling by a wide margin and achieves state-of-the-art performance in named entity recognition and NP chunking. The models are freely downloadable.