Abstract: Large Pre-trained Language Models (PLMs) have become ubiquitous in the development of language understanding technology and lie at the heart of many artificial intelligence advances. While advances reported for English using PLMs are unprecedented, reported advances using PLMs for Hebrew are few and far between. The problem is twofold. First, Hebrew resources for training large language models have not been of the same magnitude as their English counterparts. Second, most benchmarks available to evaluate progress in Hebrew NLP require morphological boundaries, which are not readily available in the output of PLMs. In this work we aim to remedy both aspects. We present AlephBERT, a large PLM for Modern Hebrew, trained on a larger vocabulary and a larger dataset than any Hebrew PLM before. Moreover, we introduce a novel neural architecture that recovers the morphological segments encoded in contextualized embeddings. Based on this new morphological component, we offer an evaluation suite consisting of multiple tasks and benchmarks that cover both word-level and sub-word-level analyses. On all tasks, AlephBERT obtains state-of-the-art results beyond all existing Hebrew models. We make AlephBERT, the morphological extraction model, and the Hebrew evaluation suite publicly available.
Paper Type: short