AlephBERT: Pre-training and End-to-End Language Models Evaluation from Sub-Word to Sentence Level

Anonymous

17 Jul 2021 (modified: 05 May 2023), ACL ARR 2021 July Blind Submission
Abstract: Large Pre-trained Language Models (PLMs) have become ubiquitous in the development of language understanding technology and lie at the heart of many artificial intelligence advances. While the advances reported for English using PLMs are unprecedented, reported advances using PLMs for Hebrew are few and far between. The problem is twofold. First, Hebrew resources for training large language models are not of the same order of magnitude as their English counterparts. Second, there are no accepted tasks and benchmarks for evaluating the progress of Hebrew PLMs, in particular on sub-word (morphological) tasks. We aim to remedy both aspects. We present AlephBERT, a large PLM for Modern Hebrew, trained on a larger vocabulary and a larger dataset than any Hebrew PLM before it. Moreover, we introduce a novel language-agnostic architecture that extracts all of the sub-word morphological segments encoded in contextualized word embedding vectors. Utilizing this new morphological component, we offer a new PLM evaluation pipeline consisting of multiple Hebrew tasks and benchmarks that cover word-level, sub-word-level, and sentence-level tasks. With AlephBERT we achieve state-of-the-art results compared with contemporary baselines. We make our AlephBERT model and evaluation pipeline publicly available, providing a single point of entry for evaluating and comparing Hebrew PLMs.
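The abstract states that the released model feeds contextualized word embeddings into a morphological segmentation component. The sketch below is an illustration only, not the authors' released code: it assumes the HuggingFace model identifier onlplab/alephbert-base and a simple per-token linear boundary classifier, neither of which is specified in the abstract.

```python
# Minimal sketch (assumptions noted above): load a Hebrew BERT-style encoder
# and attach a hypothetical head that predicts sub-word segment boundaries
# from each word's contextualized embedding.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "onlplab/alephbert-base"  # assumed identifier, for illustration

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)


class SegmentationHead(nn.Module):
    """Hypothetical head: one boundary/non-boundary label per encoder position."""

    def __init__(self, hidden_size: int, num_labels: int = 2):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.classifier(hidden_states)


sentence = "הילד הלך לבית הספר"  # "The boy went to school"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, hidden_size)

head = SegmentationHead(encoder.config.hidden_size)
boundary_logits = head(hidden)  # (1, seq_len, 2); untrained, for shape illustration
```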