AlephBERT: Language Model Pre-training and Evaluation from Sub-Word to Sentence Level

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission · Readers: Everyone
Abstract: Large Pre-trained Language Models (PLMs) have become ubiquitous in the development of language understanding technology and lie at the heart of many artificial intelligence advances. While the advances reported for English using PLMs are unprecedented, reported advances using PLMs for Hebrew are few and far between. The problem is twofold. First, Hebrew resources for training large language models are not of the same magnitude as their English counterparts. Second, there are no accepted benchmarks for evaluating the progress of Hebrew PLMs, in particular on sub-word (morphological) tasks. We aim to remedy both aspects. We present AlephBERT, a large PLM for Modern Hebrew, trained on a larger vocabulary and a larger dataset than any Hebrew PLM before it. Moreover, we introduce a novel language-agnostic architecture that can recover all of the sub-word morphological segments encoded in contextualized word embedding vectors. Based on this new morphological component we offer a new PLM evaluation suite consisting of multiple tasks and benchmarks that cover sentence-level, word-level, and sub-word-level analyses. On all tasks, AlephBERT obtains state-of-the-art results beyond contemporary Hebrew baselines. We make our AlephBERT model, the morphological extraction model, and the Hebrew evaluation suite publicly available, providing a single point of entry for assessing Hebrew PLMs.
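To make the idea of recovering sub-word morphological segments from contextualized word vectors concrete, here is a minimal, hypothetical sketch (not the paper's actual architecture): a small recurrent decoder that reads a single per-word embedding and emits a fixed number of segment-label predictions. The class name `SegmentDecoder`, the dimensions, and the label inventory size are all illustrative assumptions.

```python
# Hypothetical illustration only: a decoder head that maps one contextualized
# word vector to a short sequence of sub-word (morphological) segment labels.
import torch
import torch.nn as nn

class SegmentDecoder(nn.Module):
    def __init__(self, word_dim=768, hidden_dim=256, num_labels=50, max_segments=6):
        super().__init__()
        self.bridge = nn.Linear(word_dim, hidden_dim)   # project word vector to decoder state
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)   # one recurrent step per segment slot
        self.out = nn.Linear(hidden_dim, num_labels)    # segment label logits per step
        self.max_segments = max_segments

    def forward(self, word_vec):                        # word_vec: (batch, word_dim)
        h = torch.tanh(self.bridge(word_vec))           # initial decoder state
        x = torch.zeros_like(h)                         # fixed start-of-decoding input
        logits = []
        for _ in range(self.max_segments):
            h = self.gru(x, h)
            logits.append(self.out(h))
            x = h                                       # feed previous state as next input
        return torch.stack(logits, dim=1)               # (batch, max_segments, num_labels)

# Usage sketch: feed per-word contextual vectors from a Hebrew PLM (e.g. 768-dim)
# and train with cross-entropy against gold morphological segment labels.
dummy = torch.randn(4, 768)
print(SegmentDecoder()(dummy).shape)                    # torch.Size([4, 6, 50])
```

This is only one plausible way such a morphological extraction component could be wired; the paper's language-agnostic architecture may differ in how segments are decoded and scored.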
