Training a Bilingual Language Model by Aligning Tokens onto a Shared Character Space

Anonymous

17 Apr 2023 · ACL ARR 2023 April Blind Submission · Readers: Everyone
Abstract: In this study, we train a bilingual Arabic-Hebrew language model using a transliterated version of the Arabic texts, so that both languages are represented in the same script. Given the morphological and structural similarities between Arabic and Hebrew and their large number of cognates, we evaluate the performance of a language model that uses the same script for both languages on downstream tasks that require cross-lingual knowledge, such as machine translation. The results are promising: our model outperforms all other PLMs on machine translation and outperforms other multilingual models on sentiment analysis in both languages.
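The abstract describes projecting Arabic text onto the Hebrew character space via transliteration before training. The Python sketch below is only a minimal illustration of that idea under stated assumptions: the paper's actual transliteration scheme is not given here, and the letter-to-letter table and the `transliterate` helper are a simplified, hypothetical correspondence that ignores, for example, Hebrew final letter forms and Arabic letters without a one-to-one counterpart.

```python
# -*- coding: utf-8 -*-
"""Illustrative Arabic-to-Hebrew transliteration (hypothetical mapping).

The actual scheme used in the paper is not described in the abstract; this
partial letter-to-letter table only sketches the idea of projecting Arabic
text onto the Hebrew character space before tokenization.
"""

# Partial mapping of Arabic letters to their closest Hebrew counterparts
# (assumed correspondences; Hebrew final forms such as ך/ם/ן are not handled).
ARABIC_TO_HEBREW = {
    "ا": "א", "ب": "ב", "ج": "ג", "د": "ד", "ه": "ה",
    "و": "ו", "ز": "ז", "ح": "ח", "ط": "ט", "ي": "י",
    "ك": "כ", "ل": "ל", "م": "מ", "ن": "נ", "س": "ס",
    "ع": "ע", "ف": "פ", "ص": "צ", "ق": "ק", "ر": "ר",
    "ش": "ש", "ت": "ת",
}

_TABLE = str.maketrans(ARABIC_TO_HEBREW)


def transliterate(text: str) -> str:
    """Map Arabic characters to Hebrew script; other characters pass through."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    # Arabic "كتب" (kataba, "he wrote"); its Hebrew cognate is spelled כתב.
    print(transliterate("كتب"))  # -> כתב
```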
Paper Type: short
Research Area: Multilinguality and Language Diversity