Feature-rich Open-vocabulary Interpretable Neural Representations for All of the World’s 7000 Languages

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission · Readers: Everyone
Abstract: Modern NLP research is firmly predicated on two assumptions: that very large corpora are available, and that the word, rather than the morpheme, is the primary meaning-bearing unit of language. For the vast majority of the world's languages, these assumptions fail to hold, and as a result existing state-of-the-art neural representations such as BERT fail to meet the needs of thousands of languages. In this paper, we present a novel general-purpose neural representation using Tensor Product Representations that is designed from the ground up to be both linguistically interpretable and fully capable of handling the variety found in the world's 7000 languages, regardless of corpus size or morphological characteristics. We demonstrate the applicability of our representation through examples drawn from a typologically diverse set of languages whose morphology includes prefixes, suffixes, infixes, circumfixes, templatic morphemes, derivational morphemes, inflectional morphemes, and reduplication.
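The mechanism the abstract invokes, the Tensor Product Representation (Smolensky, 1990), binds content vectors ("fillers") to structural position vectors ("roles") via outer products and superposes the results, which is what makes the encoding decomposable and hence interpretable. The sketch below illustrates the generic filler/role construction on a toy morphological analysis of English "unbreakable"; the segmentation, role inventory, dimensionality, and random embeddings are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy analysis of "unbreakable": a derivational prefix, a root, and a
# derivational suffix. Segmentation and role inventory are illustrative.
morphemes = ["un", "break", "able"]
role_names = ["prefix", "root", "suffix"]

dim = 8  # filler/role dimensionality; an arbitrary choice for this sketch

# Random vectors stand in for learned morpheme (filler) embeddings.
fillers = {m: rng.normal(size=dim) for m in morphemes}

# Orthonormal role vectors (rows of a random orthogonal matrix) make
# exact unbinding possible.
Q = np.linalg.qr(rng.normal(size=(dim, dim)))[0]
roles = {name: Q[i] for i, name in enumerate(role_names)}

# Bind each filler to its role with an outer product, then superpose:
#   T = sum_i  f_i (outer) r_i
T = sum(np.outer(fillers[m], roles[r]) for m, r in zip(morphemes, role_names))

# Unbind: with orthonormal roles, T @ r_i recovers filler f_i exactly,
# so each morpheme can be read back out of the composite representation.
recovered = T @ roles["root"]
print(np.allclose(recovered, fillers["break"]))  # True
```

Because the composite is a sum of recoverable filler/role bindings rather than an opaque contextual vector, a single representation can in principle accommodate prefixes, suffixes, infixes, circumfixes, or templatic slots by choosing an appropriate role inventory, which is the property the abstract appeals to.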