QMorphVec: A Morphologically-Aware Embedding of Quranic Vocabulary

NeurIPS 2024 Workshop MusIML, Submission 32

Published: 30 Nov 2024, Last Modified: 01 Dec 2024
Venue: MusIML Poster
License: CC BY 4.0
Keywords: Morphologically-Aware Embedding, Quranic Embedding
TL;DR: QMorphVec
Abstract: Developing effective word representations that incorporate linguistic features and capture contextual information is an essential step in natural language processing (NLP) tasks. When working with a text corpus from a specific domain with profound meanings, such as the Holy Quran, deriving word representations based on domain-specific textual contexts is particularly valuable. In this research, we employ a context-masking approach to generate separate embedding spaces for Quranic roots, lemmas, and surface forms, and then project them into a common space through linear mapping. We demonstrate that our in-domain embeddings, trained solely on Quranic text and its morphological contexts, perform comparably to, and in some cases better than, OpenAI's large embeddings, while surpassing the multilingual XLM-R embeddings. Additionally, through qualitative analysis, we illustrate their utility in Quranic word analogy tasks. The code and the embeddings are available at: [anonymized for the double-blind review].
Submission Number: 32
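
The abstract mentions projecting separate embedding spaces into a common space through a linear mapping, without stating how the mapping is fitted. As a minimal sketch only, one common way to obtain such a map is ordinary least squares over paired vectors. Everything below is an assumption for illustration: the function name `fit_linear_map`, the toy dimensions, the random data, and the idea that each source vector has a known counterpart in the target space. It is not the authors' procedure.

```python
import numpy as np


def fit_linear_map(source, target):
    """Return W minimizing ||source @ W - target|| in the least-squares sense.

    source: (n_items, d_source) array, e.g. hypothetical root embeddings
    target: (n_items, d_target) array, their counterparts in the common space
    """
    W, *_ = np.linalg.lstsq(source, target, rcond=None)
    return W


# Hypothetical toy data: 50 paired items, an 8-dim source space,
# and a 6-dim common space related by an exact linear map.
rng = np.random.default_rng(0)
root_vecs = rng.normal(size=(50, 8))
true_map = rng.normal(size=(8, 6))
common_vecs = root_vecs @ true_map

W = fit_linear_map(root_vecs, common_vecs)
projected = root_vecs @ W  # source vectors expressed in the common space
max_error = float(np.abs(projected - common_vecs).max())
```

On this synthetic data the target really is a linear function of the source, so the fitted map reproduces it up to floating-point error. With real embeddings the fit would only be approximate, and the choice of paired items would matter.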