Abstract: A shared lexicon can reveal genealogical relationships between languages in a linguistic area. However, widespread cross-linguistic borrowing has increasingly blurred traditional phylogenetic distinctions based on lexical similarity, which distorts the perception of language clusters that rests on prior diachronic knowledge. To better understand language clusters at a synchronic level, including the influence of borrowings, this study investigates the relatedness of 9 Indic languages by leveraging the lexical knowledge of pre-trained language models: mBERT, XLM-RoBERTa, IndicNLP, and MuRIL. We extract embeddings of cognate reflexes from the CogNet dataset for the selected languages. By performing hierarchical agglomerative clustering on the embedding-based cosine similarity scores of language pairs, we identify language clusters that reflect contemporary language groupings while accounting for the impact of borrowings. We also assess how well word-embedding-based lexical similarity aligns with string-similarity-based genealogical clustering and with the actual phylogenetic groupings. The results demonstrate that cognates play a crucial role in extracting phylogenetic signals, even when using pre-trained language models.
Paper Type: Short
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingualism; language contact; less-resourced languages
Contribution Types: Approaches to low-resource settings, Data analysis
Languages Studied: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Punjabi, Tamil, Telugu
Submission Number: 2155
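Illustration: the pipeline outlined in the abstract (cosine similarity between cognate-reflex embeddings, followed by hierarchical agglomerative clustering of languages) might be sketched as below. This is a minimal sketch, not the authors' implementation. The toy two-dimensional vectors, the cognate-set labels `c1` and `c2`, the helper names, the averaging of cosine scores over shared cognate sets, and the choice of average linkage are all assumptions made for illustration; the abstract does not state these details. In practice the vectors would come from a pre-trained model such as mBERT or MuRIL.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def language_similarity(emb_a, emb_b):
    """Mean cosine similarity over the cognate sets both languages share (assumed aggregation)."""
    shared = sorted(set(emb_a) & set(emb_b))
    return sum(cosine(emb_a[c], emb_b[c]) for c in shared) / len(shared)


def agglomerate(langs, sim):
    """Average-linkage agglomerative clustering; returns the merge history."""
    def linkage(pair):
        a, b = pair
        total = sum(sim[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    clusters = [(lang,) for lang in langs]
    merges = []
    while len(clusters) > 1:
        # Merge the two clusters with the highest average pairwise similarity.
        a, b = max(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical toy embeddings: language -> cognate-set id -> vector.
toy = {
    "Hindi":   {"c1": [1.0, 0.1], "c2": [0.9, 0.2]},
    "Marathi": {"c1": [0.9, 0.2], "c2": [1.0, 0.1]},
    "Tamil":   {"c1": [0.1, 1.0], "c2": [0.2, 0.9]},
    "Telugu":  {"c1": [0.2, 0.9], "c2": [0.1, 1.0]},
}

sim = {
    frozenset((a, b)): language_similarity(toy[a], toy[b])
    for a, b in combinations(toy, 2)
}
merges = agglomerate(list(toy), sim)

for left, right, score in merges:
    print(left, "+", right, f"(average similarity {score:.3f})")
```

With these invented vectors, the two most similar pairs merge first and the cross-group merge comes last; real embeddings would not necessarily separate so cleanly, particularly where borrowing is heavy.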