Cognate and Contact-Induced Transfer Learning for Hamshentsnag: A Low-Resource and Endangered Language

NAACL 2025 Workshop LM4UC, Submission 13

Published: 04 Mar 2025, Last Modified: 22 Mar 2025, Venue: LM4UC, Readers: Everyone, License: CC BY 4.0
Keywords: cross-lingual transfer, sequence tagging, endangered languages, language contact, Hamshentsnag
TL;DR: Not only closely related languages but also contact languages can help build better NLP tools for endangered languages, as shown by POS tagging and NER experiments on Hamshentsnag.
Abstract: This study investigates zero-shot and few-shot cross-lingual transfer in Part-of-Speech (POS) tagging and Named Entity Recognition (NER) for Hamshentsnag, an endangered Western Armenian dialect. We examine how different source languages, namely Western Armenian (contact cognate), Eastern Armenian (ancestral cognate), Turkish (substrate or contact-induced), and English (non-cognate), affect task performance using multilingual BERT (mBERT) and BERTurk. Results show that cognate varieties improved POS tagging by 8% F1, while the substrate source enhanced NER by 15% F1. BERTurk outperformed mBERT on NER but not on POS tagging. We attribute this to task-specific advantages of different source languages. For source languages written in non-Latin scripts, we also applied script conversion and phonetic alignment with the target, which facilitated transfer.
Archival: Archival Track
Participation: Virtual
Presenter: Onur Keleş and Baran Günay
Submission Number: 13
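
Illustrative sketch: The abstract mentions script conversion and phonetic alignment for non-Latin source scripts but does not specify the scheme. The following Python sketch shows one way such a step could look, assuming a simple character-level mapping. The table `ARMENIAN_TO_LATIN`, the override table `WESTERN_OVERRIDES`, and the function `to_latin` are hypothetical names introduced here; the mapping is partial and simplified, and is not the authors' actual conversion.

```python
# Hypothetical, partial Armenian-to-Latin table using simplified
# Eastern Armenian sound values. Not the scheme used in the paper.
ARMENIAN_TO_LATIN = {
    "ա": "a", "բ": "b", "գ": "g", "դ": "d", "ե": "e", "զ": "z",
    "ի": "i", "լ": "l", "կ": "k", "հ": "h", "մ": "m", "յ": "y",
    "ն": "n", "ո": "o", "պ": "p", "ս": "s", "տ": "t", "ր": "r",
}

# Assumed "phonetic alignment" step: Western Armenian pronounces these
# stops with the opposite voicing from Eastern Armenian, so a
# Western-oriented conversion swaps them.
WESTERN_OVERRIDES = {
    "բ": "p", "գ": "k", "դ": "t",
    "պ": "b", "կ": "g", "տ": "d",
}


def to_latin(text: str, western: bool = False) -> str:
    """Lowercase the text and map each known Armenian letter to Latin.

    Characters missing from the table (punctuation, digits, unmapped
    letters) are passed through unchanged.
    """
    table = dict(ARMENIAN_TO_LATIN)
    if western:
        table.update(WESTERN_OVERRIDES)
    return "".join(table.get(ch, ch) for ch in text.lower())


if __name__ == "__main__":
    print(to_latin("Հայաստան"))
    print(to_latin("բան"))
    print(to_latin("բան", western=True))
```

A character-level table keeps the sketch short. A realistic converter would also need to handle digraphs and position-dependent pronunciations, which this version ignores.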