Subword-based Cross-lingual Transfer of Embeddings from Hindi to Marathi

Anonymous

16 Oct 2021 (modified: 05 May 2023) · ACL ARR 2021 October Blind Submission
Abstract: Word embeddings are a crucial resource for NLP in any language. This work focuses on the transfer of static subword embeddings between Indian languages, from a relatively higher-resource language to a genealogically related low-resource language. We use Hindi-Marathi as our language pair, simulating a low-resource scenario for Marathi. We demonstrate the consistent benefits of unsupervised morphemic segmentation on both the source and target sides over the subword treatment used by FastText. We show that a trivial "copy-and-paste" embedding transfer, even with a perfect bilingual lexicon, is inadequate for capturing language-specific relationships. Our best-performing method uses an EM-style algorithm to learn bilingual subword embeddings; the resulting embeddings are evaluated on the publicly available Marathi Word Similarity task as well as WordNet-based Synonymy Tests. We find that our approach significantly outperforms the FastText baseline on both tasks; on the former, its performance approaches that of pretrained FastText Marathi embeddings trained on two orders of magnitude more Marathi data.
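As an illustration only (not the authors' code), the "copy-and-paste" baseline mentioned in the abstract can be sketched as follows: given pretrained source-side (Hindi) subword vectors and a bilingual subword lexicon, each target-side (Marathi) subword simply inherits the vector of its lexicon translation. All names here (e.g. hindi_vectors, bilingual_lexicon) are hypothetical toy inputs.

```python
import numpy as np

# Hypothetical toy inputs: pretrained Hindi subword vectors and a
# Hindi -> Marathi subword lexicon. In practice these would come from
# trained subword embeddings and an induced (or gold) bilingual lexicon.
hindi_vectors = {
    "ghar": np.array([0.1, 0.3, 0.5]),
    "pani": np.array([0.7, 0.2, 0.1]),
}
bilingual_lexicon = {"ghar": "ghar", "pani": "paani"}


def copy_and_paste_transfer(src_vectors, lexicon):
    """Naive transfer: each target subword copies the vector of its
    lexicon translation; source subwords without vectors are skipped."""
    tgt_vectors = {}
    for src_subword, tgt_subword in lexicon.items():
        if src_subword in src_vectors:
            tgt_vectors[tgt_subword] = src_vectors[src_subword].copy()
    return tgt_vectors


marathi_vectors = copy_and_paste_transfer(hindi_vectors, bilingual_lexicon)
print(marathi_vectors["paani"])  # inherits the Hindi "pani" vector unchanged
```

As the abstract notes, even with a perfect lexicon such a direct copy cannot encode Marathi-specific distributional relationships, which is what motivates the EM-style joint learning of bilingual subword embeddings.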
