Learning Beyond Limits: Multitask Learning and Synthetic Data for Low-Resource Canonical Morpheme Segmentation

ACL ARR 2025 February Submission2561 Authors

14 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: We introduce a transformer-based morpheme segmentation system that augments a low-resource training signal through multitask learning and synthetic data generated by large language models (LLMs). Our framework jointly predicts morphological segments and glosses from orthographic input, leveraging shared linguistic representations obtained through a common documentary process to enhance model generalization. To further address data scarcity, we integrate LLM-generated synthetic training data produced via in-context learning. Experimental results on the SIGMORPHON 2023 dataset show that our approach significantly improves word-level segmentation accuracy and morpheme-level F1-score across multiple low-resource languages.
Paper Type: Short
Research Area: Phonology, Morphology and Word Segmentation
Research Area Keywords: morphological segmentation
Contribution Types: Approaches to low-resource settings
Languages Studied: Arapaho, Gitksan, Lezgi, Natugu, Tsez, Nyangbo, Uspanteko
Submission Number: 2561
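
The abstract reports two metrics, word-level segmentation accuracy and morpheme-level F1-score, without defining them on this page. The sketch below shows one common way such metrics are computed, and it may differ from the authors' exact evaluation. It assumes segmentations are strings with morphemes joined by `-`, that word accuracy is exact match on the whole segmented word, and that F1 is micro-averaged over a per-word multiset overlap of morphemes. The function names `word_accuracy` and `morpheme_f1` and the English example words are hypothetical and are not taken from the SIGMORPHON 2023 data.

```python
from collections import Counter


def word_accuracy(gold, pred):
    """Fraction of words whose predicted segmentation matches gold exactly."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)


def morpheme_f1(gold, pred, sep="-"):
    """Micro-averaged F1 over morphemes (assumed definition, see lead-in).

    For each word, morphemes are compared as multisets, so order inside
    a word is ignored but repeated morphemes are counted.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    true_pos = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        g_counts = Counter(g.split(sep))
        p_counts = Counter(p.split(sep))
        # Multiset intersection keeps the minimum count of each morpheme.
        true_pos += sum((g_counts & p_counts).values())
        n_gold += sum(g_counts.values())
        n_pred += sum(p_counts.values())
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical canonical segmentations (English, for illustration only).
gold = ["un-happy-ness", "cat-s"]
pred = ["un-happy-ness", "cats"]

acc = word_accuracy(gold, pred)
f1 = morpheme_f1(gold, pred)
print(f"word accuracy = {acc:.3f}, morpheme F1 = {f1:.3f}")
```

Under these assumptions, a word that is left unsegmented counts as fully wrong for word accuracy but only lowers F1 in proportion to the morphemes it misses, which is why the two metrics are usually reported together.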