Morpheme Induction for Emergent Language

ACL ARR 2025 May Submission4180 Authors

19 May 2025 (modified: 03 Jul 2025) | ACL ARR 2025 May Submission | Readers: Everyone | CC BY 4.0
Abstract: We introduce CSAR, an algorithm for inducing morphemes from emergent language corpora of parallel utterances and meanings. It is a greedy algorithm which (1) weights candidate morphemes (form-meaning pairs) by the mutual information between forms and meanings, (2) selects the highest-weighted pair, (3) removes it from the corpus, and (4) repeats the process to induce further morphemes (hence the name: Count, Select, Ablate, Repeat). We first validate CSAR on procedurally generated datasets and compare it against baselines for related tasks. Second, we validate CSAR's performance on human language data to show that the algorithm makes reasonable predictions in adjacent domains. Finally, we analyze a handful of emergent languages, quantifying linguistic characteristics such as the degree of synonymy and polysemy.
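The four steps named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `induce_morphemes`, the restriction to single form tokens paired with single meaning atoms, the weight (joint probability times pointwise mutual information), and the stopping rule are all assumptions made for illustration. The abstract only states that weighting is based on mutual information between forms and meanings.

```python
import math
from collections import Counter


def induce_morphemes(corpus, max_morphemes=10):
    """Greedy Count-Select-Ablate-Repeat loop (simplified illustration).

    corpus: list of (form_tokens, meaning_atoms) pairs.
    Returns (form_token, meaning_atom) pairs in the order they were selected.
    """
    # Work on copies so the caller's corpus is left untouched.
    records = [(list(form), set(meaning)) for form, meaning in corpus]
    n = len(records)
    morphemes = []

    for _ in range(max_morphemes):
        # Count: per-record occurrences of forms, meanings, and co-occurrences.
        form_counts, meaning_counts, pair_counts = Counter(), Counter(), Counter()
        for form, meaning in records:
            forms = set(form)
            form_counts.update(forms)
            meaning_counts.update(meaning)
            pair_counts.update((f, m) for f in forms for m in meaning)
        if not pair_counts:
            break

        def weight(pair):
            # Assumed weight: p(f, m) * log(p(f, m) / (p(f) * p(m))).
            f, m = pair
            joint = pair_counts[pair] / n
            pmi = math.log(pair_counts[pair] * n / (form_counts[f] * meaning_counts[m]))
            return joint * pmi

        # Select: highest-weighted pair; sorting makes tie-breaking deterministic.
        best = max(sorted(pair_counts), key=weight)
        if weight(best) <= 0:
            break  # assumed stopping rule: no remaining positive association
        morphemes.append(best)

        # Ablate: remove one occurrence of the form and the meaning atom
        # from every record in which they co-occur.
        f, m = best
        for form, meaning in records:
            if f in form and m in meaning:
                form.remove(f)
                meaning.discard(m)

    return morphemes


# Hypothetical toy corpus: two-token utterances describing colour and shape.
corpus = [
    (["ba", "ki"], {"red", "circle"}),
    (["ba", "lo"], {"red", "square"}),
    (["zu", "ki"], {"blue", "circle"}),
    (["zu", "lo"], {"blue", "square"}),
]
print(induce_morphemes(corpus))
```

On this toy corpus the sketch recovers the four intended pairings (`ba`/red, `ki`/circle, `lo`/square, `zu`/blue). The ablation step is what lets later iterations see a cleaner corpus: once `ba` and `red` are removed, the remaining tokens in those utterances co-occur only with the meanings still unexplained.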
Paper Type: Long
Research Area: Phonology, Morphology and Word Segmentation
Research Area Keywords: morphological segmentation
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models
Languages Studied: English, German, emergent languages, synthetic languages
Submission Number: 4180