CharSS: Character-Level Transformer Model for Sanskrit Word Segmentation

ACL ARR 2024 June Submission5095 Authors

16 Jun 2024 (modified: 07 Aug 2024)ACL ARR 2024 June SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Abstract: Subword tokens in Indian languages inherently carry meaning, and isolating them can enhance NLP tasks, making sub-word segmentation a crucial process. Segmenting Sanskrit and other Indian languages into subtokens is not straightforward, as it may include sandhi, which may lead to changes in the word boundaries. We propose a new approach of utilizing a CharSS: Character-level Transformer model for Sanskrit Word Segmentation. We perform experiments on three benchmark datasets to compare the performance of our method against existing methods. On the UoH+SandhiKosh dataset, our method outperforms the current state-of-the-art system by an absolute gain of 6.72 points in split prediction accuracy. On the hackathon dataset our method achieves a gain of 2.27 points over the current SOTA system in terms of perfect match metric. We also propose a use-case of Sanskrit-based segments for a linguistically informed translation of technical terms to lexically similar low-resource Indian languages. In two separate experimental settings for this task, we achieve an average improvement of 8.46 and 6.79 chrF++ scores, respectively.
Paper Type: Short
Research Area: Phonology, Morphology and Word Segmentation
Research Area Keywords: Sanskrit Word segmentation, sandhivicchēda, Machine Translation
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Data analysis
Languages Studied: Sanskrit, Hindi, Marathi, Gujarati, Tamil, Kannada, Odia
Submission Number: 5095
Loading