Keywords: Tokenization, Tamil, Sandhi, Morphophonemic alternations, Byte Pair Encoding (BPE), Grapheme Pair Encoding (GPE), Subword modeling, NLP, Indic languages
TL;DR: Agathiyam is a Sandhi-aware Tamil tokenization framework that outperforms BPE and GPE by capturing morphophonemic boundaries, reducing OOVs, improving perplexity, and yielding compact, semantically rich tokens for Indic NLP.
Abstract: Tokenization is a foundational step in Natural Language Processing (NLP); however, prevailing methodologies such as Byte Pair Encoding (BPE) and Grapheme Pair Encoding (GPE) exhibit notable limitations when applied to morphologically rich and agglutinative languages, including Tamil. These methods often produce excessive segmentation of lexical units and inadequately capture Sandhi phenomena, wherein morphophonemic alternations occur at word junctures. To address these shortcomings, we propose \textbf{Agathiyam}, a Sandhi-aware tokenization framework tailored for Tamil. This framework integrates rule-based detection of Sandhi boundaries with grapheme-level subword modeling, thereby producing tokenizations that are both linguistically grounded and computationally efficient. Agathiyam is empirically evaluated on large-scale Tamil corpora, specifically leveraging the \textbf{Samanantar dataset}, and benchmarked against BPE and GPE baselines using standard tokenization metrics. Experimental findings reveal that Agathiyam achieves superior compression ratios, lower fertility scores, fewer out-of-vocabulary (OOV) instances, and reduced perplexity, thus yielding tokens that are compact yet semantically expressive. By embedding Sandhi-awareness within the tokenization pipeline, Agathiyam establishes a robust alignment between linguistic structure and subword representation, offering a scalable framework for advancing tokenization in Tamil and, by extension, other Indic languages with comparable morphophonemic complexity.
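The evaluation metrics named in the abstract (fertility, compression ratio, OOV rate) are standard and easy to state precisely. The sketch below shows one common way they are computed; the `toy_tokenize` splitter is a hypothetical stand-in, not the Agathiyam tokenizer, and the exact metric definitions used in the paper may differ.

```python
# Sketch of standard tokenization metrics; the tokenizer here is a toy
# stand-in (fixed 2-character chunks), NOT the Agathiyam implementation.

def fertility(words, tokenize):
    """Average number of subword tokens produced per word (lower is better)."""
    counts = [len(tokenize(w)) for w in words]
    return sum(counts) / len(counts)

def compression_ratio(words, tokenize):
    """Characters in the corpus divided by total tokens emitted (higher is better)."""
    total_chars = sum(len(w) for w in words)
    total_tokens = sum(len(tokenize(w)) for w in words)
    return total_chars / total_tokens

def oov_rate(words, tokenize, vocab):
    """Fraction of emitted tokens that fall outside the vocabulary."""
    tokens = [t for w in words for t in tokenize(w)]
    return sum(t not in vocab for t in tokens) / len(tokens)

# Hypothetical tokenizer: split each word into 2-character chunks.
toy_tokenize = lambda w: [w[i:i + 2] for i in range(0, len(w), 2)]

words = ["tokenization", "sandhi", "tamil"]
print(fertility(words, toy_tokenize))          # → 4.0 tokens per word
print(round(compression_ratio(words, toy_tokenize), 2))  # → 1.92 chars per token
```

A Sandhi-aware tokenizer aims to push fertility down and compression ratio up by splitting at morphophonemic boundaries instead of arbitrary byte or grapheme merges.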
Primary Area: generative models
Submission Number: 23254