Abstract: Tokenization plays a critical role in multilingual NLP, yet current tokenizers often exhibit biases toward very high-resource languages. Despite the linguistic diversity and morphological richness of Indian languages, there is little systematic analysis of tokenizer behaviour for them. This work presents a comprehensive intrinsic evaluation of tokenizer performance across 17 Indian languages. We quantify the trade-offs between bottom-up and top-down tokenization algorithms (BPE and Unigram LM) and the effects of vocabulary size, and we compare multilingual vocabulary construction strategies such as joint and cluster-based training. We also show that extremely low-resource languages can benefit from tokenizers trained on related high-resource languages. Our study provides practical insights for building fairer, more efficient, and linguistically informed tokenizers for multilingual NLP.
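(A minimal sketch of the kind of intrinsic comparison the abstract describes, using the sentencepiece library to train BPE and Unigram LM tokenizers on the same corpus and compare their fertility, i.e. subword tokens per word. The corpus paths, vocabulary size, and choice of fertility as the metric are illustrative assumptions, not the paper's exact setup.)

```python
import sentencepiece as spm

# Train a BPE and a Unigram LM tokenizer on the same (hypothetical) corpus.
for algo in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="hi_corpus.txt",       # assumed Hindi training file
        model_prefix=f"hi_{algo}",
        model_type=algo,
        vocab_size=32000,            # illustrative vocabulary size
    )

def fertility(model_path: str, lines: list[str]) -> float:
    """Average number of subword tokens per whitespace-delimited word."""
    sp = spm.SentencePieceProcessor(model_file=model_path)
    n_tokens = sum(len(sp.encode(line)) for line in lines)
    n_words = sum(len(line.split()) for line in lines)
    return n_tokens / max(n_words, 1)

# Compare the two algorithms on held-out text (file name is assumed).
held_out = open("hi_heldout.txt", encoding="utf-8").read().splitlines()
for algo in ("bpe", "unigram"):
    print(algo, round(fertility(f"hi_{algo}.model", held_out), 3))
```

Lower fertility generally means fewer subword splits per word, which is one common intrinsic signal of tokenizer quality for morphologically rich languages.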
Paper Type: Short
Research Area: Phonology, Morphology and Word Segmentation
Research Area Keywords: Tokenization, morphology, multilinguality, Indian languages, intrinsic evaluation
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Position papers
Languages Studied: Hindi, Marathi, Sanskrit, Maithili, Bengali, Assamese, Malayalam, Tamil, Telugu, Kannada, Nepali, Urdu, Sindhi
Submission Number: 7412