Abstract: Tokenization plays a critical role in multilingual NLP, yet current tokenizers often exhibit biases toward very high-resource languages. Despite the linguistic diversity and morphological richness of Indian languages, there is little systematic analysis of tokenizer behaviour for them. This work presents a comprehensive intrinsic evaluation of tokenizer performance across 17 Indian languages. We quantify the trade-offs between bottom-up and top-down tokenization algorithms (BPE and Unigram LM) and the effects of vocabulary size, and we compare multilingual vocabulary construction strategies such as joint and cluster-based training. We also show that extremely low-resource languages can benefit from tokenizers trained on related high-resource languages. Our study provides practical insights for building fairer, more efficient, and linguistically informed tokenizers for multilingual NLP.
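(A minimal sketch of the kind of intrinsic comparison the abstract describes, using the sentencepiece library to train BPE and Unigram LM tokenizers on the same corpus and compare their fertility, i.e. subword tokens per word. The corpus paths, vocabulary size, and choice of fertility as the metric are illustrative assumptions, not the paper's exact setup.)

```python
import sentencepiece as spm

# Train a BPE and a Unigram LM tokenizer on the same (hypothetical) corpus.
for algo in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="hi_corpus.txt",       # assumed Hindi training file
        model_prefix=f"hi_{algo}",
        model_type=algo,
        vocab_size=32000,            # illustrative vocabulary size
    )

def fertility(model_path: str, lines: list[str]) -> float:
    """Average number of subword tokens per whitespace-delimited word."""
    sp = spm.SentencePieceProcessor(model_file=model_path)
    n_tokens = sum(len(sp.encode(line)) for line in lines)
    n_words = sum(len(line.split()) for line in lines)
    return n_tokens / max(n_words, 1)

# Compare the two algorithms on held-out text (file name is assumed).
held_out = open("hi_heldout.txt", encoding="utf-8").read().splitlines()
for algo in ("bpe", "unigram"):
    print(algo, round(fertility(f"hi_{algo}.model", held_out), 3))
```

Lower fertility generally means fewer subword splits per word, which is one common intrinsic signal of tokenizer quality for morphologically rich languages.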
Paper Type: Short
Research Area: Phonology, Morphology and Word Segmentation
Research Area Keywords: Tokenization, morphology, multilinguality, Indian languages, intrinsic evaluation
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Position papers
Languages Studied: Hindi, Marathi, Sanskrit, Maithili, Bengali, Assamese, Malayalam, Tamil, Telugu, Kannada, Nepali, Urdu, Sindhi
Submission Number: 7412