Study of Tokenization Strategies for the Santhali Language

Published: 2024, Last Modified: 06 Nov 2025. SN Comput. Sci. 2024. License: CC BY-SA 4.0
Abstract: Santhali is one of the popular local languages, spoken mainly in the rural areas of Jharkhand, Odisha, and West Bengal in Bharat. It is an area-specific language that is popular mainly among the tribal population; hence, its technical aspects have not been explored, and applying natural language processing (NLP) methods to it is challenging. When translating Santhali into any other language, tokenization plays a vital role, as it is the first step of language processing. Tokenization reduces the sentences of a paragraph to atomic meaningful units, each of which can be a single word or a structure, and multiple methods can be used to tokenize a language. NLP-related tasks for the popular Indian languages are technologically well developed and rich in available resources. Santhali, by contrast, is a low-resource language that has seen little use of technology. The present work addresses the tokenization of the Santhali language by applying four different tokenization methods to it. It has been observed that their outputs differ, ranging from the smallest character categories to the most general form of words. Text in the Ol-Chiki script has been used for the present study, and a detailed comparison of the tokenization methods is presented in the manuscript. It has been noted that almost all Santhali characters are based on real-world symbols and that the writing system is alphabetic. The spaCy model performs better than the other tokenization approaches. We also validated the different tokenization methods on different paragraphs written in the Ol-Chiki script for the Santhali language.
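To make the granularity differences described in the abstract concrete, the following is a minimal sketch of three tokenization levels (whitespace, word plus punctuation, and character) for Ol-Chiki text. It is not the four methods evaluated in the paper, which the abstract does not name, and it does not use spaCy; it relies only on the Python standard library. The function names, the regular expression, and the sample string are assumptions chosen for illustration. The only external fact used is that Ol Chiki occupies the Unicode block U+1C50 to U+1C7F, with U+1C7E and U+1C7F as its punctuation marks.

```python
import re

# Ol Chiki block: U+1C50..U+1C7F. Digits, letters and modifier letters sit in
# U+1C50..U+1C7D; U+1C7E (mucaad) and U+1C7F (double mucaad) are punctuation.
OL_CHIKI_WORD = "\u1C50-\u1C7D"

# A token is either a run of Ol Chiki word characters or any single
# non-space character outside that range (for example, a punctuation mark).
TOKEN_RE = re.compile(f"[{OL_CHIKI_WORD}]+|[^\\s{OL_CHIKI_WORD}]")


def whitespace_tokens(text):
    """Coarsest level: split on whitespace; punctuation stays attached."""
    return text.split()


def word_tokens(text):
    """Word level: separate punctuation from Ol Chiki words."""
    return TOKEN_RE.findall(text)


def char_tokens(text):
    """Finest level: every non-space character is its own token."""
    return [ch for ch in text if not ch.isspace()]


# Illustrative sample (an assumption, not taken from the paper):
# two Ol Chiki words followed by the mucaad punctuation mark.
SAMPLE = (
    "\u1C65\u1C5F\u1C71\u1C5B\u1C5F\u1C72\u1C64 "
    "\u1C6F\u1C5F\u1C79\u1C68\u1C65\u1C64\u1C7E"
)

if __name__ == "__main__":
    for name, fn in [
        ("whitespace", whitespace_tokens),
        ("word", word_tokens),
        ("character", char_tokens),
    ]:
        tokens = fn(SAMPLE)
        print(f"{name:>10}: {len(tokens)} tokens -> {tokens}")
```

On the sample string the three functions return 2, 3, and 14 tokens respectively, which mirrors the abstract's observation that outputs range from individual characters to whole words. A library tokenizer such as spaCy's would add language-specific rules on top of this kind of splitting.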