Abugida Normalizer and Parser for Unicode texts

Published: 01 Jan 2023, Last Modified: 24 Feb 2025, CoRR 2023, CC BY-SA 4.0
Abstract: This paper proposes two libraries that address common and uncommon issues with Unicode-based writing schemes for Indic languages. The first is a normalizer that corrects inconsistencies caused by the encoding scheme (https://pypi.org/project/bnunicodenormalizer/). The second is a grapheme parser for Abugida text (https://pypi.org/project/indicparser/). Both tools are more efficient and effective than previously used tools. We report a 400% increase in speed and significantly better performance on different language-model-based downstream tasks.
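To make the two problems concrete, the following is a minimal sketch that uses only Python's standard `unicodedata` module. It is not the API of either library: the function names `normalize_text` and `split_graphemes` are hypothetical, and the rules are deliberately simplified assumptions (canonical NFC composition for normalization, and "attach marks and virama-joined consonants to the preceding base" for grapheme parsing). The libraries themselves may apply different and more extensive rules.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Illustrative normalization step (assumption: NFC only).

    The same visible Bengali vowel sign can be encoded either as one
    code point (U+09CB) or as two (U+09C7 followed by U+09BE). NFC
    composes the two-code-point form into the single code point.
    """
    return unicodedata.normalize("NFC", text)


def _is_mark(ch: str) -> bool:
    # Mn = non-spacing mark, Mc = spacing combining mark
    # (dependent vowel signs, anusvara, virama, and so on).
    return unicodedata.category(ch) in ("Mn", "Mc")


def _is_virama(ch: str) -> bool:
    # Unicode assigns canonical combining class 9 to virama characters.
    return unicodedata.combining(ch) == 9


def split_graphemes(text: str) -> list[str]:
    """Illustrative grapheme split for abugida text (simplified rules).

    A cluster is a base character plus any following marks. A letter
    that directly follows a virama is treated as part of a conjunct
    and joins the current cluster.
    """
    clusters: list[str] = []
    join_next = False  # True when the previous character was a virama
    for ch in text:
        is_letter = unicodedata.category(ch) == "Lo"
        if clusters and (_is_mark(ch) or (join_next and is_letter)):
            clusters[-1] += ch
        else:
            clusters.append(ch)
        join_next = _is_virama(ch)
    return clusters


if __name__ == "__main__":
    word = normalize_text("বাংলা")
    print(split_graphemes(word))
```

Splitting on code points alone would break a conjunct such as ক্ষ into three pieces; grouping by base, marks, and virama keeps it as one unit, which is the kind of segmentation a grapheme parser is meant to provide.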