InCa and InDia: Inline Casing and Diacritization Preprocessing For Robust-to-Noise Tokenization and Interpretability

ICML 2025 Workshop TokShop Submission 47 Authors

Published: 10 Jun 2025, Last Modified: 11 Jun 2025
License: CC BY 4.0
Archiving Submission: Yes (archival)
Keywords: Neural Machine Translation, subword segmentation, Unigram Language Model tokenization, inline casing
TL;DR: We introduce two inline approaches to tokenization preprocessing of casing (InCa) and diacritics (InDia) in text, making subword sequences shorter and more interpretable.
Abstract: We introduce two inline approaches to tokenization preprocessing of casing (InCa) and diacritics (InDia) in text. Their main component is an automatically created external dictionary that stores the most frequent casing or diacritization of each word, so that only non-frequent spellings need to be marked inline. We show that in a number of noising scenarios our casing algorithm achieves the best performance, and that in the cases where it performs on par with alternative solutions, the intrinsic parameters of the tokenizer trained on our data are more stable. As for inline diacritization, ours is, to our knowledge, the first solution of this type; we show that it improves robustness to de-diacritized text compared to tokenization without preprocessing. We share our preprocessing systems in a public GitHub repository.
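To make the mechanism concrete, below is a minimal Python sketch of the inline-casing idea as described in the abstract: a dictionary learned from a corpus records each word's most frequent casing, the text is lowercased, and only spellings that deviate from the frequent form receive an inline marker. The marker symbols, whitespace tokenization, and the three casing categories are illustrative assumptions, not the paper's actual InCa scheme, which (together with the analogous InDia scheme for diacritics) is available in the authors' repository.

from collections import Counter, defaultdict

# Illustrative sketch only: the markers and casing categories below are
# assumptions for exposition, not the paper's actual inline symbols.
LOWER, TITLE, UPPER = "⇩", "⇧", "⇪"  # hypothetical inline markers

def build_casing_dict(corpus_lines):
    """Map each lowercased word to its most frequent surface casing."""
    counts = defaultdict(Counter)
    for line in corpus_lines:
        for word in line.split():
            counts[word.lower()][word] += 1
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}

def encode(text, casing_dict):
    """Lowercase the text; prepend a marker only to non-frequent spellings."""
    out = []
    for word in text.split():
        low = word.lower()
        if word == casing_dict.get(low, low):
            out.append(low)                   # frequent casing: no marker needed
        elif word == low:
            out.append(LOWER + low)           # non-frequent lowercase spelling
        elif word == low.capitalize():
            out.append(TITLE + low)           # non-frequent title-cased spelling
        elif word == low.upper():
            out.append(UPPER + low)           # non-frequent all-caps spelling
        else:
            out.append(word)                  # mixed case: left verbatim here
    return " ".join(out)

def decode(text, casing_dict):
    """Invert encode() using the same dictionary."""
    out = []
    for word in text.split():
        if word.startswith(LOWER):
            out.append(word[len(LOWER):])
        elif word.startswith(TITLE):
            out.append(word[len(TITLE):].capitalize())
        elif word.startswith(UPPER):
            out.append(word[len(UPPER):].upper())
        else:
            out.append(casing_dict.get(word, word))
    return " ".join(out)

For example, with a dictionary learned from text where "Paris" is usually title-cased, encode("Paris met PARIS", d) yields "paris met ⇪paris", and decode() restores the original. InDia would work analogously, with the dictionary storing each word's most frequent diacritized form and markers flagging only non-frequent diacritizations.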
Submission Number: 47