From Bias to Balance: How Multilingual Dataset Composition Affects Tokenizer Performance Across Languages

Published: 24 Sept 2025, Last Modified: 26 Nov 2025 · NeurIPS 2025 LLM Evaluation Workshop Poster · CC BY 4.0
Keywords: multilingual tokenization, subword segmentation, tokenization bias, language fairness, Byte Pair Encoding, WordPiece, Unigram Language Model, balanced datasets, cross-lingual NLP, low-resource languages, computational equity, Part-of-Speech tagging, Named Entity Recognition, machine translation, subword fertility, normalized sequence length, multilingual BERT, tokenizer evaluation, linguistic diversity, morphologically rich languages
TL;DR: Balanced multilingual datasets reduce tokenizer bias, improving fairness and efficiency across languages, with tokenizer performance depending on both algorithm choice and vocabulary size.
Abstract: Tokenization is a crucial preprocessing step in multilingual language models, affecting performance in both high-resource and low-resource languages. However, current tokenizers tend to inherit language biases from unbalanced training datasets, leaving them poorly optimized for underrepresented languages. This research examines the impact of balanced multilingual datasets on the performance of tokenizers trained with the Byte Pair Encoding, WordPiece, and Unigram Language Model algorithms. We build balanced corpora from diverse sources and study the impact of vocabulary sizes of 15k, 30k, and 50k. The trained tokenizers are assessed through intrinsic metrics, including Subword Fertility and Normalized Sequence Length, as well as through extrinsic performance on downstream tasks such as Part-of-Speech tagging, Named Entity Recognition, and Machine Translation. We build custom datasets along with tailored evaluation pipelines to enable consistent comparisons across nine languages using models built into standard NLP frameworks. Our findings reinforce the importance of balanced datasets when training tokenizers and, in turn, advance the development of equitable and robust multilingual NLP systems.
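To make the evaluation setup concrete, below is a minimal sketch of how a tokenizer can be trained at a fixed vocabulary size and scored with the two intrinsic metrics named in the abstract. It assumes the standard definitions (Subword Fertility as average subword tokens per whitespace-delimited word; Normalized Sequence Length as tokenized length relative to a baseline tokenizer) and uses the Hugging Face `tokenizers` library; the file name `balanced_corpus.txt` and the sample sentences are placeholders, not the paper's actual data or pipeline.

```python
# Sketch: train a BPE tokenizer at a fixed vocabulary size and compute
# two intrinsic metrics (Subword Fertility, Normalized Sequence Length).
# Assumes the Hugging Face `tokenizers` library; WordPiece and Unigram
# tokenizers can be trained analogously with their respective trainers.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace


def train_bpe_tokenizer(files, vocab_size):
    """Train a BPE tokenizer on the given text files at a fixed vocabulary size."""
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tokenizer.train(files=files, trainer=trainer)
    return tokenizer


def subword_fertility(tokenizer, sentences):
    """Average number of subword tokens produced per whitespace-delimited word."""
    total_tokens = sum(len(tokenizer.encode(s).tokens) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words


def normalized_sequence_length(tokenizer, baseline, sentences):
    """Tokenized sequence length relative to a baseline tokenizer on the same text."""
    length = sum(len(tokenizer.encode(s).tokens) for s in sentences)
    baseline_length = sum(len(baseline.encode(s).tokens) for s in sentences)
    return length / baseline_length


if __name__ == "__main__":
    # "balanced_corpus.txt" stands in for the balanced multilingual corpus.
    sample = ["This is a test sentence.", "Tokenization matters."]
    for vocab_size in (15_000, 30_000, 50_000):
        tok = train_bpe_tokenizer(["balanced_corpus.txt"], vocab_size)
        print(vocab_size, subword_fertility(tok, sample))
```

Lower fertility and an NSL closer to 1.0 indicate that a language is segmented no more aggressively than the baseline, which is the sense in which a balanced training corpus reduces per-language tokenization bias.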
Submission Number: 103