Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

ACL ARR 2026 January Submission 10320 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Tokenization, Multilingual Tokenization
Abstract: Tokenization is the first---and often least scrutinized---step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently leave lower-resource languages with tokenizations that are disproportionately longer, morphologically implausible, or even riddled with <UNK> placeholders. This phenomenon ultimately amplifies computational and financial inequalities between users from different language backgrounds. To remedy this, we introduce Parity-aware Byte-Pair Encoding (BPE), a variant of the widely-used BPE algorithm. At every merge step, Parity-aware BPE maximizes the compression gain of the currently worst-compressed language, trading a small amount of global compression for cross-lingual parity. We find empirically that Parity-aware BPE leads to more equitable token counts across languages, with negligible impact on global compression rate and no substantial effect on LM performance in downstream tasks.
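
Since the abstract fully states the merge rule, a minimal Python sketch may help make it concrete. This is not the authors' released implementation: the word-list corpus format, the tokens-per-character measure of compression, and the function names parity_aware_bpe and merge_word are all illustrative assumptions.

```python
from collections import Counter

def merge_word(word, pair):
    """Replace every non-overlapping occurrence of `pair` in `word`
    (a tuple of symbols) with the concatenated symbol."""
    out, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

def parity_aware_bpe(corpora, num_merges):
    """corpora: dict mapping language code -> list of words (strings).
    Returns the list of learned merges, in order."""
    # Each word starts as a tuple of single-character symbols, counted per language.
    vocab = {lang: Counter(tuple(w) for w in words) for lang, words in corpora.items()}
    # Character totals are fixed, so tokens / characters measures compression
    # (a higher ratio means worse compression).
    chars = {lang: sum(len(w) for w in words) for lang, words in corpora.items()}
    merges = []

    def tokens(lang):
        return sum(len(word) * n for word, n in vocab[lang].items())

    for _ in range(num_merges):
        # 1. Find the currently worst-compressed language.
        worst = max(vocab, key=lambda lang: tokens(lang) / chars[lang])
        # 2. Count adjacent symbol pairs in that language only.
        pairs = Counter()
        for word, n in vocab[worst].items():
            for a, b in zip(word, word[1:]):
                pairs[a, b] += n
        if not pairs:
            break
        # 3. The most frequent pair there yields the largest compression gain
        #    for that language: each merged occurrence removes one token
        #    (counting overlaps; a refinement would count non-overlapping ones).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # 4. Apply the merge to every language, so the vocabulary stays shared.
        for lang in vocab:
            vocab[lang] = Counter({merge_word(w, best): n for w, n in vocab[lang].items()})
    return merges

# Toy usage: the merge choice is driven by whichever language currently lags.
corpora = {"en": ["the", "then", "they"], "sw": ["watu", "wengi", "wakubwa"]}
print(parity_aware_bpe(corpora, num_merges=5))
```

Note that although each merge is chosen by its gain on the lagging language, it is applied to every corpus, so all languages share a single vocabulary; this reflects the trade the abstract describes, giving up a little global compression to lift the worst-compressed language.
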
Paper Type: Long
Research Area: Multilinguality and Language Diversity
Research Area Keywords: Multilingualism, Multilingual pre-training
Languages Studied: "en", "es", "de", "fr", "ru", "it", "pt", "pl", "ja", "vi", "tr", "nl", "id", "ar", "cs", "fa", "el", "zh", "hi", "ko", "th", "iw", "bn", "ta", "ka", "mr", "te", "no", "az", "sv", "ro", "uk", "hu", "da", "fi", "bg", "sk", "ca", "ms", "ur", "be", "eu", "tg", "st", "yo", "sw", "et", "lv", "gl", "cy", "sq", "mk", "ml", "my", "gu", "af", "fil", "haw", "uz"
Submission Number: 10320