FLEXITOKENS: Flexible Tokenization for Evolving Multilingual Language Models

ICLR 2026 Conference Submission 10042 Authors

18 Sept 2025 (modified: 21 Nov 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Multilingual, Segmentation, Byte, Flexible, adapt
TL;DR: Existing tokenization methods are rigid and produce fixed token segments; we propose a new method that makes tokenization flexible and adaptive.
Abstract: Multilingual language models are challenging to adapt to new data distributions by simple finetuning due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to inefficient tokenization, causing over-fragmentation of out-of-distribution domains, unseen languages, or scripts. In this work, we develop byte-level LMs with learnable tokenizers to make tokenization adaptive. Our models include a submodule that learns to predict boundaries within the input byte sequence, encoding it into variable-length segments. Existing tokenizer-free methods train this boundary predictor using an auxiliary loss that enforces a fixed compression rate across the training corpus, introducing a new kind of rigidity. We propose FLEXITOKENS, a simplified training objective that enables significantly greater flexibility during adaptation. Evaluating across multiple multilingual benchmarks, morphologically diverse tasks, and domains, we demonstrate that FLEXITOKENS consistently reduces token over-fragmentation and achieves up to 10% improvements in downstream task performance compared to subword and other gradient-based tokenizers.
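To make the mechanism described in the abstract concrete, here is a minimal, hypothetical sketch of a learnable byte-level boundary predictor together with two illustrative auxiliary losses: a rigid penalty that ties the mean boundary rate to one fixed compression target, and a looser one-sided variant. The module names, dimensions, and both loss forms are assumptions for illustration only; the relaxed loss is not the actual FLEXITOKENS objective, which is defined in the paper itself.

```python
# Hypothetical sketch (not the authors' code): a byte-level boundary predictor
# and two illustrative auxiliary losses. All names and loss forms are assumed.
import torch
import torch.nn as nn


class BoundaryPredictor(nn.Module):
    """Scores each byte position; positions with prob > 0.5 start a new segment."""

    def __init__(self, d_model: int = 256, vocab_size: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.scorer = nn.Linear(d_model, 1)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids: (batch, seq_len) with values in [0, 255]
        h = self.embed(byte_ids)
        return torch.sigmoid(self.scorer(h)).squeeze(-1)  # boundary probabilities


def fixed_rate_loss(probs: torch.Tensor, target_rate: float = 0.25) -> torch.Tensor:
    """Rigid objective: force the mean boundary rate toward a single fixed value."""
    return (probs.mean() - target_rate) ** 2


def relaxed_rate_loss(probs: torch.Tensor, max_rate: float = 0.25) -> torch.Tensor:
    """Looser (hypothetical) objective: only penalize exceeding an upper bound,
    leaving the predictor free to place fewer boundaries when the data allows."""
    return torch.clamp(probs.mean() - max_rate, min=0.0) ** 2


if __name__ == "__main__":
    predictor = BoundaryPredictor()
    byte_ids = torch.randint(0, 256, (2, 64))
    probs = predictor(byte_ids)
    print(fixed_rate_loss(probs).item(), relaxed_rate_loss(probs).item())
```

Under this reading, the rigid loss illustrates the fixed-compression-rate constraint the abstract attributes to existing tokenizer-free methods, while the one-sided variant merely gestures at how a less restrictive objective could leave the boundary predictor room to adapt its segmentation to new domains or scripts.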
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 10042