Scaling Laws or Threshold Effects: Exploring the Optimal Vocabulary Size for Balancing Performance and Efficiency in Low-Resource Languages
Keywords: Low-resource Languages, Vocabulary Expansion, Tokenization, Byte-level BPE (BBPE), Scaling Laws, Pareto Optimality, Efficiency-Performance Trade-offs, Mongolian, Tibetan, Uyghur
Abstract: While vocabulary expansion scaling laws are well established for high-resource languages, they remain unverified in low-resource settings. This gap is particularly critical for Byte-level BPE (BBPE), where constrained vocabulary sizes often fail to capture the rich morphemes of complex scripts, leading to severe over-segmentation in languages such as Mongolian, Tibetan, and Uyghur. We systematically investigate jointly scaled trilingual vocabularies for these languages, ranging from 140 to 195,000 tokens, across BPE (Llama 2) and BBPE (Qwen2.5/3) architectures. Our results reveal that BBPE follows a "decline-then-rise" pattern, requiring a vocabulary of at least 9,000 tokens (3,000 per language) to trigger non-linear performance gains and inference acceleration, whereas BPE improves monotonically. Using Pareto frontier analysis, we identify an optimal 79,500-token configuration for BBPE that reduces continuous pre-training duration by over 71% across 1.5B to 8B parameter models while consistently enhancing downstream performance.
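To make the over-segmentation problem concrete, here is a minimal sketch, assuming the Hugging Face transformers library and access to a public Qwen2.5 checkpoint, that measures tokenizer fertility (subword tokens per whitespace-delimited word) on short sample phrases. The sample strings and the fertility heuristic are illustrative placeholders, not the paper's evaluation protocol.

```python
# Sketch: gauging BBPE over-segmentation ("fertility") on low-resource scripts.
# Assumes the Hugging Face `transformers` library and network access to a
# public Qwen2.5 checkpoint; sample sentences are illustrative placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

samples = {
    "Tibetan": "བོད་ཀྱི་སྐད་ཡིག",      # "Tibetan language"
    "Uyghur":  "ئۇيغۇر تىلى",          # "Uyghur language"
    "English": "the Uyghur language",  # high-resource baseline
}

for lang, text in samples.items():
    tokens = tokenizer.tokenize(text)
    # Crude word count: Tibetan delimits syllables with tsheg marks (་), not
    # spaces, so the whole phrase counts as one "word"; the heuristic is
    # rough, but the contrast with the English baseline is the point.
    words = text.split()
    # Fertility = subword tokens per whitespace word; values far above the
    # high-resource baseline suggest over-segmentation.
    print(f"{lang}: {len(tokens)} tokens / {len(words)} words "
          f"= fertility {len(tokens) / len(words):.1f}")
```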
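Likewise, the Pareto selection step can be sketched as filtering out dominated vocabulary configurations on a (performance, cost) plane. The scores and cost values below are invented placeholders; only the vocabulary sizes echo figures mentioned in the abstract.

```python
# Sketch: selecting vocabulary sizes on the Pareto frontier of the
# performance/efficiency trade-off. All scores and costs are made-up
# placeholder numbers, not the paper's measurements.
def pareto_frontier(configs):
    """Keep configs not dominated on (higher score, lower cost)."""
    frontier = []
    for v, score, cost in configs:
        dominated = any(
            s2 >= score and c2 <= cost and (s2 > score or c2 < cost)
            for _, s2, c2 in configs
        )
        if not dominated:
            frontier.append((v, score, cost))
    return sorted(frontier)

# (vocab size, downstream score, relative training compute)
configs = [
    (9_000,   0.410, 0.55),
    (39_000,  0.470, 0.70),
    (60_000,  0.450, 0.85),  # dominated: worse score at higher cost
    (79_500,  0.520, 0.80),
    (120_000, 0.525, 0.92),
    (195_000, 0.530, 1.00),
]

for v, s, c in pareto_frontier(configs):
    print(f"vocab={v:>7,}: score={s:.3f}, cost={c:.2f}")
```

On such a frontier, the paper's reported 79,500-token optimum would correspond to a knee point, where further vocabulary growth buys little performance per unit of added cost.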
Paper Type: Long
Research Area: Low-resource Methods for NLP
Research Area Keywords: low-resource NLP, multilingual tokenization, scaling laws, BBPE, Pareto optimality, Mongolian, Tibetan, Uyghur
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings / efficiency
Languages Studied: Mongolian, Tibetan, Uyghur, Chinese
Submission Number: 3706