Scaling Laws or Threshold Effects: Exploring the Optimal Vocabulary Size for Balancing Performance and Efficiency in Low-Resource Languages
Keywords: Low-resource Languages, Vocabulary Expansion, Tokenization, Byte-level BPE (BBPE), Scaling Laws, Pareto Optimality, Efficiency-Performance Trade-offs, Mongolian, Tibetan, Uyghur
Abstract: While vocabulary expansion scaling laws are well established for high-resource languages, they remain unverified in low-resource settings. This gap is particularly critical for Byte-level BPE (BBPE), where constrained vocabulary sizes often fail to capture the rich morphemes of complex scripts, leading to severe over-segmentation in languages such as Mongolian, Tibetan, and Uyghur. We systematically investigate jointly scaled trilingual vocabularies for these languages, ranging from 140 to 195,000 tokens, across BPE (Llama 2) and BBPE (Qwen2.5/3) architectures. Our results reveal that BBPE follows a "decline-then-rise" pattern, requiring a vocabulary of at least 9,000 tokens (3,000 per language) to trigger non-linear performance gains and inference acceleration, whereas BPE improves monotonically. Using Pareto frontier analysis, we identify an optimal 79,500-token configuration for BBPE that reduces continuous pre-training duration by over 71% across 1.5B to 8B parameter models while consistently enhancing downstream performance.
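To make the over-segmentation problem concrete, here is a minimal sketch, assuming the Hugging Face transformers library and access to a public Qwen2.5 checkpoint, that measures tokenizer fertility (subword tokens per whitespace-delimited word) on short sample phrases. The sample strings and the fertility heuristic are illustrative placeholders, not the paper's evaluation protocol.

```python
# Sketch: gauging BBPE over-segmentation ("fertility") on low-resource scripts.
# Assumes the Hugging Face `transformers` library and network access to a
# public Qwen2.5 checkpoint; sample sentences are illustrative placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

samples = {
    "Tibetan": "བོད་ཀྱི་སྐད་ཡིག",      # "Tibetan language"
    "Uyghur":  "ئۇيغۇر تىلى",          # "Uyghur language"
    "English": "the Uyghur language",  # high-resource baseline
}

for lang, text in samples.items():
    tokens = tokenizer.tokenize(text)
    # Crude word count: Tibetan delimits syllables with tsheg marks (་), not
    # spaces, so the whole phrase counts as one "word"; the heuristic is
    # rough, but the contrast with the English baseline is the point.
    words = text.split()
    # Fertility = subword tokens per whitespace word; values far above the
    # high-resource baseline suggest over-segmentation.
    print(f"{lang}: {len(tokens)} tokens / {len(words)} words "
          f"= fertility {len(tokens) / len(words):.1f}")
```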
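Likewise, the Pareto selection step can be sketched as filtering out dominated vocabulary configurations on a (performance, cost) plane. The scores and cost values below are invented placeholders; only the vocabulary sizes echo figures mentioned in the abstract.

```python
# Sketch: selecting vocabulary sizes on the Pareto frontier of the
# performance/efficiency trade-off. All scores and costs are made-up
# placeholder numbers, not the paper's measurements.
def pareto_frontier(configs):
    """Keep configs not dominated on (higher score, lower cost)."""
    frontier = []
    for v, score, cost in configs:
        dominated = any(
            s2 >= score and c2 <= cost and (s2 > score or c2 < cost)
            for _, s2, c2 in configs
        )
        if not dominated:
            frontier.append((v, score, cost))
    return sorted(frontier)

# (vocab size, downstream score, relative training compute)
configs = [
    (9_000,   0.410, 0.55),
    (39_000,  0.470, 0.70),
    (60_000,  0.450, 0.85),  # dominated: worse score at higher cost
    (79_500,  0.520, 0.80),
    (120_000, 0.525, 0.92),
    (195_000, 0.530, 1.00),
]

for v, s, c in pareto_frontier(configs):
    print(f"vocab={v:>7,}: score={s:.3f}, cost={c:.2f}")
```

On such a frontier, the paper's reported 79,500-token optimum would correspond to a knee point, where further vocabulary growth buys little performance per unit of added cost.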
Paper Type: Long
Research Area: Low-resource Methods for NLP
Research Area Keywords: low-resource NLP, multilingual tokenization, scaling laws, BBPE, Pareto optimality, Mongolian, Tibetan, Uyghur
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings / efficiency
Languages Studied: Mongolian, Tibetan, Uyghur, Chinese
Submission Number: 3706