Sequence Length Matters in Data Scheduling for Accelerating Language Model Pretraining

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Sequence Length, Data Scheduling, Acceleration, Language Model Pretraining
TL;DR: We introduce a two-stage pretraining framework that leverages a dense-balanced sequence-length progression to accelerate language model pretraining.
Abstract: Pretraining large language models (LLMs) is computationally intensive, often requiring billions of tokens and extensive compute to reach competitive performance. Although recent advances in data selection have improved training efficiency, these methods struggle to maintain their gains consistently as models and datasets scale. In this work, we examine how sequence length, across different linguistic structures and levels of semantic continuity, affects language model pretraining, and propose a length-based online data scheduling method to accelerate the process. Specifically, we design a two-stage dense-balanced sequence prioritization framework for pretraining: 1) in the first stage, the model is trained on uniform-length dense token batches to encourage the formation of global language representations; 2) the second stage introduces variable-length sequences, which reinforce the learned abstractions while substantially reducing the total number of training iterations. We hypothesize, and verify empirically, that the model internalizes foundational language knowledge during the dense-token phase, allowing it to optimize more efficiently on the subsequent variable-length sequences. Empirical results show that our approach matches the perplexity of standard pretraining while requiring substantially fewer optimization steps, pointing to a promising way to reduce the computational burden of LLM pretraining.
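The two-stage schedule described in the abstract can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's implementation: the function name, the packing strategy, the stage split, and the micro-batch size are all illustrative assumptions. Stage 1 repacks tokens into uniform-length dense chunks; stage 2 yields sequences at their natural, variable lengths, sorted by length to keep batches balanced.

```python
from typing import Iterator, List


def two_stage_length_schedule(
    sequences: List[List[int]],
    dense_len: int = 512,
    stage1_fraction: float = 0.5,
    micro_batch: int = 4,
) -> Iterator[List[List[int]]]:
    """Hypothetical sketch of a dense-balanced sequence-length schedule.

    Stage 1: concatenate sequences and repack them into uniform-length
             dense token chunks (one dense chunk per yielded batch).
    Stage 2: yield the remaining sequences at variable length, sorted by
             length so each micro-batch contains similarly sized items.
    """
    split = int(len(sequences) * stage1_fraction)
    stage1, stage2 = sequences[:split], list(sequences[split:])

    # Stage 1: flatten into one token stream, then cut fixed-length chunks.
    flat = [tok for seq in stage1 for tok in seq]
    for i in range(0, len(flat) - dense_len + 1, dense_len):
        yield [flat[i : i + dense_len]]

    # Stage 2: variable-length sequences, length-sorted to limit padding.
    stage2.sort(key=len)
    batch: List[List[int]] = []
    for seq in stage2:
        batch.append(seq)
        if len(batch) == micro_batch:
            yield batch
            batch = []
    if batch:  # flush any final partial batch
        yield batch
```

In this sketch the curriculum is expressed purely at the data-loading level, so it could in principle wrap any standard training loop without changing the model or optimizer.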
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 10355