DeBLAS: Accelerate LLM Pretraining by Length-based Sequence Scheduling

08 Apr 2026 (modified: 22 Apr 2026) · Under review for TMLR · CC BY 4.0
Abstract: Pretraining large language models (LLMs) is computationally intensive, typically requiring massive datasets and many training iterations. Although recent advances in data selection have improved training efficiency, their gains often diminish under scaling laws. In this work, we investigate the impact of sequence length on language model pretraining and propose a length-based online data scheduling method that accelerates it. Specifically, we design a dense-balanced sequence scheduling framework for LLM pretraining: 1) in the first stage, the model is exposed to uniform-length dense token batches to encourage the formation of global language representations; 2) in the second stage, variable-length sequences are incorporated, reinforcing the learned abstractions while significantly reducing the total number of training iterations. We prove that the model internalizes foundational language knowledge during the dense-batch phase, allowing it to optimize more efficiently on the subsequent variable-length sequences. Empirical results show that our approach achieves perplexity comparable to standard pretraining with substantially fewer optimization steps, pointing to a promising way to reduce the computational burden of LLM pretraining.
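The abstract describes the two-stage schedule only at a high level. As an illustrative sketch rather than the authors' implementation, the Python generator below shows one way a dense-then-variable-length schedule could be organized; the function name `two_stage_length_schedule` and parameters such as `stage1_fraction` and `pack_length` are hypothetical.

```python
import random
from typing import Iterator, List

def two_stage_length_schedule(
    sequences: List[List[int]],
    batch_size: int = 8,
    stage1_fraction: float = 0.5,
    pack_length: int = 512,
) -> Iterator[List[List[int]]]:
    """Illustrative two-stage scheduler (assumed design, not the paper's exact rule).

    Stage 1: pack tokens into dense, uniform-length chunks so every batch
    carries the same number of tokens.
    Stage 2: yield the original variable-length sequences, grouped by
    similar length to limit padding.
    """
    # Stage 1: concatenate all tokens and re-chunk into fixed-length pieces.
    flat = [tok for seq in sequences for tok in seq]
    dense_chunks = [
        flat[i : i + pack_length]
        for i in range(0, len(flat) - pack_length + 1, pack_length)
    ]
    random.shuffle(dense_chunks)
    n_stage1 = int(len(dense_chunks) * stage1_fraction)
    for i in range(0, n_stage1, batch_size):
        yield dense_chunks[i : i + batch_size]

    # Stage 2: variable-length sequences, batched by similar length.
    by_length = sorted(sequences, key=len)
    batches = [
        by_length[i : i + batch_size]
        for i in range(0, len(by_length), batch_size)
    ]
    random.shuffle(batches)
    for batch in batches:
        yield batch
```

In this sketch, the fraction of training spent on dense uniform-length batches is a free parameter; the paper's actual criterion for switching stages is not specified in the abstract.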
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Andriy_Mnih1
Submission Number: 8323