Optimizing Packing and Shuffling Strategies for Enhanced Performance in Generative Language Models

ACL ARR 2025 February Submission5295 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · License: CC BY 4.0
Abstract: Packing and shuffling tokens are common practices in training auto-regressive language models to prevent overfitting and improve efficiency. Documents are typically concatenated into chunks of maximum sequence length (MSL) and then shuffled at the granularity of fixed-size token chunks (the atom size), which can break context within documents. An alternative packing strategy is padding, which places only one document per chunk. To optimize both packing strategies (concatenation vs. padding), we explored the optimal atom size for shuffling and compared their performance and efficiency. We found that in the most common setup, where average document length is greater than the MSL, matching the atom size to the MSL yields the lowest perplexity, controlling for the dataset. We also found that padding yields lower final perplexity than concatenation, at the cost of lower training efficiency. This trade-off informs the choice of shuffling and packing methods when training LMs.
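A minimal sketch (not the authors' released code) of the two packing strategies and atom-size shuffling described in the abstract. The values of MSL and PAD_ID, the function names, and the toy token IDs are illustrative assumptions.

```python
import random
from typing import List

MSL = 8     # maximum sequence length (illustrative)
PAD_ID = 0  # padding token ID (illustrative)

def pack_concat(docs: List[List[int]], msl: int = MSL) -> List[List[int]]:
    """Concatenate all documents into one token stream, then split it into
    fixed-size chunks of length msl, dropping any trailing remainder.
    Document boundaries may fall mid-chunk, breaking context."""
    stream = [tok for doc in docs for tok in doc]
    return [stream[i:i + msl] for i in range(0, len(stream) - msl + 1, msl)]

def pack_pad(docs: List[List[int]], msl: int = MSL) -> List[List[int]]:
    """One document per chunk: truncate long documents to msl and pad
    short ones, so no chunk mixes context from two documents."""
    return [doc[:msl] + [PAD_ID] * max(0, msl - len(doc)) for doc in docs]

def shuffle_atoms(chunks: List[List[int]], atom_size: int) -> List[List[int]]:
    """Shuffle the packed token stream at atom-size granularity, then
    re-chunk to MSL. With atom_size == MSL this shuffles whole chunks,
    the setting the abstract reports as yielding the lowest perplexity."""
    stream = [tok for chunk in chunks for tok in chunk]
    atoms = [stream[i:i + atom_size] for i in range(0, len(stream), atom_size)]
    random.shuffle(atoms)
    flat = [tok for atom in atoms for tok in atom]
    return [flat[i:i + MSL] for i in range(0, len(flat), MSL)]

docs = [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [11, 12, 13], [14, 15, 16, 17, 18]]
print(pack_concat(docs))  # boundaries can split documents across chunks
print(pack_pad(docs))     # each chunk holds exactly one (padded) document
print(shuffle_atoms(pack_concat(docs), atom_size=MSL))
```

Note the trade-off the abstract points to: pack_concat wastes no tokens on padding (higher efficiency) but mixes documents within a chunk, while pack_pad preserves document boundaries at the cost of padding overhead.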
Paper Type: Short
Research Area: Language Modeling
Research Area Keywords: pre-training
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 5295