Efficient Sequence Packing without Cross-contamination: Accelerating Large Language Models without Impacting Performance

Published: 01 Feb 2023, Last Modified: 13 Feb 2023. Submitted to ICLR 2023.
Keywords: deep learning, BERT, IPU, GPU, hardware-acceleration, padding, Wikipedia, NLP, bin-packing
TL;DR: Speed up BERT phase 2 pretraining (and other models) by 2x by avoiding padding, without impacting accuracy, in contrast to existing approaches.
Abstract: Effective training of today's large language models (LLMs) depends on large batches and long sequences for throughput and accuracy. To handle variable-length sequences on hardware accelerators, it is common practice to introduce padding tokens, so that all sequences in a batch have the same length. We show in this paper that the variation in sequence lengths in common NLP datasets is such that up to 50% of all tokens can be padding. In less common, but not extreme, cases (e.g. GLUE-COLA with sequence length 128), the ratio is up to 89%. Existing methods to address the resulting inefficiency are complicated by the need to avoid "cross-contamination" in self-attention, by a reduction in accuracy when sequence ordering information is lost, or by customized kernel implementations only valid for specific accelerators. This paper introduces a new formalization of sequence packing in the context of the well-studied bin packing problem, and presents new algorithms based on this formulation which, for example, confer a 2x speedup for phase 2 pretraining in BERT while preserving downstream performance. We show how existing models can be adapted to ensure mathematical equivalence between the original and packed models, meaning that packed models can be trained with existing pre-training and fine-tuning practices.
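Illustration (not from the submission): the packing idea in the abstract can be sketched as (1) a greedy bin-packing pass that groups variable-length sequences into fixed-length rows, and (2) a per-row mask that keeps self-attention within sequence boundaries to avoid cross-contamination. The function names, the first-fit-decreasing heuristic, and the max_len value below are illustrative assumptions, not the paper's specific algorithms.

    # Minimal sketch, assuming a hypothetical max_len of 512; not the authors' exact method.
    import numpy as np

    def pack_first_fit_decreasing(lengths, max_len=512):
        """Greedily assign sequence lengths to packs of capacity max_len."""
        packs = []   # each pack is a list of sequence lengths
        free = []    # remaining capacity of each pack
        for length in sorted(lengths, reverse=True):
            for i, cap in enumerate(free):
                if length <= cap:
                    packs[i].append(length)
                    free[i] -= length
                    break
            else:
                packs.append([length])
                free.append(max_len - length)
        return packs

    def block_diagonal_mask(pack, max_len=512):
        """1 where two tokens belong to the same sequence, 0 elsewhere (including padding)."""
        seq_ids = np.zeros(max_len, dtype=np.int32)   # 0 marks padding
        pos = 0
        for sid, length in enumerate(pack, start=1):
            seq_ids[pos:pos + length] = sid
            pos += length
        mask = (seq_ids[:, None] == seq_ids[None, :]) & (seq_ids[:, None] != 0)
        return mask.astype(np.int32)

    # Example: sequences of lengths 200, 180 and 120 fit into a single pack of 512,
    # so the row carries only 12 padding tokens instead of hundreds.
    packs = pack_first_fit_decreasing([200, 180, 120], max_len=512)
    mask = block_diagonal_mask(packs[0], max_len=512)
    print(packs, mask.shape)

The block-diagonal mask is what the abstract refers to as keeping the packed model mathematically equivalent to the unpacked one: tokens from different packed sequences never attend to each other.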
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)