Structured Packing in LLM Training Improves Long Context Utilization

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: LLM, long-context, pretraining, context utilization, NLP, language models, data mixtures
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: The paper introduces SPLiCe, a novel method for structuring pretraining data that improves long-context utilization.
Abstract: Recent advances in long-context Large Language Models (LCLMs) have generated significant interest, especially in applications such as querying scientific research papers. However, their potential is often limited by inadequate context utilization. We identify the absence of long-range semantic dependencies in typical training data as a primary hindrance. To address this, we delve into the benefits of frequently incorporating related documents into training inputs. Using the inherent directory structure of code data as a source of training examples, we demonstrate improvements in perplexity, even for tasks unrelated to coding. Building on these findings, but with a broader focus, we introduce Structured Packing for Long Context (SPLiCe). SPLiCe is an innovative method for creating training examples by using BM25 to collate the most mutually relevant documents into a single training context. Our results indicate that SPLiCe enhances model performance across various tasks and can be used to train large models to utilize long contexts better. We validate our results by training a large 3B model, showing both perplexity improvements and better long-context performance on a benchmark key-value retrieval task.
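To make the core idea of the abstract concrete, the following is a minimal illustrative sketch of BM25-based packing of mutually relevant documents into a single training context. It is not the authors' implementation: the function name `pack_related_documents`, the greedy selection loop, the whitespace tokenization, and the `max_docs` budget are all assumptions for illustration, and it relies on the third-party `rank_bm25` package.

```python
# Hypothetical sketch of BM25-based document packing (not the paper's code).
# Requires: pip install rank_bm25
from rank_bm25 import BM25Okapi


def pack_related_documents(documents, seed_idx, max_docs=4):
    """Greedily collate mutually relevant documents into one training context.

    documents: list of raw text strings.
    seed_idx:  index of the document that starts the training example.
    max_docs:  how many documents to pack (a stand-in for a token budget).
    """
    tokenized = [doc.split() for doc in documents]  # simple whitespace tokenization
    bm25 = BM25Okapi(tokenized)                     # index the corpus once

    packed = [seed_idx]
    remaining = set(range(len(documents))) - {seed_idx}

    while len(packed) < max_docs and remaining:
        # Query BM25 with the most recently added document and take the
        # highest-scoring document that has not been packed yet.
        scores = bm25.get_scores(tokenized[packed[-1]])
        best = max(remaining, key=lambda i: scores[i])
        packed.append(best)
        remaining.remove(best)

    # Concatenate the selected documents into a single training context.
    return "\n\n".join(documents[i] for i in packed)


if __name__ == "__main__":
    corpus = [
        "def parse_config(path): ...",
        "unit tests for the config parser",
        "notes on data loading utilities",
        "config parser documentation and examples",
    ]
    print(pack_related_documents(corpus, seed_idx=0, max_docs=3))
```

In this sketch, each training context is seeded by one document and grown with its nearest BM25 neighbors, so long-range dependencies within a context are semantically related rather than arbitrary, which is the behavior the abstract attributes to SPLiCe.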
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5879