Keywords: Long Context, Data Synthesis, Self-Attention, Context Length Extension, Large Language Models
Abstract: The development of long-context Large Language Models (LLMs) has been hindered by a critical bottleneck: the scarcity of high-quality long-context training data. Standard data synthesis methods, which typically concatenate short documents, often fail to create the challenging long-range dependencies essential for robust learning. In this work, we introduce Long-Attention Weaving (LAW), a novel framework that leverages a model's own self-attention mechanism to synthesize a superior long-context training curriculum. LAW operates in two stages: first, it employs a multi-scale attention-based score to identify short documents that are inherently rich in long-range dependencies; second, it uses a novel interleaving strategy to weave the selected documents into complex sequences, compelling the model to establish non-trivial, long-distance relationships. We demonstrate that continually pre-training LLaMA-2 7B on data synthesized by LAW extends its effective context window to 64k tokens and significantly outperforms strong baselines on LongBench, a suite of long-context benchmarks. Our findings highlight the efficacy of attention-guided data engineering for unlocking the full potential of long-context LLMs. All code and data are available at https://anonymous.4open.science/r/LAW-B056.
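Since the abstract only describes LAW at a high level, the sketch below is an illustrative reading of the two-stage pipeline, not the authors' implementation. The `long_range_score` and `interleave` functions, the distance thresholds, the chunk size, and the target sequence length are all hypothetical stand-ins for the multi-scale attention-based score and the interleaving strategy mentioned above; random matrices stand in for real model attention maps.

```python
# Minimal sketch of a LAW-style two-stage pipeline (assumptions, not the paper's exact method).
import numpy as np

def long_range_score(attn: np.ndarray, scales=(32, 128, 256)) -> float:
    """Hypothetical multi-scale score: average attention mass a token places
    on tokens farther away than each distance threshold."""
    n = attn.shape[-1]
    rows, cols = np.indices((n, n))
    dist = rows - cols                      # causal distance between query and key positions
    score = 0.0
    for s in scales:
        mask = dist >= s                    # attention links spanning at least s tokens
        if mask.any():
            score += attn[mask].sum() / n   # long-range mass, normalized per token
    return score / len(scales)

def interleave(docs, chunk_tokens=128, target_tokens=4096):
    """Hypothetical weaving step: round-robin chunks from the selected documents
    so that continuations of each document reappear far apart in the sequence."""
    chunked = [[d[i:i + chunk_tokens] for i in range(0, len(d), chunk_tokens)]
               for d in docs]
    woven, i = [], 0
    while sum(len(c) for c in woven) < target_tokens and any(chunked):
        if chunked[i % len(chunked)]:
            woven.append(chunked[i % len(chunked)].pop(0))
        i += 1
    return [tok for chunk in woven for tok in chunk]

# Stage 1: score short documents by their long-range attention structure
# (toy token-id documents and random attention-like matrices for illustration).
docs = [list(range(512)) for _ in range(8)]
attn_maps = [np.random.dirichlet(np.ones(512), size=512) for _ in docs]
scores = [long_range_score(a) for a in attn_maps]
selected = [docs[i] for i in np.argsort(scores)[::-1][:4]]   # keep the top-scoring documents

# Stage 2: weave the selected documents into one long training sequence.
sequence = interleave(selected, chunk_tokens=128, target_tokens=2048)
print(len(sequence))
```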
Primary Area: datasets and benchmarks
Submission Number: 5644