LongAttn: Selecting Long-context Training Data via Token-level Attention

ACL ARR 2025 February Submission 6831 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: With the development of large language models (LLMs), the demand for handling long contexts has grown substantially. Enhancing long-context capabilities requires constructing high-quality training data with **long-range dependencies**. Existing methods for selecting long-context data often rely on sentence-level analysis, which leaves considerable room for improvement in both performance and efficiency. In this paper, we propose **LongAttn**, a novel token-level framework that leverages the self-attention mechanism of LLMs to measure the **long-range dependencies** in the data. By calculating token-level dependency strength and the distribution uniformity of token scores, LongAttn effectively quantifies **long-range dependencies**, enabling more accurate and efficient data selection. Using LongAttn, we filter **LongABC-32K** from open-source long-context datasets (ArXiv, Book, and Code). Comprehensive experiments demonstrate the **effectiveness**, **scalability**, and **efficiency** of LongAttn. We will release our code and the high-quality long-context dataset **LongABC-32K** in the future.
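The abstract does not spell out the scoring formulas, but a minimal sketch of the core idea might look like the following. It assumes attention weights from a causal LLM averaged over heads and layers; the `window` cutoff for "long-range", the entropy-based uniformity measure, and the final aggregation are all illustrative assumptions, not the paper's exact definitions.

```python
import torch

def longattn_score(attn: torch.Tensor, window: int = 2048) -> float:
    """Hedged sketch of a LongAttn-style score for one sequence.

    `attn` is a (seq_len, seq_len) lower-triangular attention matrix,
    assumed to be averaged over heads/layers (the paper's exact
    aggregation is not stated in the abstract).
    """
    seq_len = attn.size(0)

    # Mask selecting "long-range" entries: a query attends to a key more
    # than `window` tokens away (the cutoff is an illustrative assumption).
    idx = torch.arange(seq_len)
    long_mask = (idx.unsqueeze(1) - idx.unsqueeze(0)) > window
    long_attn = attn * long_mask.to(attn.dtype)

    # Token-level dependency strength: attention mass placed on distant tokens.
    strength = long_attn.sum(dim=-1)

    # Distribution uniformity: entropy of each token's long-range attention,
    # normalized so evenly spread attention scores near 1.
    p = long_attn / long_attn.sum(dim=-1, keepdim=True).clamp_min(1e-9)
    entropy = -(p * p.clamp_min(1e-9).log()).sum(dim=-1)
    uniformity = entropy / torch.log(torch.tensor(float(seq_len)))

    # Combine token-level scores into one sequence-level score; a simple
    # mean here, purely as a placeholder for the paper's aggregation.
    return (strength * uniformity).mean().item()
```

In a selection pipeline, such a score would be computed for each candidate document and the top-scoring sequences kept for continual pre-training; the ranking step itself is standard and independent of the scoring details assumed above.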
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Long context, LLM, Continual Pre-training
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 6831