Abstract: Aligning large language models to handle instructions with extremely long contexts has yet to be fully investigated. Previous studies have attempted to scale up the available data volume by synthesizing long instruction-following samples, since manually constructing such a dataset is challenging for annotators. However, the lack of a well-defined strategy for ensuring data quality may introduce low-quality samples and restrict the model's performance. Thus, we propose GATEAU, a novel framework that addresses the unique challenge of long context alignment by identifying influential samples enriched with long-range dependency relations. Specifically, GATEAU measures long-range dependencies from two essential aspects: the difficulty of generating the target response caused by such dependencies, and the difficulty of understanding the long input caused by them. Comprehensive experiments indicate that GATEAU effectively identifies influential samples, and that the model trained on these selected samples exhibits better instruction-following and long-context understanding capabilities.
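To make the first aspect concrete, here is a minimal sketch (not the authors' exact method) of one way to score how much a target response depends on the long input: compare the model's loss on the response when conditioned on the full long context versus a truncated one. The model name, the truncation length, and the helper names are illustrative assumptions.

```python
# Illustrative sketch only: a perplexity-gap style score for long-range dependency.
# A larger score means the response is harder to generate without the full long
# input, i.e. the sample is richer in long-range dependency relations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumed; any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()


@torch.no_grad()
def response_nll(context: str, response: str) -> float:
    """Mean negative log-likelihood of `response` tokens given `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids.to(model.device)
    resp_ids = tokenizer(
        response, return_tensors="pt", add_special_tokens=False
    ).input_ids.to(model.device)
    input_ids = torch.cat([ctx_ids, resp_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : ctx_ids.shape[1]] = -100  # score only the response tokens
    return model(input_ids, labels=labels).loss.item()


def long_dependency_score(
    long_input: str, instruction: str, response: str, short_len: int = 512
) -> float:
    """Hypothetical score: NLL gain from conditioning on the full long input."""
    short_input = long_input[-short_len:]  # crude tail truncation, for illustration
    nll_short = response_nll(short_input + "\n" + instruction, response)
    nll_long = response_nll(long_input + "\n" + instruction, response)
    return nll_short - nll_long
```

Under this assumed formulation, samples with the largest scores would be kept for alignment training, since their responses benefit most from the long-range context.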
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: data-efficient training, NLP in resource-constrained settings
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data analysis
Languages Studied: English, Chinese
Keywords: long context alignment, large language models, data selection, efficient instruction tuning
Submission Number: 918