Revisiting Long-context Modeling from Context Denoising Perspective

ACL ARR 2025 February Submission3457 Authors

15 Feb 2025 (modified: 09 May 2025) · CC BY 4.0
Abstract: Long-context models (LCMs) have demonstrated great potential in processing long sequences, facilitating many real-world applications. The success of LCMs can be attributed to their ability to locate implicit critical information within the context for further prediction. However, recent studies indicate that LCMs can be distracted by context noise, i.e., irrelevant information. In this paper, we conduct a fine-grained analysis of context noise and propose an effective metric, the IG score, for identifying noisy information within the context. We also find that simply restraining the effect of noisy context can significantly boost the model's attention on critical tokens. Based on this observation, we propose a simple yet effective training strategy, Context Denoising Training (CDT), which simultaneously strengthens the model's attention on critical tokens and reinforces the connection between these tokens and the model's predictions. Experiments on both context window scaling and long-context alignment settings across four different tasks demonstrate the superiority of CDT. Notably, with CDT, an open-source 8B model achieves results (50.92 points) comparable to GPT-4o (51.00 points).
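The abstract does not define how the IG score is computed. Assuming it follows an integrated-gradients-style attribution (scoring each context token by how much it contributes to the model's output, then flagging low-score tokens as noise), a minimal numerical sketch might look like the following; the function names, the toy relevance model, and the 0.5 threshold are all illustrative assumptions, not the paper's actual method:

```python
import numpy as np

def integrated_gradients(f, grad_f, x, baseline=None, steps=50):
    """Approximate per-dimension integrated-gradients attribution:
    IG_i = (x_i - b_i) * integral_0^1 df/dx_i(b + a*(x - b)) da,
    using a midpoint Riemann sum over `steps` interpolation points."""
    if baseline is None:
        baseline = np.zeros_like(x)
    alphas = (np.arange(steps) + 0.5) / steps   # midpoints in (0, 1)
    diff = x - baseline
    grads = np.stack([grad_f(baseline + a * diff) for a in alphas])
    return diff * grads.mean(axis=0)

# Toy stand-in for "how much each context token drives the prediction":
# f(x) = sum(w * x^2), whose gradient is 2 * w * x. Dimensions with
# w > 0 play the role of critical tokens; w = 0 plays the role of noise.
w = np.array([0.0, 1.0, 0.0, 2.0])
f = lambda x: float(np.sum(w * x**2))
grad_f = lambda x: 2 * w * x

x = np.ones(4)
scores = integrated_gradients(f, grad_f, x)
# IG satisfies completeness: attributions sum to f(x) - f(baseline).
assert np.isclose(scores.sum(), f(x) - f(np.zeros(4)))
noise_mask = scores < 0.5   # low-attribution positions ~ context noise
print(scores, noise_mask)   # e.g. [0. 1. 0. 2.] [ True False  True False]
```

The completeness property (attributions summing to the change in model output) is what makes an integrated-gradients-style score attractive for ranking context tokens: every point of the prediction is accounted for by some token, so positions with near-zero score are safe candidates for the "noise" set that CDT would restrain.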
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: fine-tuning
Contribution Types: Model analysis & interpretability, Reproduction study, Approaches to low-resource settings
Languages Studied: English
Submission Number: 3457