Keywords: long-context modeling, large language models
Abstract: A limited context window has been an inherent constraint of large language models (LLMs), significantly restricting their application scenarios. Continual pre-training on long-context data is the most straightforward approach to further extending an LLM's context window, but it comes at the expense of substantial data acquisition and computation costs.
Cost-efficient context window extension methods that do not require a pre-training process have emerged as appealing alternatives, such as extrapolation, attention manipulation, and context compression.
In this paper, we propose a novel approach named Shared-LLaMA.
Shared-LLaMA is composed of two short-context LLMs: one works as a compressor and the other as a decoder.
The decoder receives compressed multi-grained context information from
the compressor and performs context-aware modeling on the running text.
Information transfer between the compressor and the decoder occurs only at the lowest layers, which circumvents an entire forward pass over the long context and saves inference time.
Both LLMs are initialized from the same off-the-shelf checkpoint and thus can be directly trained without extra feature alignment stages.
Additionally, we propose a tree structure to store the multi-grained information and design a search algorithm to quickly locate and retrieve relevant information from each level of that tree.
With these efficient design choices, Shared-LLaMA greatly reduces memory consumption and achieves a clear speedup over other advanced baselines (2$\times$ over streaming and 3$\times$ over encoder-decoder architectures).
In our evaluation on long-context modeling and understanding tasks, Shared-LLaMA yields results superior or comparable to several strong baselines, indicating that it achieves a good balance between efficiency and effectiveness.
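To make the compressor-decoder design described in the abstract more concrete, below is a minimal, runnable sketch of the general idea: a shallow "compressor" stack produces compressed context states, and those states are injected into the decoder only at its lowest layers, while the upper layers attend solely to the running text. All module names, layer counts, and the use of cross-attention as the transfer mechanism are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch only: not the authors' architecture. Assumes cross-attention is the
# transfer mechanism and that toy embedding tensors stand in for token inputs.
import torch
import torch.nn as nn


class LowLayerFusionLM(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_low=2, n_high=4):
        super().__init__()
        # Compressor: only a shallow stack is run over the long context,
        # so an entire deep forward pass over that context is avoided.
        self.compressor = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_low,
        )
        # Decoder, lowest layers: cross-attend to the compressed context.
        self.low_layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_low)
        )
        # Decoder, upper layers: plain self-attention over the running text.
        self.high_layers = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_high,
        )

    def forward(self, context_emb, text_emb):
        memory = self.compressor(context_emb)   # compressed context (B, L_ctx, d)
        h = text_emb
        for layer in self.low_layers:
            h = layer(h, memory)                # transfer only at lowest layers
        return self.high_layers(h)              # context-free upper layers (B, L_txt, d)


# Toy usage with random embeddings standing in for embedded tokens.
model = LowLayerFusionLM()
ctx = torch.randn(1, 128, 256)   # long-context segment
txt = torch.randn(1, 32, 256)    # running text
print(model(ctx, txt).shape)     # torch.Size([1, 32, 256])
```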
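The multi-grained tree store and search can likewise be pictured with a small sketch. The data structures and the greedy cosine-similarity descent below are hypothetical illustrations of "one retrieved item per tree level", not the paper's algorithm.

```python
# Sketch only: coarse summaries near the root, finer chunks at deeper levels;
# a query descends greedily and collects the best-matching payload per level.
from dataclasses import dataclass, field

import torch
import torch.nn.functional as F


@dataclass
class Node:
    key: torch.Tensor                 # embedding summarizing this node's span
    payload: str                      # the (compressed) text it represents
    children: list = field(default_factory=list)


def search(root: Node, query: torch.Tensor, hits=None):
    """Greedy top-down search: keep the child most similar to the query at
    each level and return one retrieved payload per level of the tree."""
    if hits is None:
        hits = []
    hits.append(root.payload)
    if not root.children:
        return hits
    sims = torch.stack([F.cosine_similarity(query, c.key, dim=0)
                        for c in root.children])
    return search(root.children[int(sims.argmax())], query, hits)


# Toy usage: a two-level tree over random keys.
d = 8
leaves = [Node(torch.randn(d), f"fine chunk {i}") for i in range(4)]
root = Node(torch.randn(d), "coarse summary", children=leaves)
print(search(root, torch.randn(d)))   # e.g. ['coarse summary', 'fine chunk 2']
```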
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5960