Can long-context large language models understand long contexts?

22 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: long context, dataset, large language model, long and short term dependency
Abstract: Large language models (LLMs) have attracted significant attention for their remarkable performance across various NLP tasks. However, the fixed context window of the transformer architecture leaves them unable to memorize and understand extremely long inputs. A great deal of work has gone into designing effective techniques to enlarge LLMs' context windows, which in turn creates a demand for high-quality benchmark datasets that evaluate LLMs' long context understanding. Several datasets exist for this purpose, but they suffer from (1) text lengths that are short compared to modern LLMs' context windows, (2) out-of-date documents that may already be included in the training corpora of modern LLMs, and (3) a predominance of short dependency tasks---few questions truly require an LLM to aggregate information across the whole document (which we call long dependency tasks). Most importantly, they hardly assess long dependency modeling and understanding across segments, which is particularly challenging and valuable for improving LLMs' long context ability. In this paper, we present LooGLE, a Long Context Generic Language Evaluation benchmark for LLM long context understanding. It contains up-to-date documents (all after 2022), over 24k tokens per document, and 6k newly generated questions from diverse domains and categories. Specifically, we recruited a group of human labelers to read 145 long documents in our benchmark and asked them to compose about 1.1k QA pairs satisfying our long dependency requirements. Each of these 1.1k high-quality QA pairs was cross-validated 3 times by 2 labelers, aiming to provide the most accurate evaluation to date of LLMs' ability on long dependency questions. From a comprehensive evaluation of 8 state-of-the-art LLMs on LooGLE, we find that: (1) commercial models generally outperform open-sourced models; (2) LLMs are more skilled at short dependency tasks such as short QA and cloze, but still struggle with genuinely long dependency tasks; (3) in-context learning and chain-of-thought prompting bring only incremental improvements for long context understanding; (4) retrieval-based techniques contribute significantly to improvement on short QA, whereas many techniques that extend the context window through optimized transformer architectures or positional encodings do little to resolve long context understanding.
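To make finding (4) concrete, below is a minimal sketch (not the authors' code) of the kind of retrieval-based baseline the abstract refers to for short-dependency QA over a long document: the document is split into chunks, the chunks most similar to the question are retrieved, and only those are passed to the model. All names here (chunk_text, retrieve_and_answer, answer_with_llm) are hypothetical, and the bag-of-words similarity is a toy stand-in for a dense retriever.

```python
# Hypothetical sketch of a retrieval-based short-QA baseline over a long document.
# `answer_with_llm` is an assumed stand-in for any LLM completion call.
from collections import Counter
import math

def chunk_text(text: str, chunk_size: int = 512) -> list[str]:
    """Split a long document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def bow_vector(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real baseline would use a dense encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_and_answer(document: str, question: str, answer_with_llm, top_k: int = 3) -> str:
    """Retrieve the chunks most similar to the question and query the LLM on them only."""
    chunks = chunk_text(document)
    q_vec = bow_vector(question)
    ranked = sorted(chunks, key=lambda c: cosine(bow_vector(c), q_vec), reverse=True)
    context = "\n\n".join(ranked[:top_k])
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return answer_with_llm(prompt)
```

Such a pipeline works when the evidence for a question is local to a few chunks (short dependency), but it cannot by itself aggregate evidence scattered across an entire document, which is consistent with the abstract's observation that retrieval helps short QA yet leaves long dependency tasks unresolved.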
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5230