Keywords: LLMs, Long Context, Evaluation
Abstract: As the context limits of Large Language Models (LLMs) increase, the range of
possible applications and downstream functions broadens. In many real-world
tasks, decisions depend on details scattered across collections of often disparate
documents containing mostly irrelevant information. Long-context LLMs appear
well-suited to this form of complex information retrieval and reasoning, which has
traditionally proven costly and time-consuming. However, although the development of longer context models has seen rapid gains in recent years, our understanding of how effectively LLMs use their context has not kept pace. To address
this, we conduct a set of retrieval experiments designed to evaluate the capabilities
of 17 leading LLMs, such as their ability to follow threads of information through
the context window. Strikingly, we find that many models are remarkably
thread-safe: capable of simultaneously following multiple threads without significant loss
in performance. Still, for many models, we find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as
the context window grows. Our study also highlights that token counts from different tokenizers should not be directly compared, because they often
correspond to substantially different numbers of written characters. We release
our code and long context experimental data.
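
The point about tokenizers can be illustrated with a minimal sketch. It does not use any real model's tokenizer: `whitespace_tokens` and `fixed_width_tokens` are hypothetical toy tokenizers, and `sample` is an arbitrary sentence, all introduced here only to show that the same text yields different token counts, and therefore different characters-per-token ratios, under different tokenization schemes.

```python
from typing import Callable, List


def whitespace_tokens(text: str) -> List[str]:
    """Toy tokenizer A (hypothetical): one token per whitespace-separated word."""
    return text.split()


def fixed_width_tokens(text: str, width: int = 3) -> List[str]:
    """Toy tokenizer B (hypothetical): one token per `width` characters."""
    return [text[i:i + width] for i in range(0, len(text), width)]


def chars_per_token(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of written characters covered by one token."""
    tokens = tokenize(text)
    return len(text) / len(tokens)


# Arbitrary example sentence (an assumption, not taken from the experimental data).
sample = "Long-context models are evaluated on retrieval tasks."

for name, tokenize in [("whitespace", whitespace_tokens),
                       ("fixed-width", fixed_width_tokens)]:
    n_tokens = len(tokenize(sample))
    ratio = chars_per_token(sample, tokenize)
    print(f"{name}: {n_tokens} tokens, {ratio:.2f} characters per token")
```

Under these assumptions, a context limit stated as a fixed number of tokens covers a different amount of written text for each tokenizer, which is why a comparison in characters can be more informative than one in raw token counts.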
Primary Area: other topics in machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Submission Number: 3859