Stress-Testing Long-Context Language Models with Lifelong ICL and Task Haystack

Published: 18 Jun 2024, Last Modified: 16 Jul 2024, LCFM 2024, CC BY 4.0
Keywords: Long-Context Language Models, Lifelong Learning, In-Context Learning
Abstract: We introduce Lifelong ICL, a problem setting that challenges long-context language models (LMs) to learn from a sequence of tasks through in-context learning (ICL). We further introduce Task Haystack, an evaluation suite designed for assessing and diagnosing how long-context LMs utilize long contexts in the Lifelong ICL setting. When given a task instruction and test inputs, long-context LMs are expected to leverage the same-task demonstrations in the Lifelong ICL prompt, avoid distraction from other tasks, and achieve a test accuracy no worse than that of the single-task ICL baseline. Task Haystack draws inspiration from the widely adopted "needle-in-a-haystack" (NIAH) evaluation, but presents new and unique challenges. It demands that models (1) utilize the context with deeper understanding, rather than resorting to simple copying and pasting; (2) navigate through long streams of evolving topics and tasks, closely approximating the complexities of real-world scenarios faced by long-context LMs. Additionally, Task Haystack inherits the controllability of NIAH, providing model developers with tools to identify model vulnerabilities effectively. We benchmark ten long-context LMs using Task Haystack. We find that state-of-the-art closed models such as GPT-4o still struggle in this setting, failing in 15% of cases on average, while all open models we evaluate lag behind by a large margin. Further, through controlled analyses, we find that current long-context models are prone to distractibility and recency bias, as well as other limitations in robustness and instruction understanding.
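To make the pass criterion described above concrete, below is a minimal Python sketch of how a single Task Haystack case could be scored. It is not the authors' released evaluation code; the helper callables (`build_single_task_prompt`, `build_lifelong_prompt`, `evaluate_accuracy`) and the task dictionary layout are hypothetical placeholders for the actual prompt construction and scoring logic.

```python
# Illustrative sketch of the Task Haystack pass criterion (assumptions, not the paper's code).
from typing import Callable, Sequence


def passes_task_haystack(
    model: Callable[[str], str],
    target_task: dict,                 # hypothetical layout: {"instruction", "demos", "test"}
    distractor_tasks: Sequence[dict],  # other tasks interleaved into the Lifelong ICL prompt
    build_single_task_prompt: Callable[[dict], str],
    build_lifelong_prompt: Callable[[Sequence[dict]], str],
    evaluate_accuracy: Callable[[Callable[[str], str], str, list], float],
) -> bool:
    """A case counts as passed when accuracy with the long Lifelong ICL prompt
    is no worse than the single-task ICL baseline on the same test inputs."""
    baseline_prompt = build_single_task_prompt(target_task)
    lifelong_prompt = build_lifelong_prompt([*distractor_tasks, target_task])

    baseline_acc = evaluate_accuracy(model, baseline_prompt, target_task["test"])
    lifelong_acc = evaluate_accuracy(model, lifelong_prompt, target_task["test"])

    return lifelong_acc >= baseline_acc
```

In practice, the evaluation would aggregate this check over many (task, position) combinations, which is what the controllability inherited from NIAH refers to: varying where the target task sits in the long stream of distractor tasks.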
Submission Number: 41