FACTOR: Factoring Complexity and Context Length in Long-Context Model Evaluation

Hongyi Liu; Zhuoming Chen; Yang Zhou; Beidi Chen

26 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: Long-context reasoning, Language models

TL;DR: We introduce the FACTOR benchmark to evaluate large language models' reasoning abilities over long contexts, modeling performance over complexity to reveal distinctive accuracy behaviors characterized by two factors.

Abstract: Large language models (LLMs) with extended context windows have shown remarkable capabilities, especially with contexts of up to 128K tokens. However, whether these resource-intensive LLMs genuinely surpass simpler Retrieval Augmented Generation (RAG) techniques remains debated. We precisely delineate the differences between long-context LLMs and RAG methods, emphasizing the long-context reasoning abilities of LLMs that RAG cannot replicate. Existing benchmarks often focus on retrieval tasks and contain few, if any, complex reasoning tasks, which hinders the assessment of reasoning over extended contexts. We introduce the FACTOR benchmark (Factoring Analysis of Complexity and Textual Context in Reasoning), which evaluates LLMs by independently varying task complexity and context length. We evaluate a comprehensive list of LLMs on FACTOR. Beyond raw accuracy scores, we also model the relationship between accuracy and complexity at a given context length. A simple log-linear model fits consistently well across the evaluated models. The model has two interpretable parameters: the slope, or Complexity Decay Factor (CDF), and the y-intercept, or Contextual Decay Offset (CDO). These are shown to provide separate, informative measures of a model's complex-reasoning ability and its innate long-context ability, respectively. Our findings highlight distinct failure modes linked to task complexity and context length, underscoring the reasoning capabilities of long-context LLMs that RAG methods cannot attain.
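
To make the two parameters concrete, the following is a minimal sketch of fitting a log-linear model at one fixed context length. The abstract does not give the exact functional form, so this sketch assumes that log(accuracy) is linear in task complexity; the function name `fit_log_linear` and the data points are hypothetical and synthetic, not results from the paper.

```python
import math

def fit_log_linear(complexity, accuracy, eps=1e-6):
    """Least-squares fit of log(accuracy) = slope * complexity + intercept.

    Assumption (not confirmed by the abstract): the log is taken on accuracy.
    Returns (slope, intercept), read here as (CDF, CDO) at one context length.
    """
    xs = [float(c) for c in complexity]
    # Clamp accuracies into [eps, 1] so that log() is always defined.
    ys = [math.log(min(max(a, eps), 1.0)) for a in accuracy]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic accuracies generated from a known slope and intercept,
# used only to check that the fit recovers them.
complexity = list(range(1, 9))
accuracy = [math.exp(-0.25 * c - 0.10) for c in complexity]
cdf, cdo = fit_log_linear(complexity, accuracy)
```

Under this reading, a more negative slope would indicate faster accuracy loss as complexity grows, and repeating the fit at several context lengths would show how the intercept shifts with length.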

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 7490
