RULERv2: From Basic Retrieval to Complex Reasoning, A Bottom-Up Benchmark for Long-Context Evaluation

Published: 24 Sept 2025, Last Modified: 09 Oct 2025 · NeurIPS 2025 LLM Evaluation Workshop Poster · CC BY 4.0
Keywords: Long-context, Evaluation, Benchmark, Synthetic
TL;DR: We propose RULERv2, a novel synthetic benchmark for evaluating long-context language models.
Abstract: Recent progress in long-context language models has spurred the development of diverse benchmarks that often test multiple skills simultaneously, making it difficult to pinpoint the specific reasons for model failures. To address this, we introduce RULERv2, a benchmark with a systematic difficulty progression from basic synthetic retrieval to complex problem-solving across three domains: multi-key NIAH, multi-value NIAH, and multi-doc QA. We conduct a large-scale evaluation of leading models, including 7 closed-source and 27 open-weight models. Our findings reveal a notable performance gap between the two groups. Critically, we demonstrate that all models, including those claiming million-token context windows, exhibit performance degradation with increasing length, highlighting an unresolved challenge. Our analysis shows that explicit decomposition into a retrieve-then-solve strategy outperforms implicit, single-step approaches, and that chain-of-thought reasoning enables models to discover effective decomposition autonomously. Finally, we find that even top-performing open-weight models struggle with fundamental retrieval and copying tasks, leading to degraded performance on more complex problems.
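To give a concrete sense of the task family the abstract describes, the sketch below generates a toy multi-key NIAH (needle-in-a-haystack) item: several key-value "needles" are hidden in filler text and the model must retrieve the value for one queried key. The key names, needle wording, and haystack construction here are assumptions for illustration only, not RULERv2's actual data generator.

```python
import random
import string

def make_multi_key_niah(num_keys=4, haystack_sentences=250, seed=0):
    """Build one toy multi-key NIAH item (illustrative, not the paper's spec)."""
    rng = random.Random(seed)

    # Random keys and numeric values to serve as the hidden "needles".
    keys = [
        "key-" + "".join(rng.choices(string.ascii_lowercase, k=6))
        for _ in range(num_keys)
    ]
    values = [str(rng.randint(100000, 999999)) for _ in range(num_keys)]
    needles = [f"The special magic number for {k} is {v}." for k, v in zip(keys, values)]

    # Distractor "haystack" text; a real benchmark would use longer, varied filler.
    filler = ["The grass is green and the sky is blue."] * haystack_sentences

    # Insert each needle at a random position in the haystack.
    for needle in needles:
        filler.insert(rng.randrange(len(filler) + 1), needle)

    # Query one key; the answer is its associated value.
    query_idx = rng.randrange(num_keys)
    return {
        "context": " ".join(filler),
        "question": f"What is the special magic number for {keys[query_idx]}?",
        "answer": values[query_idx],
    }

if __name__ == "__main__":
    item = make_multi_key_niah()
    print(item["question"], "->", item["answer"])
```

Scaling the haystack length (and the number of needles) is what produces the controlled difficulty progression the abstract refers to, from simple single-needle retrieval up to multi-value retrieval and multi-document QA.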
Submission Number: 81