RULERv2: From Basic Retrieval to Complex Reasoning, A Bottom-Up Benchmark for Long-Context Evaluation
Keywords: Long-context, Evaluation, Benchmark, Synthetic
TL;DR: We propose RULERv2, a novel synthetic benchmark for evaluating long-context language models
Abstract: Recent advances in long-context language models have spurred the development of diverse benchmarks that often test multiple skills simultaneously, making it difficult to identify specific failure modes. To address this, we introduce RULERv2, a benchmark with a systematic difficulty progression from basic synthetic retrieval to complex multi-step reasoning across three domains: multi-key NIAH, multi-value NIAH, and multi-doc QA. We conduct a large-scale evaluation of leading models, including 7 closed-source and 26 open-weight models. Our findings reveal a notable performance gap between the two groups. Critically, we demonstrate that all models, including those claiming million-token context windows, exhibit performance degradation as input length increases, highlighting an unresolved challenge. Our analysis shows that explicit decomposition into a retrieve-then-solve strategy outperforms the implicit, single-step approach, and that chain-of-thought reasoning enables models to discover effective decompositions autonomously. Finally, we find that even top-performing open-weight models struggle with fundamental retrieval and copying tasks, which degrades their performance on more complex problems.
Primary Area: datasets and benchmarks
Submission Number: 4545