RULERv2: From Basic Retrieval to Complex Reasoning, A Bottom-Up Benchmark for Long-Context Evaluation
Keywords: Long-context, Evaluation, Benchmark, Synthetic
TL;DR: We propose RULERv2, a novel synthetic benchmark for evaluating long-context language models
Abstract: Recent advances in long-context language models have spurred the development of diverse benchmarks that often test multiple skills simultaneously, making it difficult to identify specific failure modes. To address this, we introduce RULERv2, a benchmark with a systematic difficulty progression from basic synthetic retrieval to complex multi-step reasoning across three domains: multi-key NIAH, multi-value NIAH, and multi-doc QA. We conduct a large-scale evaluation of leading models, including 7 closed-source and 26 open-weight models. Our findings reveal a notable performance gap between the two groups. Critically, we demonstrate that all models, including those claiming million-token context windows, exhibit performance degradation as input length increases, highlighting an unresolved challenge. Our analysis shows that explicit decomposition into a retrieve-then-solve strategy outperforms the implicit, single-step approach, and that chain-of-thought reasoning enables models to discover effective decompositions autonomously. Finally, we find that even top-performing open-weight models struggle with fundamental retrieval and copying tasks, which degrades their performance on more complex problems.
Primary Area: datasets and benchmarks
Submission Number: 4545