DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels

Published: 05 Mar 2025, Last Modified: 17 Mar 2025 · Reasoning and Planning for LLMs @ ICLR 2025 · CC BY 4.0
Keywords: benchmarking; NLP datasets; evaluation methodologies; evaluation; metrics
TL;DR: We propose DetectiveQA, a benchmark for long-context narrative reasoning, present a set of evaluation metrics for assessing the reasoning ability of large language models, and find that existing models still leave substantial room for improvement in this regard.
Abstract:

Recently, significant efforts have been devoted to enhancing the long-context capabilities of Large Language Models (LLMs), particularly in long-context reasoning. To facilitate this research, we propose DetectiveQA, a dataset specifically designed for narrative reasoning over long contexts. We leverage detective novels, averaging over 100K tokens, to create a dataset of 1,200 human-annotated questions in both Chinese and English, each paired with reference reasoning steps. Furthermore, we introduce a step-wise reasoning metric that enables a finer-grained evaluation of LLMs' reasoning processes. We validate our approach and evaluate mainstream LLMs, including GPT-4, Claude, and LLaMA, revealing persistent long-context reasoning challenges as well as evidence-retrieval difficulties and data contamination issues. Our findings offer valuable insights into the study of long-context reasoning and lay the foundation for more rigorous evaluations.

Submission Number: 53