Piecing Together Clues: A Benchmark for Evaluating the Detective Skills of Large Language Models

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Detectives frequently engage in information detection and reasoning simultaneously when making decisions across various cases, especially when confronted with a vast amount of information. With the rapid development of large language models~(LLMs), evaluating how these models identify key information and reason to solve questions becomes increasingly relevant. We introduce the DetectBench, a reading comprehension dataset designed to assess a model's joint ability in key information detection and multi-hop reasoning when facing complex and implicit information. The DetectBench comprises 3,928 questions, each paired with a paragraph averaging 190 tokens in length. To enhance the models' detective skills, we propose the Self-Question Framework, which encourages models to identify all possible clues within the context before reasoning. Our experiments reveal that existing models perform poorly in both information detection and multi-hop reasoning. However, the Self-Question Framework alleviates this issue.
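The abstract describes the Self-Question Framework only at a high level: prompt the model to surface all possible clues before reasoning toward an answer. A minimal sketch of how such a two-stage prompting scheme might look is given below; the prompt wording and the `call_llm` helper are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a two-stage "identify clues, then reason" prompting scheme,
# in the spirit of the Self-Question Framework described in the abstract.
# The prompt wording and the call_llm placeholder are assumptions for illustration.

def call_llm(prompt: str) -> str:
    """Placeholder for any chat/completion API call; assumed to return model text."""
    raise NotImplementedError("Plug in your preferred LLM client here.")

def self_question_answer(context: str, question: str) -> str:
    # Stage 1: ask the model to surface every potentially relevant clue first.
    clue_prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Before answering, list every clue in the context that might be relevant, "
        "one per line."
    )
    clues = call_llm(clue_prompt)

    # Stage 2: reason over the extracted clues to produce the final answer.
    answer_prompt = (
        f"Context:\n{context}\n\n"
        f"Clues:\n{clues}\n\n"
        f"Question: {question}\n\n"
        "Using the clues above, reason step by step and give the final answer."
    )
    return call_llm(answer_prompt)
```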
Paper Type: long
Research Area: Question Answering
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English, Chinese