Abstract: Large Language Models (LLMs) have garnered significant attention worldwide owing to their increasing size and steadily improving capabilities. However, as LLMs continue to scale up, traditional benchmark datasets are becoming less effective at evaluating their reasoning skills, primarily due to issues with task difficulty and data contamination. Meanwhile, in the domain of logical reasoning, existing benchmarks often cannot isolate specific reasoning abilities and fail to provide sufficient evidence for answer derivation. To address these issues, we propose ILogicEval, a novel dataset of sentences composed of unrelated statements, which challenges LLMs to answer questions that cannot be solved from their learned knowledge alone. ILogicEval is carefully designed to incorporate rich language diversity and to assess the logical reasoning ability of LLMs independently of other reasoning skills, such as commonsense reasoning. To ensure a more reliable evaluation, we also introduce a new evaluation metric that mitigates the influence of the bias and randomness inherent in LLMs. Through experiments, we demonstrate the extent to which logical reasoning is required to answer the questions in ILogicEval and compare how well popular LLMs perform such reasoning. Together, the dataset and metric address the limitations of existing benchmarks and provide a comprehensive assessment of the logical reasoning capabilities of LLMs.
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English