Abstract: This study aims to systematically disentangle pure logical reasoning from text understanding by contrasting abstract and contextualized logical problems drawn from a comprehensive set of domains. We explore whether LLMs demonstrate genuine reasoning capabilities across various domains when the underlying logical structure remains constant. We focus on two main questions: (1) Can abstract logical problems alone accurately benchmark LLMs' reasoning ability in real-world scenarios, disentangled from contextual support in practical settings? (2) Does fine-tuning LLMs on abstract logic problems generalize to contextualized logic problems, and vice versa? To investigate these questions, we focus on standard propositional logic, specifically propositional deductive and abductive reasoning. We construct datasets for both reasoning types with four difficulty levels across 12 distinct domains based on the Wikipedia categorization, in addition to versions with purely abstract variables. Our experiments aim to provide insights into disentangling context in logical reasoning, the genuine reasoning capabilities of LLMs, and their generalization potential. Code and data are available at \url{https://anonymous.4open.science/r/ContextHub-957E}.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: large language model, evaluation
Contribution Types: Data resources
Languages Studied: English
Submission Number: 2824