ArgQA: Evaluation of Reasoning Over Elementary Logical Structures in Arguments

17 Sept 2025 (modified: 15 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Logical Reasoning Benchmark, LLM evaluation
TL;DR: We present ArgQA, a novel dataset of multiple-choice questions to assess logical reasoning over elementary logical structures, based on authentic arguments from four distinct domains.
Abstract: As large language models advance in their reasoning capabilities, their adequate evaluation is becoming increasingly important. Existing logical reasoning benchmarks are typically constructed by automatically converting symbolic logic into natural language or by curating questions from standardized exams such as the LSAT. However, both synthetic and exam-style questions are composed of unnatural language, thereby limiting their applicability to real-world contexts. Moreover, the systematic assessment of reasoning over diverse logical structures remains underexplored. We therefore present ArgQA, a novel dataset of 3,807 multiple-choice questions based on authentic arguments from four distinct domains—product reviews, argumentative essays, e-rulemaking comments, and medical research abstracts. Each question is designed to assess the ability to recognize and reconstruct one of three elementary logical structures—linear, convergent, and divergent—whose understanding is a prerequisite to both simple and complex reasoning. Experiments show that even the strongest LLMs still have considerable room for improvement, with overall 9-shot accuracy ranging from 29.2% (Qwen-2) to 61.8% (GPT-o3).
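The abstract describes multiple-choice items labeled by domain and logical structure, scored by accuracy. Below is a minimal Python sketch of what such an item and its accuracy scoring could look like; the class name, field names, and label strings are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ArgQAItem:
    """One hypothetical multiple-choice item (field names are assumptions)."""
    argument: str        # authentic source argument text
    question: str        # question probing the argument's logical structure
    options: List[str]   # answer candidates
    answer_index: int    # index of the gold option
    domain: str          # e.g. "product_review", "essay", "e_rulemaking", "medical_abstract"
    structure: str       # "linear", "convergent", or "divergent"

def accuracy(items: List[ArgQAItem], predictions: List[int]) -> float:
    """Fraction of items whose predicted option index matches the gold answer."""
    if not items:
        return 0.0
    correct = sum(pred == item.answer_index for item, pred in zip(items, predictions))
    return correct / len(items)
```

Per-structure or per-domain breakdowns would follow the same pattern by filtering items on the `structure` or `domain` field before computing accuracy.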
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 8171