Measuring Reasoning in LLMs: A New Dialectical Angle

ICLR 2026 Conference Submission 20389 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: LLM, Reasoning, Dialectics, Language Models, Evaluation
TL;DR: We introduce SIEV, a structured, process-driven approach to evaluating LLM reasoning, exposing weaknesses that accuracy-only evaluations fail to detect.
Abstract: What does it truly mean for a language model to “reason”? Most current evaluations and benchmarks reward correct standalone answers, but correctness alone reveals little about the process that produced them. In this work, we explore a different perspective: reasoning is not a static chain of steps but a dynamic trajectory in which ideas interact, clash, and evolve into deeper insights. To capture this dynamic, we draw on a well-established philosophical tradition: dialectics, in which reasoning unfolds through thesis, antithesis, and synthesis. Building on this, we present SIEV, a structured framework that evaluates the reasoning of LLMs through dialectics. Unlike conventional evaluations, SIEV assesses not only the conclusion a model reaches but also how it gets there: its ability to resolve tension, integrate distinct ideas, and synthesize higher-order reasoning. This lens uncovers significant reasoning gaps in state-of-the-art models even on saturated benchmarks such as GSM and MMLU. For instance, GPT-5-chat, a recent model, loses over 40 points (out of 100) when evaluated with SIEV on GSM. Our findings highlight that a process-oriented, philosophically grounded approach enables a deeper, more rigorous, and more discriminative assessment of LLM reasoning.
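
To make the process-oriented idea concrete, below is a minimal sketch of what a dialectical (thesis, antithesis, synthesis) evaluation loop could look like in code. It is purely illustrative: the abstract does not specify SIEV's prompts, stages, or scoring rubric, so the `ask` helper, the prompt wording, and the toy score weights are assumptions, not the authors' implementation.

```python
"""Illustrative sketch of a dialectical, process-oriented evaluation loop.
Everything here (function names, prompts, weights) is an assumption made for
illustration; it is not SIEV's actual procedure."""
from dataclasses import dataclass
from typing import Callable


@dataclass
class DialecticalTrace:
    thesis: str      # the model's initial answer and rationale
    antithesis: str  # a challenge or counter-argument to that answer
    synthesis: str   # the model's attempt to reconcile both into a final answer


def run_dialectic(question: str, ask: Callable[[str], str]) -> DialecticalTrace:
    """Elicit a thesis -> antithesis -> synthesis trajectory from a model.

    `ask` is a hypothetical stand-in for any chat-completion call."""
    thesis = ask(f"Question: {question}\nGive your answer with step-by-step reasoning.")
    antithesis = ask(
        f"Question: {question}\nProposed answer:\n{thesis}\n"
        "Raise the strongest objection or alternative line of reasoning."
    )
    synthesis = ask(
        f"Question: {question}\nThesis:\n{thesis}\nAntithesis:\n{antithesis}\n"
        "Resolve the tension between these positions and give a final, justified answer."
    )
    return DialecticalTrace(thesis, antithesis, synthesis)


def score_trace(trace: DialecticalTrace, reference: str) -> float:
    """Toy process-level score: part of the credit depends on the trajectory,
    not only on final-answer correctness. A real rubric would be far richer
    (e.g., judging how well the synthesis integrates the antithesis)."""
    final_correct = reference.strip().lower() in trace.synthesis.lower()
    engaged_with_objection = len(trace.antithesis.split()) > 20  # crude placeholder proxy
    return 60.0 * final_correct + 40.0 * engaged_with_objection
```

The point of the sketch is the contrast with accuracy-only evaluation: the final answer contributes only part of the score, while the rest is tied to how the model handles the intermediate clash of ideas.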
Primary Area: datasets and benchmarks
Submission Number: 20389