DeduCE: Deductive Consistency as a Framework to Evaluate LLM Reasoning

Atharva Pandey; Kshitij Dubey; Rahul Sharma; Amit Sharma

Published: 09 Jul 2025, Last Modified: 16 Jul 2025

Venue: AI4Math@ICML25 Poster

License: CC BY-NC-SA 4.0

Keywords: Deduction, Chain of Thought, Reasoning, GSM8k

Abstract: Deductive reasoning is a key component in solving complex problems, especially those involving math and logic. Formally, deductive reasoning involves two subtasks: understanding a set of input premises and inferring the conclusions that follow from them. Recent work highlights deficiencies in the deductive reasoning of language models (LMs) by measuring final accuracy. Going beyond accuracy, we propose a metric that directly characterizes deductive reasoning, allows comparison of LMs' capabilities across both deductive subtasks, and indicates where finetuning efforts may help the most. The ideal evaluation would require access to an oracle system that can verify any candidate conclusion from an LM given input premises, which in turn requires computing the deductive closure. Instead, we propose a practical solution that requires access to only one correct solution for a problem and measures deductive consistency (DC) over varying windows of reasoning steps. By breaking up LMs' reasoning steps into dynamic windows, we can directly evaluate the two subtasks: how well do LMs understand input premises with increasing context lengths, and how well can they infer conclusions over multiple reasoning hops? Since existing benchmarks may be memorized, we also develop a pipeline to evaluate LMs' deductive consistency on novel, perturbed versions of benchmark problems. Our key result is that LMs are more robust when processing input premises of varying lengths than when inferring conclusions over a longer horizon. For instance, on datasets such as GSM-8k and ProntoQA, DC of LMs stays the same regardless of the length of prefixes; the key source of error is the number of output reasoning steps. Labeling the reasoning errors reveals that a significant fraction of them are calculation or logical errors. Applying prevalent mitigation techniques such as fine-tuning or tool use reduces some kinds of errors but cannot fully remove the decay in DC.
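The windowed evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not give the exact definition of DC. The sketch assumes that a reference solution is an ordered list of (variable, value) steps, that the first k steps are supplied as premises, and that the model is asked for the value h steps beyond that prefix. The names `deductive_consistency`, `predict`, and `toy_predict` are hypothetical, and `toy_predict` is a stand-in for a real LM call.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: one reference step = (variable name, value).
Step = Tuple[str, float]


def deductive_consistency(
    problems: List[List[Step]],
    predict: Callable[[List[Step], str], float],
    tol: float = 1e-6,
) -> Dict[Tuple[int, int], float]:
    """Return {(prefix_len, hops): fraction of predictions matching the reference}."""
    hits: Dict[Tuple[int, int], int] = defaultdict(int)
    totals: Dict[Tuple[int, int], int] = defaultdict(int)
    for steps in problems:
        n = len(steps)
        for k in range(n):                 # k reference steps given as premises
            for h in range(1, n - k + 1):  # target lies h steps beyond the prefix
                name, ref = steps[k + h - 1]
                guess = predict(steps[:k], name)
                totals[(k, h)] += 1
                hits[(k, h)] += abs(guess - ref) <= tol
    return {key: hits[key] / totals[key] for key in totals}


# Toy data and a toy "model" (assumptions for illustration only).
reference = [("a", 2.0), ("b", 4.0), ("c", 8.0)]
order = [name for name, _ in reference]
values = dict(reference)


def toy_predict(prefix: List[Step], target: str) -> float:
    # Correct only when the target is exactly one hop past the prefix.
    return values[target] if order.index(target) == len(prefix) else -1.0


dc = deductive_consistency([reference], toy_predict)
for (k, h), score in sorted(dc.items()):
    print(f"prefix={k} hops={h} DC={score:.2f}")
```

With this toy predictor, the score depends only on the hop count and not on the prefix length, which mirrors the shape of the result the abstract reports. Keying the scores by (prefix length, hops) keeps the two subtasks separable: reading along one axis varies the amount of premise context, and reading along the other varies the inference horizon.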

Submission Number: 26
