Unreasonable effectiveness of LLM reasoning: a doubly cautionary tale of temporal question-answering

TMLR Paper5655 Authors

16 Aug 2025 (modified: 27 Aug 2025) · Under review for TMLR · CC BY 4.0
Abstract: The remarkable success of Large Language Models in modeling both the syntax and the semantics of language has prompted a body of research into language-adjacent abilities, most notably commonsense reasoning. As LLMs' performance continues to advance on successive benchmarks, we turn to temporal reasoning, which lags somewhat behind other tasks owing to its more complex logic. We start from previous work, whose authors successfully induce (apparent) reasoning by breaking the problem down into a two-step procedure of temporal graph extraction and subsequent reasoning. Specifically, in the first step an LLM is prompted to parse a natural language description into a semi-structured timeline of events; in the second step, it is given the extracted timeline and prompted to answer a temporal reasoning question. We conjecture that this procedure presents two separate opportunities for introducing errors, and further hypothesise that a neuro-symbolic approach should help. We follow the recent trend of using external executors in concert with LLMs to carry out exact reasoning and verification. We see the reasoning step of the original two-step procedure as a natural target for a symbolic solver and design a rule-based solution for temporal question-answering, drawing on ideas from Allen's Interval Algebra. To our surprise, we find that our rule-based reasoner does not improve on the previously reported, purely neural solution. It appears that both our approach and the previous method operate near the limit of achievable performance imposed by the correctness of the information extraction step. Such a result seems to suggest that a non-symbolic LLM is capable of symbolic-level reasoning, although upon further investigation we discover this not to be the case. It is not that the neural solution makes no reasoning mistakes, but rather that the LLM manages to compensate for some of its erroneous replies by "short-cutting" to the correct answer in other questions; in other words, not reasoning but guessing. Although the effect is not pronounced performance-wise, we feel it is conceptually important: as we argue, the production of correct answers is not, in itself, a measure of reasoning.
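
To make the second (reasoning) step concrete, the sketch below shows how a rule-based temporal reasoner over an extracted timeline might look, using a handful of Allen's interval relations. This is an illustrative assumption, not the paper's actual solver: the `Event` structure, the collapsed relation set, and the `answer_before` query are hypothetical simplifications (Allen's full algebra has 13 relations).

```python
# Minimal sketch (assumed, not the authors' implementation): answering a
# "did X happen before Y?" question symbolically from an extracted timeline.
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str
    start: float  # any totally ordered timestamp (e.g. a year), start <= end
    end: float


def allen_relation(a: Event, b: Event) -> str:
    """Return a (simplified) Allen relation between intervals a and b.
    Several of Allen's 13 relations are collapsed into 'during' for brevity."""
    if a.end < b.start:
        return "before"
    if a.end == b.start:
        return "meets"
    if a.start == b.start and a.end == b.end:
        return "equal"
    if a.start >= b.start and a.end <= b.end:
        return "during"
    if a.start < b.start < a.end < b.end:
        return "overlaps"
    # Any remaining case is the inverse of one of the cases above.
    return "inverse-of(" + allen_relation(b, a) + ")"


def answer_before(timeline: dict[str, Event], x: str, y: str) -> bool:
    """Rule-based answer to 'Did x end before y started?'."""
    return allen_relation(timeline[x], timeline[y]) == "before"


if __name__ == "__main__":
    # Hypothetical timeline, as might be produced by the extraction step.
    timeline = {
        "studied_in_Paris": Event("studied_in_Paris", 1998, 2002),
        "worked_at_ACME": Event("worked_at_ACME", 2003, 2010),
    }
    print(answer_before(timeline, "studied_in_Paris", "worked_at_ACME"))  # True
```

The point of such a solver is that, given a correct timeline, its answers are exact; any remaining errors in the end-to-end system can then be attributed to the extraction step rather than to reasoning.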
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Greg_Durrett1
Submission Number: 5655