Abstract: Large Language Models (LLMs) increasingly exhibit advanced abilities, enabled by techniques such as chain-of-thought prompting and test-time deliberation. However, they continue to struggle with tasks that demand complex reasoning, prompting debate over whether their outputs reflect genuine reasoning processes or merely statistical pattern generation. These difficulties stem in part from the absence of a unified framework for explaining and assessing reasoning in LLMs, which limits our ability to diagnose errors, establish bounds, and design effective interventions. In this paper, we propose a normative framework that characterizes reasoning as probabilistic inference over propositions, and we show how this abstraction can be instantiated in LLMs. Within this framework, we provide a typology of reasoning modes, formalize success criteria for proposition-level correctness, and derive a taxonomy of failure modes. For each failure class, we map model-level requirements to LLM-level implementation constraints and identify potential remedies. Finally, we outline a roadmap for improving proposition-level accuracy under tractable approximations. Our contribution is both diagnostic and prescriptive: an account of what it means for LLMs to reason, where and why current systems fail, and how to close the gap.
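As a minimal illustrative sketch (the notation below is ours and is not defined in the paper), "probabilistic inference over propositions" can be read as asking the model's induced distribution $\hat{P}$ to approximate a target inference distribution $P$ over a conclusion proposition $c$ given premise propositions $p_1, \dots, p_n$:

% Sketch only: c, p_1..p_n, \hat{P}, P, and \epsilon are assumed notation, not the paper's definitions.
\[
\hat{P}(c \mid p_1, \dots, p_n) \;\approx\; P(c \mid p_1, \dots, p_n),
\]

with a proposition-level success criterion of the form $\bigl|\hat{P}(c \mid p_1, \dots, p_n) - P(c \mid p_1, \dots, p_n)\bigr| \le \epsilon$ for some tolerance $\epsilon$.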
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Geoff_Pleiss1
Submission Number: 6642