Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis

ACL ARR 2026 January Submission9568 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0
Keywords: multi-hop question answering, reasoning language models, evaluation
Abstract: The emergence of reasoning models and their integration into AI chatbots have led to breakthroughs in solving complex problems that require multi-step thought processes. Yet a complete understanding of their reasoning error patterns is still missing. In this paper, we systematically investigate the reasoning failures of contemporary language models on multi-hop question answering tasks. We develop a novel, nuanced error analysis framework that evaluates the relevance and completeness of generated reasoning steps through four complementary dimensions: hop precision and hop recall over evidence contexts; overthinking, which captures redundant or unnecessary reasoning; and question misinterpretation, which reflects failures that arise before reasoning begins. Through rigorous human annotation, supported by complementary automated metrics, our analysis uncovers intricate error patterns often hidden by accuracy-centric evaluations. This investigation provides deeper insight into the cognitive limitations of current models and offers actionable guidance for enhancing the reasoning quality, transparency, and robustness of future language models.
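To make the first two dimensions concrete, here is a minimal sketch of how hop precision and hop recall over evidence contexts could be computed, with over-reasoning counted as a byproduct. The exact definitions in the paper are not reproduced here; this sketch assumes each reasoning step is matched to an evidence-context id, which is an illustrative simplification.

```python
# Hedged sketch (not the paper's implementation): hop precision/recall
# as set overlap between the evidence contexts a model's reasoning
# steps touch and the gold contexts required to answer the question.

def hop_metrics(predicted_hops: set, gold_contexts: set):
    """Return (hop precision, hop recall, extra-hop count)."""
    hits = predicted_hops & gold_contexts
    precision = len(hits) / len(predicted_hops) if predicted_hops else 0.0
    recall = len(hits) / len(gold_contexts) if gold_contexts else 0.0
    # Extra hops over contexts outside the gold evidence: one plausible
    # proxy for the "overthinking" dimension described above.
    extra_hops = len(predicted_hops - gold_contexts)
    return precision, recall, extra_hops
```

For example, a model that reasons over contexts {c1, c2, c3} when only {c1, c2} are required would score perfect hop recall but imperfect hop precision, with one redundant hop.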
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: Generation, Interpretability and Analysis of Models for NLP, Language Modeling, Question Answering, Resources and Evaluation
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 9568