Track: long paper (up to 10 pages)
Keywords: reasoning, large language models
Abstract: Accuracy is the standard metric for evaluating LLM reasoning, but it conflates two distinct capabilities: understanding the underlying concept and executing it correctly. We introduce a two-phase evaluation framework that separates these concerns. A solver model attempts ConceptARC tasks requiring inductive reasoning from examples; a separate judge model then evaluates only the reasoning trace, scoring conceptual understanding independently of output correctness. Across 480 evaluations (160 tasks × 3 passes), we find that 38% show a mismatch: correct answers from flawed reasoning, or incorrect answers despite sound understanding. Analyzing failure patterns across concept types, we find systematic weaknesses in spatial reasoning (Cohen’s d = 1.53) and inconsistent outcomes across repeated attempts in 34% of cases. Our results suggest that accuracy alone substantially misrepresents reasoning capability and that different failure modes call for different interventions.
Anonymization: This submission has been anonymized for double-blind review by removing identifying information such as names, affiliations, and identifying URLs.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 65