Keywords: reasoning, concepts
Abstract: Accuracy is the standard metric for evaluating LLM reasoning, but it conflates two distinct capabilities: understanding the underlying concept and executing it correctly. We introduce a two-phase evaluation framework that separates these concerns. A solver model attempts ConceptARC tasks requiring inductive reasoning from examples. A separate judge model evaluates only the reasoning trace, scoring conceptual understanding independently of output correctness. Across 480 evaluations (160 tasks × 3 passes), we find that 38% show a mismatch: correct answers from flawed reasoning, or incorrect answers despite sound understanding. We analyze failure patterns across concept types, finding systematic weaknesses in spatial reasoning (Cohen’s d = 1.53) and 34% inconsistency across repeated attempts. Our results suggest that accuracy alone significantly misrepresents reasoning capability, and that different failure modes require different interventions. Finally, we note a key caveat: the concepts used in benchmarks like ConceptARC are human-defined and anthropocentric, while the internal abstractions LLMs use to reason may be very different. This motivates interpreting “concept understanding” scores as alignment with benchmark taxonomies, rather than as a universal measure of conceptual structure.
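To make the two-phase protocol concrete, the sketch below shows one plausible shape of the evaluation loop: a solver produces an answer plus a reasoning trace, and a separate judge scores only the trace, so answer correctness and judged understanding can be cross-tabulated into the four cells the abstract describes. This is a minimal illustration, not the authors' implementation; all names (`Task`, `solve`, `judge`, `evaluate`) and the 0.5 score threshold are hypothetical, and the model calls are stubbed.

```python
"""Hypothetical sketch of the solver/judge evaluation loop.
Model calls are stubbed; a real run would query two separate LLMs."""

from dataclasses import dataclass
from collections import Counter


@dataclass
class Task:
    task_id: str
    concept: str          # e.g. a ConceptARC concept group such as "spatial"
    expected_output: str


def solve(task: Task) -> tuple[str, str]:
    """Phase 1 (stub): the solver model returns an answer and its
    reasoning trace."""
    return "answer", "reasoning trace ..."


def judge(trace: str, concept: str) -> float:
    """Phase 2 (stub): a separate judge model scores only the trace for
    conceptual understanding, never seeing the solver's final answer."""
    return 0.5


def evaluate(tasks: list[Task], passes: int = 3,
             threshold: float = 0.5) -> Counter:
    """Cross-tabulate answer correctness against judged understanding.
    The two off-diagonal cells are the mismatches the abstract reports."""
    outcomes: Counter = Counter()
    for task in tasks:
        for _ in range(passes):
            answer, trace = solve(task)
            correct = answer == task.expected_output
            understood = judge(trace, task.concept) >= threshold
            outcomes[(correct, understood)] += 1
    return outcomes


if __name__ == "__main__":
    counts = evaluate([Task("t1", "spatial", "answer")])
    # (True, False)  -> correct answer from flawed reasoning
    # (False, True)  -> incorrect answer despite sound understanding
    print(counts)
```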
Submission Number: 117