Keywords: reasoning, concepts
Abstract: Accuracy is the standard metric for evaluating LLM reasoning, but it conflates two distinct capabilities: understanding the underlying concept and executing it correctly. We introduce a two-phase evaluation framework that separates these concerns. A solver model attempts ConceptARC tasks requiring inductive reasoning from examples. A separate judge model evaluates only the reasoning trace, scoring conceptual understanding independently of output correctness. Across 480 evaluations (160 tasks × 3 passes), we find that 38% show a mismatch: correct answers from flawed reasoning, or incorrect answers despite sound understanding. We analyze failure patterns across concept types, finding systematic weaknesses in spatial reasoning (Cohen’s d = 1.53) and 34% inconsistency across repeated attempts. Our results suggest that accuracy alone significantly misrepresents reasoning capability, and that different failure modes require different interventions. Finally, we note a key caveat: the concepts used in benchmarks like ConceptARC are human-defined and anthropocentric, while the internal abstractions LLMs use to reason may be very different. This motivates interpreting “concept understanding” scores as alignment with benchmark taxonomies, rather than as a universal measure of conceptual structure.
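To make the two-phase protocol concrete, the sketch below shows one plausible shape of the evaluation loop: a solver produces an answer plus a reasoning trace, and a separate judge scores only the trace, so answer correctness and judged understanding can be cross-tabulated into the four cells the abstract describes. This is a minimal illustration, not the authors' implementation; all names (`Task`, `solve`, `judge`, `evaluate`) and the 0.5 score threshold are hypothetical, and the model calls are stubbed.

```python
"""Hypothetical sketch of the solver/judge evaluation loop.
Model calls are stubbed; a real run would query two separate LLMs."""

from dataclasses import dataclass
from collections import Counter


@dataclass
class Task:
    task_id: str
    concept: str          # e.g. a ConceptARC concept group such as "spatial"
    expected_output: str


def solve(task: Task) -> tuple[str, str]:
    """Phase 1 (stub): the solver model returns an answer and its
    reasoning trace."""
    return "answer", "reasoning trace ..."


def judge(trace: str, concept: str) -> float:
    """Phase 2 (stub): a separate judge model scores only the trace for
    conceptual understanding, never seeing the solver's final answer."""
    return 0.5


def evaluate(tasks: list[Task], passes: int = 3,
             threshold: float = 0.5) -> Counter:
    """Cross-tabulate answer correctness against judged understanding.
    The two off-diagonal cells are the mismatches the abstract reports."""
    outcomes: Counter = Counter()
    for task in tasks:
        for _ in range(passes):
            answer, trace = solve(task)
            correct = answer == task.expected_output
            understood = judge(trace, task.concept) >= threshold
            outcomes[(correct, understood)] += 1
    return outcomes


if __name__ == "__main__":
    counts = evaluate([Task("t1", "spatial", "answer")])
    # (True, False)  -> correct answer from flawed reasoning
    # (False, True)  -> incorrect answer despite sound understanding
    print(counts)
```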
Submission Number: 117