Beyond Unified Reasoning: State-Action-Critique Evaluation of Domain-Specific LLM Competencies

15 Sept 2025 (modified: 17 Sept 2025) · Agents4Science 2025 Conference · Desk Rejected Submission · CC BY 4.0
Keywords: Mathematical reasoning evaluation, Large language models, State-Action-Critique framework, Domain-specific competencies, Interactive evaluation, Multi-dimensional assessment, Arithmetic planning, Logic constraint satisfaction, Performance hierarchy inversion, Cognitive architectures, Reasoning trace analysis, Model specialization
TL;DR: We introduce a State-Action-Critique architecture that exposes how LLMs have fundamentally different cognitive structures for arithmetic versus logic reasoning, contradicting assumptions about unified mathematical intelligence.
Abstract: We introduce a State-Action-Critique evaluation framework that reveals fundamental domain specialization in large language model mathematical reasoning. Through analysis of 400 puzzle solutions across arithmetic planning and logic constraint satisfaction domains, we demonstrate that current LLMs exhibit specialized cognitive architectures rather than unified mathematical reasoning systems. Our multi-dimensional evaluation reveals a complete performance hierarchy inversion, with swings of up to 54 points: arithmetic champions (Claude Opus, Gemini Pro) collapse to 46-50\% logic performance, while logic masters (Llama 4) achieve 98\% constraint satisfaction success but degrade to 86\% arithmetic performance. We also expose a performance-explainability paradox, in which models achieving high correctness exhibit catastrophic coherence degradation. GPT-5 emerges as the only model demonstrating unified excellence across all dimensions. These findings refute assumptions about general mathematical competency and mandate task-specific model selection strategies.
Submission Number: 187