Skill Decomposition and Composition: A Human-Like Evaluation Framework for Assessing LLMs' Reasoning Abilities
Abstract: Large language models (LLMs) have demonstrated remarkable reasoning capabilities across tasks such as commonsense reasoning, mathematical problem-solving, and logical deduction. However, existing evaluation methods, which rely on average accuracy or structured reasoning tasks, offer limited insight into the underlying reasoning mechanisms of LLMs: correct answers do not necessarily indicate robust reasoning, and coarse-grained metrics fail to guide meaningful improvements in reasoning performance. To address this, we propose a human-like reasoning evaluation framework inspired by skill decomposition and skill composition, two key cognitive processes in human problem-solving. We introduce a pipeline that leverages state-of-the-art LLMs to automatically annotate evaluation samples with skill labels, enabling fine-grained analysis of reasoning capabilities. Our framework refines evaluation by moving from accuracy-based measures to skill-level assessments, providing a deeper view of LLMs' reasoning processes. Experiments on diverse benchmarks reveal the strengths and limitations of LLMs' reasoning and highlight the importance of granular evaluation.
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: LLM, Human-like reasoning, Skills
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 1363
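
To make the skill-level assessment idea from the abstract concrete, the sketch below shows one way per-skill accuracy could be aggregated once evaluation samples carry LLM-annotated skill labels. This is a minimal illustration only, not the paper's pipeline: the `Sample` layout, field names, and the `skill_level_accuracy` function are hypothetical assumptions introduced here for clarity.

```python
"""Illustrative sketch: per-skill accuracy instead of a single benchmark average.

Assumes each evaluation sample already carries skill labels produced by a strong
annotator LLM; the data layout and names below are hypothetical.
"""
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    question: str
    skills: list[str]   # skill labels annotated for this sample
    is_correct: bool    # whether the evaluated model answered correctly


def skill_level_accuracy(samples: list[Sample]) -> dict[str, float]:
    """Aggregate correctness per skill label rather than over the whole benchmark."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        for skill in s.skills:
            total[skill] += 1
            correct[skill] += int(s.is_correct)
    return {skill: correct[skill] / total[skill] for skill in total}


if __name__ == "__main__":
    samples = [
        Sample("2 + 3 * 4 = ?", ["arithmetic", "operator precedence"], True),
        Sample("If all A are B and x is A, is x B?", ["syllogism"], True),
        Sample("Solve x^2 - 5x + 6 = 0", ["arithmetic", "factoring"], False),
    ]
    for skill, acc in sorted(skill_level_accuracy(samples).items()):
        print(f"{skill:>20}: {acc:.2f}")
```

A report of this form surfaces which skills a model fails on even when its overall accuracy looks acceptable, which is the kind of fine-grained signal the framework argues coarse accuracy metrics cannot provide.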