Skill Decomposition and Composition: A Human-Like Evaluation Framework for Assessing LLMs' Reasoning Abilities
Abstract: Large language models (LLMs) have demonstrated remarkable reasoning capabilities across tasks such as commonsense reasoning, mathematical problem-solving, and logical deduction. However, existing evaluation methods, which rely on average accuracy or structured reasoning tasks, offer limited insight into the underlying reasoning mechanisms of LLMs: correct answers do not necessarily indicate robust reasoning, and coarse-grained metrics fail to guide meaningful improvements in reasoning performance. To address this, we propose a human-like reasoning evaluation framework inspired by skill decomposition and skill composition, two key cognitive processes in human problem-solving. We introduce a pipeline that leverages state-of-the-art LLMs to automatically annotate evaluation samples with skill labels, enabling fine-grained analysis of reasoning capabilities. Our framework refines evaluation by moving from accuracy-based measures to skill-level assessments, providing a deeper view of LLMs' reasoning processes. Experiments on diverse benchmarks reveal the strengths and limitations of LLMs' reasoning and highlight the importance of granular evaluation.
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: LLM, Human-like reasoning, Skills
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 1363
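
To make the skill-level assessment idea from the abstract concrete, the sketch below shows one way per-skill accuracy could be aggregated once evaluation samples carry LLM-annotated skill labels. This is a minimal illustration only, not the paper's pipeline: the `Sample` layout, field names, and the `skill_level_accuracy` function are hypothetical assumptions introduced here for clarity.

```python
"""Illustrative sketch: per-skill accuracy instead of a single benchmark average.

Assumes each evaluation sample already carries skill labels produced by a strong
annotator LLM; the data layout and names below are hypothetical.
"""
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    question: str
    skills: list[str]   # skill labels annotated for this sample
    is_correct: bool    # whether the evaluated model answered correctly


def skill_level_accuracy(samples: list[Sample]) -> dict[str, float]:
    """Aggregate correctness per skill label rather than over the whole benchmark."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        for skill in s.skills:
            total[skill] += 1
            correct[skill] += int(s.is_correct)
    return {skill: correct[skill] / total[skill] for skill in total}


if __name__ == "__main__":
    samples = [
        Sample("2 + 3 * 4 = ?", ["arithmetic", "operator precedence"], True),
        Sample("If all A are B and x is A, is x B?", ["syllogism"], True),
        Sample("Solve x^2 - 5x + 6 = 0", ["arithmetic", "factoring"], False),
    ]
    for skill, acc in sorted(skill_level_accuracy(samples).items()):
        print(f"{skill:>20}: {acc:.2f}")
```

A report of this form surfaces which skills a model fails on even when its overall accuracy looks acceptable, which is the kind of fine-grained signal the framework argues coarse accuracy metrics cannot provide.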