THiNK: Can Large Language Models Think-Aloud?

ACL ARR 2025 May Submission 2813 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Assessing higher-order thinking skills in large language models (LLMs) remains a fundamental challenge, especially in tasks that go beyond surface-level accuracy. In this work, we propose THiNK (Testing Higher-order Notion of Knowledge), a multi-agent, feedback-driven evaluation framework grounded in Bloom’s Taxonomy. THiNK frames reasoning assessment as an iterative task of problem generation, critique, and revision, encouraging LLMs to "think aloud" through step-by-step reflection and refinement. This enables a systematic evaluation of both lower-order (e.g., remember, understand) and higher-order (e.g., evaluate, create) thinking skills. We apply THiNK to seven state-of-the-art LLMs and perform a detailed cognitive analysis of their outputs. Results reveal that while models perform reliably well on lower-order categories, they struggle to apply knowledge in realistic contexts and exhibit limited abstraction. Structured feedback loops significantly improve reasoning performance, particularly for higher-order thinking skills. Qualitative evaluations further confirm that THiNK-guided outputs better align with domain logic and problem structure. Our framework, with code available at this anonymous link: https://anonymous.4open.science/r/THiNK-8F48, provides a scalable methodology for probing and enhancing LLM reasoning, offering new directions for evaluation grounded in learning science.
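
For readers who want a concrete picture of the generation–critique–revision cycle described in the abstract, the sketch below illustrates one way such a loop could be wired up. It is not the authors' implementation (see the anonymous repository for that); the agent prompts, the per-level judges, the literal "PASS" convergence check, and the `think_aloud_loop` helper are all hypothetical placeholders used purely for illustration.

```python
# Illustrative sketch of a generate-critique-revise loop in the spirit of THiNK.
# All prompts, agent roles, and stopping criteria here are hypothetical, not the
# authors' actual implementation.
from typing import Callable

# The six categories of (the revised) Bloom's Taxonomy, lower- to higher-order.
BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]


def think_aloud_loop(llm: Callable[[str], str],
                     topic: str,
                     max_rounds: int = 3) -> str:
    """Iteratively generate, critique, and revise a problem about `topic`."""
    # 1) Generation: draft a problem that targets higher-order skills.
    problem = llm(
        f"Write a problem on '{topic}' that requires students to evaluate and "
        f"create, not merely recall facts."
    )

    for _ in range(max_rounds):
        # 2) Critique: one simulated reviewer per Bloom level gives feedback.
        critiques = [
            llm(
                f"As a reviewer focused on the '{level}' level of Bloom's "
                f"Taxonomy, critique this problem. Reply PASS if it is adequate, "
                f"otherwise list concrete issues:\n{problem}"
            )
            for level in BLOOM_LEVELS
        ]

        # 3) Crude convergence check: stop once every reviewer passes the problem.
        if all("PASS" in c for c in critiques):
            break

        # 4) Revision: feed the aggregated feedback back to the generator.
        feedback = "\n".join(critiques)
        problem = llm(
            f"Revise the problem below to address this feedback.\n"
            f"Feedback:\n{feedback}\n\nProblem:\n{problem}"
        )

    return problem
```

In practice, each `llm` call would be backed by a real model API, and the critiques would be parsed into structured scores per cognitive level rather than matched against a literal token, but the loop structure conveys how iterative feedback can push generated problems toward higher-order thinking.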
Paper Type: Long
Research Area: Generation
Research Area Keywords: automatic evaluation, text-to-text generation, few-shot generation
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 2813