CogSTEM: A Bloom’s Taxonomy-Grounded Benchmark for Diagnosing High-Order Capabilities in Large Language Models
Keywords: LLM Evaluation, AI for Education, STEM
Abstract: The rapid evolution of Large Language Models (LLMs) has sparked urgent demand for their integration as intelligent teaching assistants in STEM education. However, existing benchmarks often exhibit severe distributional biases, focusing disproportionately on factual recall or narrow procedural reasoning while neglecting the cognitive abilities essential in educational contexts. To address this, we introduce CogSTEM, a bilingual benchmark strictly aligned with the Revised Bloom's Taxonomy to achieve multi-dimensional equilibrium. Constructed through a rigorous human-in-the-loop annotation process, CogSTEM comprises $4{,}491$ high-quality samples that evaluate models across Disciplinary, Knowledge, and Cognitive dimensions. Our extensive evaluation reveals a critical cognitive disparity: models excel at foundational Remembering tasks yet struggle significantly with high-order Analyzing problems, a gap that persists even for SOTA models. Furthermore, we demonstrate CogSTEM's practical utility via fine-tuning: Qwen series models achieve significant gains, including a $7.90\%$ surge in high-order evaluation capabilities, without compromising general proficiency. CogSTEM thus serves as a rigorous diagnostic framework for both assessing and enhancing LLMs.
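To make the diagnostic framing concrete, the minimal sketch below shows one way per-level accuracy could be aggregated from Bloom-tagged evaluation records, surfacing the Remembering-vs-Analyzing gap the abstract describes. The field names ("cognitive_level", "correct") and record layout are illustrative assumptions, not CogSTEM's actual schema, which the abstract does not specify.

```python
# Hypothetical sketch: aggregating model accuracy by Bloom cognitive level.
# Field names are illustrative; CogSTEM's real data format is not given here.
from collections import defaultdict

def accuracy_by_level(records):
    """records: iterable of dicts with 'cognitive_level' and 'correct' keys."""
    totals = defaultdict(int)  # questions seen per cognitive level
    hits = defaultdict(int)    # questions answered correctly per level
    for r in records:
        totals[r["cognitive_level"]] += 1
        hits[r["cognitive_level"]] += int(r["correct"])
    return {level: hits[level] / totals[level] for level in totals}

# Toy example: a model strong on Remembering but weaker on Analyzing.
sample = [
    {"cognitive_level": "Remembering", "correct": True},
    {"cognitive_level": "Remembering", "correct": True},
    {"cognitive_level": "Analyzing", "correct": False},
    {"cognitive_level": "Analyzing", "correct": True},
]
print(accuracy_by_level(sample))  # {'Remembering': 1.0, 'Analyzing': 0.5}
```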
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, corpus creation, NLP datasets
Contribution Types: Data resources
Languages Studied: English, Chinese
Submission Number: 5533