CogSTEM: A Bloom’s Taxonomy-Grounded Benchmark for Diagnosing High-Order Capabilities in Large Language Models
Keywords: LLM Evaluation, AI for Education, STEM
Abstract: The rapid evolution of Large Language Models (LLMs) has sparked urgent demand for their integration as intelligent teaching assistants in STEM education. However, existing benchmarks often exhibit severe distributional biases, focusing disproportionately on factual recall or narrow procedural reasoning while neglecting the cognitive abilities essential in educational contexts. To address this, we introduce CogSTEM, a bilingual benchmark strictly aligned with the Revised Bloom's Taxonomy to achieve multi-dimensional equilibrium. Constructed through a rigorous human-in-the-loop annotation process, CogSTEM comprises $4{,}491$ high-quality samples that evaluate models across Disciplinary, Knowledge, and Cognitive dimensions. Our extensive evaluation reveals a critical cognitive disparity: models excel at foundational Remembering tasks yet struggle significantly with high-order Analyzing problems, a gap that persists even for SOTA models. Furthermore, we demonstrate CogSTEM's practical utility via fine-tuning: Qwen series models achieve significant gains, including a $7.90\%$ surge in high-order evaluation capabilities, without compromising general proficiency. CogSTEM thus serves as a rigorous diagnostic framework for both assessing and enhancing LLMs.
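To make the diagnostic framing concrete, the minimal sketch below shows one way per-level accuracy could be aggregated from Bloom-tagged evaluation records, surfacing the Remembering-vs-Analyzing gap the abstract describes. The field names ("cognitive_level", "correct") and record layout are illustrative assumptions, not CogSTEM's actual schema, which the abstract does not specify.

```python
# Hypothetical sketch: aggregating model accuracy by Bloom cognitive level.
# Field names are illustrative; CogSTEM's real data format is not given here.
from collections import defaultdict

def accuracy_by_level(records):
    """records: iterable of dicts with 'cognitive_level' and 'correct' keys."""
    totals = defaultdict(int)  # questions seen per cognitive level
    hits = defaultdict(int)    # questions answered correctly per level
    for r in records:
        totals[r["cognitive_level"]] += 1
        hits[r["cognitive_level"]] += int(r["correct"])
    return {level: hits[level] / totals[level] for level in totals}

# Toy example: a model strong on Remembering but weaker on Analyzing.
sample = [
    {"cognitive_level": "Remembering", "correct": True},
    {"cognitive_level": "Remembering", "correct": True},
    {"cognitive_level": "Analyzing", "correct": False},
    {"cognitive_level": "Analyzing", "correct": True},
]
print(accuracy_by_level(sample))  # {'Remembering': 1.0, 'Analyzing': 0.5}
```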
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, corpus creation, NLP datasets
Contribution Types: Data resources
Languages Studied: English, Chinese
Submission Number: 5533